[jira] [Commented] (SPARK-8503) SizeEstimator returns negative value for recursive data structures
[ https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595014#comment-14595014 ] Sean Owen commented on SPARK-8503: -- That looks close to Long.MIN_VALUE. It could be overflow, but I suspect something in the code returns Long.MIN_VALUE as a size to mean some kind of error condition, and this is being added to a sum. This doesn't really have any info that helps reproduce it, though. If you can't describe your exact data structure, what does debugging reveal about when and where this happens? SizeEstimator returns negative value for recursive data structures -- Key: SPARK-8503 URL: https://issues.apache.org/jira/browse/SPARK-8503 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Ilya Rakitsin When estimating the size of recursive data structures like graphs, with transient fields referencing one another, SizeEstimator may return a negative value if the structure is big enough. This then affects the logic of other components, e.g. SizeTracker#takeSample(), and may lead to incorrect behavior and exceptions like: java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8511) Modify a test to remove a saved model in `regression.py`
[ https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-8511: --- Description: According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using {{shutil.rmtree()}} would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 was: According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using `shutil.rmtree()` would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 Modify a test to remove a saved model in `regression.py` Key: SPARK-8511 URL: https://issues.apache.org/jira/browse/SPARK-8511 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Yu Ishikawa According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using {{shutil.rmtree()}} would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
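A minimal sketch of the suggested cleanup pattern (hypothetical test code, not the actual contents of regression.py; it assumes a SparkContext {{sc}} and a model with PySpark's usual save/load API):

{code}
import shutil
import tempfile

from pyspark.mllib.regression import LinearRegressionModel

def check_save_load(sc, model):
    path = tempfile.mkdtemp()
    try:
        model.save(sc, path)
        same_model = LinearRegressionModel.load(sc, path)
        assert same_model.weights == model.weights
    finally:
        # os.removedirs(path) would fail here because the directory is not
        # empty; shutil.rmtree() removes the whole tree regardless.
        shutil.rmtree(path, ignore_errors=True)
{code}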
[jira] [Commented] (SPARK-6112) Provide external block store support through HDFS RAM_DISK
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595034#comment-14595034 ] Bogdan Ghit commented on SPARK-6112: This is my configuration: 1. My tmpfs is mounted on /dev/shm. 2. dfs.datanode.data.dir = /local/bghit/myhdfs,[RAM_DISK]/dev/shm/ramdisk 3. dfs.datanode.max.locked.memory=10 The amount of memory I can lock is set in /etc/security/limits.conf to unlimited, so ulimit -l outputs unlimited. However, I get the exception "Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) is greater than zero and native code is not available." Any ideas why? Regarding my previous comment, the documentation still has offHeap instead of externalBlock. Provide external block store support through HDFS RAM_DISK -- Key: SPARK-6112 URL: https://issues.apache.org/jira/browse/SPARK-6112 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Zhan Zhang Attachments: SparkOffheapsupportbyHDFS.pdf The HDFS Lazy_Persist policy provides the possibility to cache RDDs OFF_HEAP in HDFS. We may want to provide a capability similar to Tachyon by leveraging the HDFS RAM_DISK feature, if the user environment does not have Tachyon deployed. With this feature, it potentially becomes possible to share RDDs in memory across different jobs, and even with jobs other than Spark, and to avoid RDD recomputation if executors crash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8511) Modify a test to remove a saved model in `regression.py`
Yu Ishikawa created SPARK-8511: -- Summary: Modify a test to remove a saved model in `regression.py` Key: SPARK-8511 URL: https://issues.apache.org/jira/browse/SPARK-8511 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Yu Ishikawa According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using `shutil.rmtree()` would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8511) Modify a test to remove a saved model in `regression.py`
[ https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8511: --- Assignee: (was: Apache Spark) Modify a test to remove a saved model in `regression.py` Key: SPARK-8511 URL: https://issues.apache.org/jira/browse/SPARK-8511 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Yu Ishikawa According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using {{shutil.rmtree()}} would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8472) Python API for DCT
[ https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595125#comment-14595125 ] Yu Ishikawa commented on SPARK-8472: Alright. Thanks! Python API for DCT -- Key: SPARK-8472 URL: https://issues.apache.org/jira/browse/SPARK-8472 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Priority: Minor We need to implement a wrapper for enabling the DCT feature transformer to be used from the Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8511) Modify a test to remove a saved model in `regression.py`
[ https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8511: --- Assignee: Apache Spark Modify a test to remove a saved model in `regression.py` Key: SPARK-8511 URL: https://issues.apache.org/jira/browse/SPARK-8511 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Yu Ishikawa Assignee: Apache Spark According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using {{shutil.rmtree()}} would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8511) Modify a test to remove a saved model in `regression.py`
[ https://issues.apache.org/jira/browse/SPARK-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595124#comment-14595124 ] Apache Spark commented on SPARK-8511: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/6926 Modify a test to remove a saved model in `regression.py` Key: SPARK-8511 URL: https://issues.apache.org/jira/browse/SPARK-8511 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Yu Ishikawa According to the Python reference documentation, {{os.removedirs}} doesn't work if there are any files in the directory we want to remove. https://docs.python.org/3/library/os.html#os.removedirs Instead, using {{shutil.rmtree()}} would be a better way to remove the temporary directory used to test model saving. https://github.com/apache/spark/blob/branch-1.4/python%2Fpyspark%2Fmllib%2Fregression.py#L137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595150#comment-14595150 ] Shivaram Venkataraman commented on SPARK-6813: -- Yeah, we could disable those two checks, but that would also disable them for other things like `if-else` blocks, where we want to encourage the style of ending the line with `{` instead of having long one-liners. I guess the line length limit will prevent some of this. Let's see if there are any other opinions on this. SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style checking tool. https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of its rules based on the discussion in Spark: 1. Line length: maximum 100 characters 2. No limit on function name length (the API should be similar to other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8510) Store and read NumPy arrays and matrices as values in sequence files
Peter Aberline created SPARK-8510: - Summary: Store and read NumPy arrays and matrices as values in sequence files Key: SPARK-8510 URL: https://issues.apache.org/jira/browse/SPARK-8510 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Peter Aberline Priority: Minor I have extended the provided DoubleArrayWritable example code to store NumPy double-type arrays and matrices as arrays of doubles and nested arrays of doubles. Pandas DataFrames can be easily converted to NumPy matrices, so I've also added the ability to store the schema-less data from DataFrames and Series that contain double data. Other than my own use, there seems to be demand for this functionality: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E I'll be issuing a PR for this shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()
[ https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595048#comment-14595048 ] Kai Sasaki commented on SPARK-8419: --- In {{Statistics#colStats}}, the number of rows seems to be updated in {{computeColumnSummaryStatistics}} with {{updateNumRows}}. This is computed in a distributed fashion inside {{RDD#treeAggregate}}. So I think there is no extra {{count()}} when simply creating a {{RowMatrix}}. Is this assumption correct? Statistics.colStats could avoid an extra count() Key: SPARK-8419 URL: https://issues.apache.org/jira/browse/SPARK-8419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Statistics.colStats goes through RowMatrix to compute the stats. But RowMatrix.computeColumnSummaryStatistics does an extra count() which could be avoided. Not going through RowMatrix would skip this extra pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
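For reference, a small PySpark usage sketch (assuming a SparkContext {{sc}} is available, as in the shell): the summary returned by {{colStats}} already carries the row count from the same aggregation pass, so no separate {{count()}} is needed.

{code}
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

rdd = sc.parallelize([Vectors.dense([1.0, 10.0]),
                      Vectors.dense([2.0, 20.0]),
                      Vectors.dense([3.0, 30.0])])

summary = Statistics.colStats(rdd)  # one pass over the data
print(summary.count())     # 3 -- row count comes from the same aggregation
print(summary.mean())      # [ 2., 20.]
print(summary.variance())  # [ 1., 100.]
{code}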
[jira] [Updated] (SPARK-8510) Support NumPy arrays and matrices as values in sequence files
[ https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Aberline updated SPARK-8510: -- Summary: Support NumPy arrays and matrices as values in sequence files (was: Store and read NumPy arrays and matrices as values in sequence files) Support NumPy arrays and matrices as values in sequence files - Key: SPARK-8510 URL: https://issues.apache.org/jira/browse/SPARK-8510 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Peter Aberline Priority: Minor I have extended the provided DoubleArrayWritable example code to store NumPy double-type arrays and matrices as arrays of doubles and nested arrays of doubles. Pandas DataFrames can be easily converted to NumPy matrices, so I've also added the ability to store the schema-less data from DataFrames and Series that contain double data. Other than my own use, there seems to be demand for this functionality: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E I'll be issuing a PR for this shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8472) Python API for DCT
[ https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595119#comment-14595119 ] Joseph K. Bradley commented on SPARK-8472: -- Let's wait until its dependency is complete. Python API for DCT -- Key: SPARK-8472 URL: https://issues.apache.org/jira/browse/SPARK-8472 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Priority: Minor We need to implement a wrapper for enabling the DCT feature transformer to be used from the Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8115) Remove TestData
[ https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595052#comment-14595052 ] Benjamin Fradet commented on SPARK-8115: I've started working on this. Remove TestData --- Key: SPARK-8115 URL: https://issues.apache.org/jira/browse/SPARK-8115 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Minor TestData was from the era when we didn't have easy ways to generate test datasets. Now that we have implicits on Seq + toDF, it'd make more sense to put the test datasets closer to the test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8503) SizeEstimator returns negative value for recursive data structures
[ https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595060#comment-14595060 ] Ilya Rakitsin commented on SPARK-8503: -- The structure is a simple cyclic graph, much as you would imagine it: public abstract class Edge implements Serializable { private static final long serialVersionUID = MavenVersion.VERSION.getUID(); private int id; protected Vertex fromv; protected Vertex tov; ... } public abstract class Vertex implements Serializable, Cloneable { private String name; private transient Edge[] incoming = new Edge[0]; private transient Edge[] outgoing = new Edge[0]; ... } So, as you can see, the edges in a vertex are transient, so they are serialized correctly (basically, not serialized) when using Kryo or regular serialization. But when broadcasting, the size is computed in an eternal loop until it's negative (at least it seems that way) due to cycles in the graph and transient edges not being handled. Does this help? Another issue is that in SizeTracker#takeSample() the negative value returned by the estimator is not handled either. Do you think this could be a separate issue, or could you investigate it as well? Hope this helps. SizeEstimator returns negative value for recursive data structures -- Key: SPARK-8503 URL: https://issues.apache.org/jira/browse/SPARK-8503 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Ilya Rakitsin When estimating the size of recursive data structures like graphs, with transient fields referencing one another, SizeEstimator may return a negative value if the structure is big enough. This then affects the logic of other components, e.g. SizeTracker#takeSample(), and may lead to incorrect behavior and exceptions like: java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
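To illustrate the suspected mechanism (a conceptual Python sketch, not Spark's actual SizeEstimator code): a recursive size walk over a cyclic object graph has to track already-visited objects, otherwise the traversal keeps revisiting the cycle and the running sum blows up.

{code}
import sys

class Node(object):
    def __init__(self, name):
        self.name = name
        self.neighbors = []

def estimate_size(obj, visited=None):
    # Track identities of already-counted objects. Without this check, the
    # cycle a -> b -> a below would make the walk recurse forever and the
    # accumulated size grow without bound (eventually overflowing a fixed-width
    # counter, which is one way to end up with a negative total).
    if visited is None:
        visited = set()
    if id(obj) in visited:
        return 0
    visited.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, Node):
        size += estimate_size(obj.neighbors, visited)
    elif isinstance(obj, list):
        size += sum(estimate_size(x, visited) for x in obj)
    return size

a, b = Node("a"), Node("b")
a.neighbors.append(b)
b.neighbors.append(a)  # cycle
print(estimate_size(a))  # finite: each object is counted exactly once
{code}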
[jira] [Comment Edited] (SPARK-8503) SizeEstimator returns negative value for recursive data structures
[ https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595060#comment-14595060 ] Ilya Rakitsin edited comment on SPARK-8503 at 6/21/15 2:28 PM: --- The structure is a simple cycled graph, like you would imagine it: {code} public abstract class Edge implements Serializable { private static final long serialVersionUID = MavenVersion.VERSION.getUID(); private int id; protected Vertex fromv; protected Vertex tov; ... } {code} {code} public abstract class Vertex implements Serializable, Cloneable { private String name; private transient Edge[] incoming = new Edge[0]; private transient Edge[] outgoing = new Edge[0]; ... } {code} So, as you can see, edges in vertex are transient, so are serialized correctly (basically, not serialized) when using kryo or regular serialization. But when broadcasting, size is computed in a eternal loop until it's negative (at least it seems that way) due to cycles in the graph and transient edges not being handled. Does this help? Another issue is that in SizeTracker#takeSample() negative value returned by the estimator is not handled as well. Do you think this could be a separate issue, or could you investigate it as well? Hope this helps. was (Author: irakitin): The structure is a simple cycled graph, like you would imagine it: {code} public abstract class Edge implements Serializable { private static final long serialVersionUID = MavenVersion.VERSION.getUID(); private int id; protected Vertex fromv; protected Vertex tov; ... } {code} public abstract class Vertex implements Serializable, Cloneable { private String name; private transient Edge[] incoming = new Edge[0]; private transient Edge[] outgoing = new Edge[0]; ... } So, as you can see, edges in vertex are transient, so are serialized correctly (basically, not serialized) when using kryo or regular serialization. But when broadcasting, size is computed in a eternal loop until it's negative (at least it seems that way) due to cycles in the graph and transient edges not being handled. Does this help? Another issue is that in SizeTracker#takeSample() negative value returned by the estimator is not handled as well. Do you think this could be a separate issue, or could you investigate it as well? Hope this helps. SizeEstimator returns negative value for recursive data structures -- Key: SPARK-8503 URL: https://issues.apache.org/jira/browse/SPARK-8503 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Ilya Rakitsin When estimating size of recursive data structures like graphs, with transient fields referencing one another, SizeEstimator may return negative value if the structure if big enough. This then affects the logic of other components, e.g. 
SizeTracker#takeSample() and may lead to incorrect behavior and exceptions like: java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8503) SizeEstimator returns negative value for recursive data structures
[ https://issues.apache.org/jira/browse/SPARK-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595060#comment-14595060 ] Ilya Rakitsin edited comment on SPARK-8503 at 6/21/15 2:28 PM: --- The structure is a simple cycled graph, like you would imagine it: {code} public abstract class Edge implements Serializable { private static final long serialVersionUID = MavenVersion.VERSION.getUID(); private int id; protected Vertex fromv; protected Vertex tov; ... } {code} public abstract class Vertex implements Serializable, Cloneable { private String name; private transient Edge[] incoming = new Edge[0]; private transient Edge[] outgoing = new Edge[0]; ... } So, as you can see, edges in vertex are transient, so are serialized correctly (basically, not serialized) when using kryo or regular serialization. But when broadcasting, size is computed in a eternal loop until it's negative (at least it seems that way) due to cycles in the graph and transient edges not being handled. Does this help? Another issue is that in SizeTracker#takeSample() negative value returned by the estimator is not handled as well. Do you think this could be a separate issue, or could you investigate it as well? Hope this helps. was (Author: irakitin): The structure is a simple cycled graph, like you would imagine it: public abstract class Edge implements Serializable { private static final long serialVersionUID = MavenVersion.VERSION.getUID(); private int id; protected Vertex fromv; protected Vertex tov; ... } public abstract class Vertex implements Serializable, Cloneable { private String name; private transient Edge[] incoming = new Edge[0]; private transient Edge[] outgoing = new Edge[0]; ... } So, as you can see, edges in vertex are transient, so are serialized correctly (basically, not serialized) when using kryo or regular serialization. But when broadcasting, size is computed in a eternal loop until it's negative (at least it seems that way) due to cycles in the graph and transient edges not being handled. Does this help? Another issue is that in SizeTracker#takeSample() negative value returned by the estimator is not handled as well. Do you think this could be a separate issue, or could you investigate it as well? Hope this helps. SizeEstimator returns negative value for recursive data structures -- Key: SPARK-8503 URL: https://issues.apache.org/jira/browse/SPARK-8503 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Ilya Rakitsin When estimating size of recursive data structures like graphs, with transient fields referencing one another, SizeEstimator may return negative value if the structure if big enough. This then affects the logic of other components, e.g. 
SizeTracker#takeSample() and may lead to incorrect behavior and exceptions like: java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036854691384 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:810) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:991) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8419) Statistics.colStats could avoid an extra count()
[ https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-8419. Resolution: Not A Problem Statistics.colStats could avoid an extra count() Key: SPARK-8419 URL: https://issues.apache.org/jira/browse/SPARK-8419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Statistics.colStats goes through RowMatrix to compute the stats. But RowMatrix.computeColumnSummaryStatistics does an extra count() which could be avoided. Not going through RowMatrix would skip this extra pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8419) Statistics.colStats could avoid an extra count()
[ https://issues.apache.org/jira/browse/SPARK-8419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595120#comment-14595120 ] Joseph K. Bradley commented on SPARK-8419: -- Oops, I misread what the count() was being called on. Thanks! I'll close this. Statistics.colStats could avoid an extra count() Key: SPARK-8419 URL: https://issues.apache.org/jira/browse/SPARK-8419 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Statistics.colStats goes through RowMatrix to compute the stats. But RowMatrix.computeColumnSummaryStatistics does an extra count() which could be avoided. Not going through RowMatrix would skip this extra pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8514) LU factorization on BlockMatrix
Xiangrui Meng created SPARK-8514: Summary: LU factorization on BlockMatrix Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
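As a rough local illustration of the block-wise structure such an implementation would follow (a NumPy sketch of right-looking blocked LU without pivoting, not a distributed BlockMatrix implementation):

{code}
import numpy as np

def blocked_lu(A, nb):
    # Returns L and U packed into one matrix (unit lower triangle implicit).
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Unblocked LU of the diagonal block A[k:e, k:e].
        for i in range(k, e):
            for j in range(i + 1, e):
                A[j, i] /= A[i, i]
                A[j, i + 1:e] -= A[j, i] * A[i, i + 1:e]
        if e < n:
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U11 = np.triu(A[k:e, k:e])
            # 2. Triangular solves for the off-diagonal block row and column.
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])        # U12
            A[e:, k:e] = np.linalg.solve(U11.T, A[e:, k:e].T).T  # L21
            # 3. Trailing update A22 -= L21 * U12; in a distributed version this
            #    is the block-wise multiply that can be pipelined across the cluster.
            A[e:, e:] -= A[e:, k:e].dot(A[k:e, e:])
    return A

n = 8
A = np.random.rand(n, n) + n * np.eye(n)  # diagonally dominant, so pivoting is not needed
LU = blocked_lu(A, nb=2)
L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
assert np.allclose(L.dot(U), A)
{code}

A BlockMatrix version would perform the same three steps on distributed blocks, as in the ScaLAPACK reference above.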
[jira] [Assigned] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R
[ https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8506: --- Assignee: Apache Spark SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R - Key: SPARK-8506 URL: https://issues.apache.org/jira/browse/SPARK-8506 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: holdenk Assignee: Apache Spark Priority: Minor While packages can be specified when using the sparkR or sparkSubmit scripts, the programming guide tells people to create their spark context using the R shell + init. The init does have a parameter for jars but no parameter for packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary information. I think a good solution would just be adding another field to the init function to allow people to specify packages in the same way as jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6412) Add Char support in dataTypes.
[ https://issues.apache.org/jira/browse/SPARK-6412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595280#comment-14595280 ] Naden Franciscus commented on SPARK-6412: - This needs to be reopened. The use case couldn't be more clear. Querying from a table that has a CHAR data type is unsupported. Add Char support in dataTypes. -- Key: SPARK-6412 URL: https://issues.apache.org/jira/browse/SPARK-6412 Project: Spark Issue Type: Bug Components: SQL Reporter: Chen Song We can't get the schema of the case class PrimitiveData, because ScalaReflection.schemaFor and dataTypes don't support CharType. case class PrimitiveData( charField: Char, // Can't get the char schema info intField: Int, longField: Long, doubleField: Double, floatField: Float, shortField: Short, byteField: Byte, booleanField: Boolean) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6412) Add Char support in dataTypes.
[ https://issues.apache.org/jira/browse/SPARK-6412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naden Franciscus updated SPARK-6412: Comment: was deleted (was: This needs to be reopened. The use case couldn't be more clear. Querying from a table that has a CHAR data type is unsupported.) Add Char support in dataTypes. -- Key: SPARK-6412 URL: https://issues.apache.org/jira/browse/SPARK-6412 Project: Spark Issue Type: Bug Components: SQL Reporter: Chen Song We can't get the schema of the case class PrimitiveData, because ScalaReflection.schemaFor and dataTypes don't support CharType. case class PrimitiveData( charField: Char, // Can't get the char schema info intField: Int, longField: Long, doubleField: Double, floatField: Float, shortField: Short, byteField: Byte, booleanField: Boolean) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes
[ https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7426. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6540 [https://github.com/apache/spark/pull/6540] spark.ml AttributeFactory.fromStructField should allow other NumericTypes - Key: SPARK-7426 URL: https://issues.apache.org/jira/browse/SPARK-7426 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Mike Dusenberry Priority: Minor Fix For: 1.5.0 It currently only supports DoubleType, but it should support others, at least for fromStructField (importing into ML attribute format, rather than exporting). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8513: -- Description: To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. However, task cancelation is asynchronous. This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories first Note that when testing with local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}} may just get scheduled and create its own task attempt directory {{d2}} under {{D2}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all Spark jobs that writes files with {{FileOutputCommitter}} and its subclasses which creates temporary directories. One of the possible way to fix this issue can be making task cancellation synchronous, but this also increases latency. _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition -- Key: SPARK-8513 URL: https://issues.apache.org/jira/browse/SPARK-8513 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Cheng Lian To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). 
With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. However, task cancelation is asynchronous. This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child
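A tiny standalone illustration of step 7 (hypothetical paths, not the Spark code): once a not-yet-cancelled task re-creates a child directory, the final removal of the parent fails.

{code}
import os
import tempfile

D1 = tempfile.mkdtemp()                    # stands in for .../_temporary/0/_temporary
os.mkdir(os.path.join(D1, "attempt_a1"))   # task a1's attempt directory

os.rmdir(os.path.join(D1, "attempt_a1"))   # abortJob() removes the children it sees...
os.mkdir(os.path.join(D1, "attempt_a2"))   # ...but a late task creates d2 concurrently

try:
    os.rmdir(D1)                           # the final delete of D1 now fails
except OSError as exc:
    print("failed to delete, it still exists:", exc)
{code}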
[jira] [Updated] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8514: - Labels: advanced (was: ) LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8515) Improve ML attribute API
Xiangrui Meng created SPARK-8515: Summary: Improve ML attribute API Key: SPARK-8515 URL: https://issues.apache.org/jira/browse/SPARK-8515 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng In 1.4.0, we introduced ML attribute API to embed feature/label attribute info inside DataFrame's schema. However, the API is not very friendly to use. In 1.5, we should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8515) Improve ML attribute API
[ https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8515: - Labels: advanced (was: ) Improve ML attribute API Key: SPARK-8515 URL: https://issues.apache.org/jira/browse/SPARK-8515 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Labels: advanced In 1.4.0, we introduced ML attribute API to embed feature/label attribute info inside DataFrame's schema. However, the API is not very friendly to use. In 1.5, we should re-visit this API and see how we can improve it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595251#comment-14595251 ] Apache Spark commented on SPARK-7888: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/6927 Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7888: --- Assignee: Apache Spark (was: holdenk) Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7888: --- Assignee: holdenk (was: Apache Spark) Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7443) MLlib 1.4 QA plan
[ https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7443. -- Resolution: Fixed Fix Version/s: 1.4.0 I'm closing this and marking it Fixed. I've checked to make sure the still-unresolved JIRAs linked from this umbrella are marked for appropriate targets. The remaining ones are mainly perf tests (necessary for the next release) or documentation (which we can add to the Spark user guide when those docs are ready). MLlib 1.4 QA plan - Key: SPARK-7443 URL: https://issues.apache.org/jira/browse/SPARK-7443 Project: Spark Issue Type: Umbrella Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Priority: Critical Fix For: 1.4.0 h2. API * Check API compliance using java-compliance-checker (SPARK-7458) * Audit new public APIs (from the generated html doc) ** Scala (do not forget to check the object doc) (SPARK-7537) ** Java compatibility (SPARK-7529) ** Python API coverage (SPARK-7536) * audit Pipeline APIs (SPARK-7535) * graduate spark.ml from alpha (SPARK-7748) ** remove AlphaComponent annotations ** remove mima excludes for spark.ml ** mark concrete classes final wherever reasonable h2. Algorithms and performance *Performance* * _List any other missing performance tests from spark-perf here_ * LDA online/EM (SPARK-7455) * ElasticNet for linear regression and logistic regression (SPARK-7456) * Bernoulli naive Bayes (SPARK-7453) * PIC (SPARK-7454) * ALS.recommendAll (SPARK-7457) * perf-tests in Python (SPARK-7539) *Correctness* * PMML ** scoring using PMML evaluator vs. MLlib models (SPARK-7540) * model save/load (SPARK-7541) h2. Documentation and example code * Create JIRAs for the user guide to each new algorithm and assign them to the corresponding author. Link here as requires ** Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. *** The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. *** We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. * Create example code for major components. Link here as requires ** cross validation in python (SPARK-7387) ** pipeline with complex feature transformations (scala/java/python) (SPARK-7546) ** elastic-net (possibly with cross validation) (SPARK-7547) ** kernel density (SPARK-7707) * Update Programming Guide for 1.4 (towards end of QA) (SPARK-7715) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8513: -- Description: To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. However, task cancelation is asynchronous. This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories first Note that when testing with local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}} may just get scheduled and create its own task attempt directory {{d2}} under {{D2}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all Spark jobs that writes files with {{FileOutputCommitter}} and its subclasses which creates temporary directories. One of the possible way to fix this issue can be making task cancellation synchronous, but this also increases latency. was: To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. However, task cancelation is asynchronous. 
This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories first Note that when testing with local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}} may just get scheduled and create its own task attempt directory {{d2}} under {{D2}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all Spark jobs
[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8513: -- Description: To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. However, task cancelation is asynchronous. This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories first Note that when testing with local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}}, may just get scheduled and create its own task attempt directory {{d2}} under {{D1}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all Spark jobs that writes files with {{FileOutputCommitter}} and its subclasses which creates temporary directories. One of the possible way to fix this issue can be making task cancellation synchronous, but this also increases latency. was: To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. However, task cancelation is asynchronous. 
This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories first Note that when testing with local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}} may just get scheduled and create its own task attempt directory {{d2}} under {{D2}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all Spark jobs
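The failure mode described above ultimately comes down to the fact that a directory can only be deleted while it is empty. Below is a minimal Python sketch (not the actual Spark/Hadoop code; the directory names are made up) that replays the interleaving from the steps above, using {{os.rmdir}} in place of {{java.io.File.delete()}}:
{code}
import os
import tempfile

# D1 plays the role of the job-level _temporary directory; the attempt_* dirs
# play the roles of the task attempt directories d1 and d2.
D1 = tempfile.mkdtemp()
d1 = os.path.join(D1, "attempt_1")
os.mkdir(d1)

os.rmdir(d1)                       # task a1 aborts and removes its own attempt dir

d2 = os.path.join(D1, "attempt_2")
os.mkdir(d2)                       # task a2, canceled asynchronously, still creates its dir

try:
    os.rmdir(D1)                   # abortJob(): like java.io.File.delete(), only empty dirs succeed
except OSError as exc:
    print("Failed to delete dir", D1, "- it still exists:", exc)
{code}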
[jira] [Commented] (SPARK-8475) SparkSubmit with Ivy jars is very slow to load with no internet access
[ https://issues.apache.org/jira/browse/SPARK-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595363#comment-14595363 ] Burak Yavuz commented on SPARK-8475: Me too. I prefer option 1 as well. SparkSubmit with Ivy jars is very slow to load with no internet access -- Key: SPARK-8475 URL: https://issues.apache.org/jira/browse/SPARK-8475 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.4.0 Reporter: Nathan McCarthy Priority: Minor Spark Submit adds Maven Central and the Spark Bintray repo to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the spark shell takes forever to launch as it tries these two remote repos before the ones specified in the --repositories list. In our case we have a proxy which the cluster can access, and we supply it via --repositories. This is also a problem for users who maintain a proxy for maven/ivy repos with something like Nexus/Artifactory. Having a repo proxy is popular at many organisations so I'd say this would be a useful change for these users as well. In the current state even if a maven central proxy is supplied, it will still try and hit central. I see two options for a fix: * Change the order repos are added to the ChainResolver, making the --repositories supplied repos come before anything else. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L843 * Have a config option (like spark.jars.ivy.useDefaultRemoteRepos, default true) which, when false, won't add the Maven Central/Bintray repos to the ChainResolver. Happy to do a PR for this fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
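A minimal sketch of option 1 in plain Python (the repo URLs and function name are illustrative only, not Spark's actual resolver code): user-supplied repositories go ahead of the built-in remote repos, and a flag in the spirit of option 2 can drop the defaults entirely.
{code}
DEFAULT_REMOTE_REPOS = [
    "https://repo1.maven.org/maven2/",               # Maven Central
    "https://dl.bintray.com/spark-packages/maven/",  # Spark Bintray
]

def build_resolver_chain(user_repos, use_default_remote_repos=True):
    # User-supplied --repositories are tried first; the defaults come last (or not at all).
    chain = list(user_repos)
    if use_default_remote_repos:
        chain += DEFAULT_REMOTE_REPOS
    return chain

print(build_resolver_chain(["http://nexus.example.org/repository/maven-public/"]))
{code}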
[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7398: - Priority: Critical (was: Major) Target Version/s: 1.5.0 Add back-pressure to Spark Streaming Key: SPARK-7398 URL: https://issues.apache.org/jira/browse/SPARK-7398 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: François Garillot Priority: Critical Labels: streams Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a high throughput of input data w.r.t. Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This issue aims at transmitting a back-pressure signal back to the data ingestion side to help deal with that high throughput, in a backwards-compatible way. The design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
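A back-of-the-envelope sketch of the instability described above (the numbers are made up, for illustration only): whenever the batch processing time exceeds the batch interval, the backlog of unprocessed records grows with every batch, and so does the memory held on the receiver's executor.
{code}
batch_interval_s = 2.0        # illustrative values only
processing_time_s = 2.5
ingest_rate_rec_per_s = 10_000

# Records that arrive while one batch is being processed, minus the records that batch drains.
backlog_growth_per_batch = ingest_rate_rec_per_s * (processing_time_s - batch_interval_s)
print(f"backlog grows by ~{backlog_growth_per_batch:.0f} records per processed batch")
{code}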
[jira] [Commented] (SPARK-7398) Add back-pressure to Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595374#comment-14595374 ] Tathagata Das commented on SPARK-7398: -- I took a look at the whole design doc. It's very well composed, but the details of the actual code changes are a little unclear. Now that you have a working branch, I strongly recommend writing an additional design doc which skips all the intro and background and just focuses on the code changes. Here is a design doc for inspiration; this is the original design doc for the Write Ahead Log. https://docs.google.com/document/d/1vTCB5qVfyxQPlHuv8rit9-zjdttlgaSrMgfCDQlCJIM/edit#heading=h.9xoxtbgz551y See the architecture and proposed implementation sections. Accordingly, you should have the following two sections: 1. Use diagrams to explain the high-level control flow in the architecture, with the new classes in the picture and how they interoperate/interface with existing classes (BTW, high-level = not as detailed as the control flow that you have in the earlier design doc). 2. The details of every class and interface that needs to be introduced or modified. Especially focus on the interfaces for (1) the heuristic algorithm and (2) the congestion control. This will allow me and others to evaluate the architecture more critically. Then if needed we can break up the task into smaller sub-tasks (as done in the case of the WAL JIRA - https://issues.apache.org/jira/browse/SPARK-3129). Add back-pressure to Spark Streaming Key: SPARK-7398 URL: https://issues.apache.org/jira/browse/SPARK-7398 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: François Garillot Labels: streams Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a high throughput of input data w.r.t. Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This issue aims at transmitting a back-pressure signal back to the data ingestion side to help deal with that high throughput, in a backwards-compatible way. The design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8513: -- Description: To reproduce this issue, we need a node with a relatively large number of cores, say 32 (e.g., the Spark Jenkins builder is a good candidate). With such a node, the following code should make it relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo") {code} You may observe log lines similar to the one below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on the driver side. At the same time, all child tasks of this job also get canceled. However, task cancellation is asynchronous. This means some tasks may still be running when the job is already killed on the driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file. Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a temporary directory {{D1}} is created to hold the output files of the different task attempts. # Task {{a1}} is the first to fail, after several retries, because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on the driver side, and all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories. Note that when testing with a local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to perform the deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}}, may just get scheduled and create its own task attempt directory {{d2}} under {{D1}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all Spark jobs that write files with {{FileOutputCommitter}} and its subclasses that create and delete temporary directories. One possible way to fix this issue is to make task cancellation synchronous, but this also increases latency. was: To reproduce this issue, we need a node with relatively more cores, say 32 (e.g., Spark Jenkins builder is a good candidate). With such a node, the following code should be relatively easy to reproduce this issue: {code} sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode(overwrite).parquet(file:///tmp/foo) {code} You may observe similar log lines as below: {noformat} 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists. {noformat} The reason is that, for a Spark job with multiple tasks, when a task fails after multiple retries, the job gets canceled on driver side. At the same time, all child tasks of this job also get canceled. 
However, task cancelation is asynchronous. This means, some tasks may still be running when the job is already killed on driver side. With this in mind, the following execution order may cause the log line mentioned above: # Job {{A}} spawns 32 tasks to write the Parquet file Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputClass}}, a temporary directory {{D1}} is created to hold output files of different task attempts. # Task {{a1}} fails after several retries first because of the division by zero error # Task {{a1}} aborts the Parquet write task and tries to remove its task attempt output directory {{d1}} (a sub-directory of {{D1}}) # Job {{A}} gets canceled on driver side, all the other 31 tasks also get canceled *asynchronously* # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first removing all its child files/directories first Note that when testing with local directory, {{RawLocalFileSystem}} simply calls {{java.io.File.delete()}} to deletion, and only empty directories can be deleted. # Because tasks are canceled asynchronously, some other task, say {{a2}}, may just get scheduled and create its own task attempt directory {{d2}} under {{D1}} # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} itself, but fails because {{d2}} makes {{D1}} non-empty again Notice that this bug affects all
[jira] [Updated] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
[ https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8513: -- Component/s: SQL _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition -- Key: SPARK-8513 URL: https://issues.apache.org/jira/browse/SPARK-8513 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.2.2, 1.3.1, 1.4.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R
[ https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8506: --- Assignee: (was: Apache Spark) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R - Key: SPARK-8506 URL: https://issues.apache.org/jira/browse/SPARK-8506 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: holdenk Priority: Minor While packages can be specified when using the sparkR or sparkSubmit scripts, the programming guide tells people to create their spark context using the R shell + init. The init does have a parameter for jars but no parameter for packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary information. I think a good solution would just be adding another field to the init function to allow people to specify packages in the same way as jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8506) SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R
[ https://issues.apache.org/jira/browse/SPARK-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595275#comment-14595275 ] Apache Spark commented on SPARK-8506: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/6928 SparkR does not provide an easy way to depend on Spark Packages when performing init from inside of R - Key: SPARK-8506 URL: https://issues.apache.org/jira/browse/SPARK-8506 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: holdenk Priority: Minor While packages can be specified when using the sparkR or sparkSubmit scripts, the programming guide tells people to create their spark context using the R shell + init. The init does have a parameter for jars but no parameter for packages. Setting the SPARKR_SUBMIT_ARGS overwrites some necessary information. I think a good solution would just be adding another field to the init function to allow people to specify packages in the same way as jars. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition
Cheng Lian created SPARK-8513: - Summary: _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition Key: SPARK-8513 URL: https://issues.apache.org/jira/browse/SPARK-8513 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0, 1.3.1, 1.2.2 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8510) NumPy arrays and matrices as values in sequence files
[ https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Aberline updated SPARK-8510: -- Description: Using the DoubleArrayWritable example, I have added support for storing NumPy double arrays and matrices as arrays of doubles and nested arrays of doubles as value elements of Sequence Files. Each value element is a discrete matrix or array. This is useful where you have many matrices that you don't want to join into a single Spark Data Frame to store in a Parquet file. Pandas DataFrames can be easily converted to and from NumPy matrices, so I've also added the ability to store the schema-less data from DataFrames and Series that contain double data. There seems to be demand for this functionality: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E I'll be issuing a PR for this shortly. was: I have extended the provided example code DoubleArrayWritable example to store NumPy double type arrays and matrices as arrays of doubles and nested arrays of doubles. Pandas DataFrames can be easily converted to NumPy matrices, so I've also added the ability to store the schema-less data from DataFrames and Series that contain double data. Other than my own use there seems to be demand for this functionality: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E I'll be issuing a PR for this shortly. NumPy arrays and matrices as values in sequence files - Key: SPARK-8510 URL: https://issues.apache.org/jira/browse/SPARK-8510 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Peter Aberline Priority: Minor Using the DoubleArrayWritable example, I have added support for storing NumPy double arrays and matrices as arrays of doubles and nested arrays of doubles as value elements of Sequence Files. Each value element is a discrete matrix or array. This is useful where you have many matrices that you don't want to join into a single Spark Data Frame to store in a Parquet file. Pandas DataFrames can be easily converted to and from NumPy matrices, so I've also added the ability to store the schema-less data from DataFrames and Series that contain double data. There seems to be demand for this functionality: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E I'll be issuing a PR for this shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
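A small Python sketch of the round trip the description above relies on (plain pandas/NumPy, independent of the proposed Spark converters): a double-valued pandas DataFrame becomes a schema-less NumPy matrix, and converting back loses the column names.
{code}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

mat = df.values                 # 2-D ndarray of doubles; the schema (column names) is gone
restored = pd.DataFrame(mat)    # columns come back as 0, 1, ... instead of "a", "b"

print(mat.dtype, mat.shape, list(restored.columns))
{code}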
[jira] [Updated] (SPARK-7075) Project Tungsten: Improving Physical Execution
[ https://issues.apache.org/jira/browse/SPARK-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7075: --- Summary: Project Tungsten: Improving Physical Execution (was: Project Tungsten: Improving Physical Execution and Memory Management) Project Tungsten: Improving Physical Execution -- Key: SPARK-7075 URL: https://issues.apache.org/jira/browse/SPARK-7075 Project: Spark Issue Type: Epic Components: Block Manager, Shuffle, Spark Core, SQL Reporter: Reynold Xin Assignee: Reynold Xin Based on our observation, majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory. This project focuses on 3 areas to improve the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of the underlying hardware. *Memory Management and Binary Processing* - Avoiding non-transient Java objects (store them in binary format), which reduces GC overhead. - Minimizing memory usage through denser in-memory data format, which means we spill less. - Better memory accounting (size of bytes) rather than relying on heuristics - For operators that understand data types (in the case of DataFrames and SQL), work directly against binary format in memory, i.e. have no serialization/deserialization *Cache-aware Computation* - Faster sorting and hashing for aggregations, joins, and shuffle *Code Generation* - Faster expression evaluation and DataFrame/SQL operators - Faster serializer Several parts of project Tungsten leverage the DataFrame model, which gives us more semantics about the application. We will also retrofit the improvements onto Spark’s RDD API whenever possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595226#comment-14595226 ] Joseph K. Bradley commented on SPARK-1503: -- [~staple] [~lewuathe] Can you please coordinate about how convergence is measured, as in [SPARK-3382]? The current implementation for [SPARK-3382] uses relative convergence w.r.t. the weight vector, where it measures relative to the weight vector from the previous iteration. I figure we should use the same convergence criterion for both accelerated and non-accelerated gradient descent. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
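A small sketch of the relative convergence test being discussed (an assumed form for illustration, not the exact SPARK-3382 code): the change in the weight vector is compared against the norm of the previous iteration's weights.
{code}
import numpy as np

def converged(w_new, w_old, tol=1e-3):
    # Relative convergence w.r.t. the previous weight vector; the max(..., 1.0)
    # guard avoids dividing by a near-zero norm.
    return np.linalg.norm(w_new - w_old) < tol * max(np.linalg.norm(w_old), 1.0)

print(converged(np.array([1.0005, 2.0]), np.array([1.0, 2.0])))   # True
print(converged(np.array([1.5, 2.0]), np.array([1.0, 2.0])))      # False
{code}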
[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595239#comment-14595239 ] Reynold Xin commented on SPARK-8072: I think it's best to say the name of the duplicate columns. Better AnalysisException for writing DataFrame with identically named columns - Key: SPARK-8072 URL: https://issues.apache.org/jira/browse/SPARK-8072 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker We should check if there are duplicate columns, and if yes, throw an explicit error message saying there are duplicate columns. See current error message below. {code} In [3]: df.withColumn('age', df.age) Out[3]: DataFrame[age: bigint, name: string, age: bigint] In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out') --- Py4JJavaError Traceback (most recent call last) ipython-input-4-eecb85256898 in module() 1 df.withColumn('age', df.age).write.parquet('test-parquet.out') /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode) 350 df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) 351 -- 352 self._jwrite.mode(mode).parquet(path) 353 354 @since(1.4) /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args) 535 answer = self.gateway_client.send_command(command) 536 return_value = get_return_value(answer, self.gateway_client, -- 537 self.target_id, self.name) 538 539 for temp_arg in temp_args: /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. -- 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o35.parquet. 
: org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could be: age#0L, age#3L.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
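A minimal Python sketch of the check suggested in the comment above (plain Python, not the actual Catalyst analysis code): detect duplicate column names up front and name them in the error, instead of failing later with an ambiguous-reference exception.
{code}
from collections import Counter

def assert_no_duplicate_columns(column_names):
    duplicates = sorted(name for name, count in Counter(column_names).items() if count > 1)
    if duplicates:
        raise ValueError("Duplicate column(s) found when writing: " + ", ".join(duplicates))

assert_no_duplicate_columns(["name", "age"])           # fine
assert_no_duplicate_columns(["age", "name", "age"])    # ValueError names the duplicate: age
{code}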
[jira] [Closed] (SPARK-8509) Failed to JOIN in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-8509. Resolution: Not A Problem Failed to JOIN in pyspark - Key: SPARK-8509 URL: https://issues.apache.org/jira/browse/SPARK-8509 Project: Spark Issue Type: Bug Reporter: afancy Hi, I am writing pyspark stream program. I have the training data set to compute the regression model. I want to use the stream data set to test the model. So, I join with RDD with the StreamRDD, but i got the exception. Following are my source code, and the exception I got. Any help is appreciated. Thanks Regards, Afancy {code} from __future__ import print_function import sys,os,datetime from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql.context import SQLContext from pyspark.resultiterable import ResultIterable from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD import numpy as np import statsmodels.api as sm def splitLine(line, delimiter='|'): values = line.split(delimiter) st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S') return (values[0],st.hour), values[2:] def reg_m(y, x): ones = np.ones(len(x[0])) X = sm.add_constant(np.column_stack((x[0], ones))) for ele in x[1:]: X = sm.add_constant(np.column_stack((ele, X))) results = sm.OLS(y, X).fit() return results def train(line): y,x = [],[] y, x = [],[[],[],[],[],[],[]] reading_tmp,temp_tmp = [],[] i = 0 for reading, temperature in line[1]: if i%4==0 and len(reading_tmp)==4: y.append(reading_tmp.pop()) x[0].append(reading_tmp.pop()) x[1].append(reading_tmp.pop()) x[2].append(reading_tmp.pop()) temp = float(temp_tmp[0]) del temp_tmp[:] x[3].append(temp-20.0 if temp20.0 else 0.0) x[4].append(16.0-temp if temp16.0 else 0.0) x[5].append(5.0-temp if temp5.0 else 0.0) reading_tmp.append(float(reading)) temp_tmp.append(float(temperature)) i = i + 1 return str(line[0]),reg_m(y, x).params.tolist() if __name__ == __main__: if len(sys.argv) != 4: print(Usage: regression.py checkpointDir trainingDataDir streamDataDir, file=sys.stderr) exit(-1) checkpoint, trainingInput, streamInput = sys.argv[1:] sc = SparkContext(local[2], appName=BenchmarkSparkStreaming) trainingLines = sc.textFile(trainingInput) modelRDD = trainingLines.map(lambda line: splitLine(line, |))\ .groupByKey()\ .map(lambda line: train(line))\ .cache() ssc = StreamingContext(sc, 2) ssc.checkpoint(checkpoint) lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, |)) testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), line[1])).transform(lambda rdd: rdd.leftOuterJoin(modelRDD)) testRDD.pprint(20) ssc.start() ssc.awaitTermination() {code} {code} 15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6 Traceback (most recent call last): File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py, line 90, in dumps return bytearray(self.serializer.dumps((func.func, func.deserializers))) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py, line 427, in dumps return cloudpickle.dumps(obj, 2) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 622, in dumps cp.dump(obj) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 107, in dump return Pickler.dump(self, obj) File /usr/lib/python2.7/pickle.py, line 224, in dump self.save(obj) File 
/usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.7/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 193, in save_function self.save_function_tuple(obj) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 236, in save_function_tuple save((code, closure, base_globals)) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File
[jira] [Commented] (SPARK-8509) Failed to JOIN in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595163#comment-14595163 ] Joseph K. Bradley commented on SPARK-8509: -- I believe the bug is here, where you'll need to be careful about what is a DStream vs RDD vs local data: {code} testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), line[1])).transform(lambda rdd: rdd.leftOuterJoin(modelRDD)) testRDD.pprint(20) {code} I believe this is a bug in your code, not in Spark, so please post on the user list instead of JIRA. I'll close this for now. Failed to JOIN in pyspark - Key: SPARK-8509 URL: https://issues.apache.org/jira/browse/SPARK-8509 Project: Spark Issue Type: Bug Reporter: afancy Hi, I am writing pyspark stream program. I have the training data set to compute the regression model. I want to use the stream data set to test the model. So, I join with RDD with the StreamRDD, but i got the exception. Following are my source code, and the exception I got. Any help is appreciated. Thanks Regards, Afancy {code} from __future__ import print_function import sys,os,datetime from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql.context import SQLContext from pyspark.resultiterable import ResultIterable from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD import numpy as np import statsmodels.api as sm def splitLine(line, delimiter='|'): values = line.split(delimiter) st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S') return (values[0],st.hour), values[2:] def reg_m(y, x): ones = np.ones(len(x[0])) X = sm.add_constant(np.column_stack((x[0], ones))) for ele in x[1:]: X = sm.add_constant(np.column_stack((ele, X))) results = sm.OLS(y, X).fit() return results def train(line): y,x = [],[] y, x = [],[[],[],[],[],[],[]] reading_tmp,temp_tmp = [],[] i = 0 for reading, temperature in line[1]: if i%4==0 and len(reading_tmp)==4: y.append(reading_tmp.pop()) x[0].append(reading_tmp.pop()) x[1].append(reading_tmp.pop()) x[2].append(reading_tmp.pop()) temp = float(temp_tmp[0]) del temp_tmp[:] x[3].append(temp-20.0 if temp20.0 else 0.0) x[4].append(16.0-temp if temp16.0 else 0.0) x[5].append(5.0-temp if temp5.0 else 0.0) reading_tmp.append(float(reading)) temp_tmp.append(float(temperature)) i = i + 1 return str(line[0]),reg_m(y, x).params.tolist() if __name__ == __main__: if len(sys.argv) != 4: print(Usage: regression.py checkpointDir trainingDataDir streamDataDir, file=sys.stderr) exit(-1) checkpoint, trainingInput, streamInput = sys.argv[1:] sc = SparkContext(local[2], appName=BenchmarkSparkStreaming) trainingLines = sc.textFile(trainingInput) modelRDD = trainingLines.map(lambda line: splitLine(line, |))\ .groupByKey()\ .map(lambda line: train(line))\ .cache() ssc = StreamingContext(sc, 2) ssc.checkpoint(checkpoint) lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, |)) testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), line[1])).transform(lambda rdd: rdd.leftOuterJoin(modelRDD)) testRDD.pprint(20) ssc.start() ssc.awaitTermination() {code} {code} 15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6 Traceback (most recent call last): File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py, line 90, in dumps return bytearray(self.serializer.dumps((func.func, func.deserializers))) File 
/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py, line 427, in dumps return cloudpickle.dumps(obj, 2) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 622, in dumps cp.dump(obj) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 107, in dump return Pickler.dump(self, obj) File /usr/lib/python2.7/pickle.py, line 224, in dump self.save(obj) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.7/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File
[jira] [Created] (SPARK-8512) Web UI Inconsistent with History Server in Standalone Mode
Jonathon Cai created SPARK-8512: --- Summary: Web UI Inconsistent with History Server in Standalone Mode Key: SPARK-8512 URL: https://issues.apache.org/jira/browse/SPARK-8512 Project: Spark Issue Type: Bug Reporter: Jonathon Cai I find that when I examine the web UI, a couple of bugs arise: 1. There is a discrepancy between the application duration reported by the history server and the one reported by the web UI (default address is master:8080). I checked more specific details, including task and stage durations (when clicking on the application), and these appear to be the same in both places. 2. Sometimes the web UI on master:8080 is unable to display more specific information for an application that has finished (when clicking on the application), even when there is a log file in the appropriate directory. But when the history server is opened, it is able to read this file and output the information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8512) Web UI Inconsistent with History Server in Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathon Cai updated SPARK-8512: Affects Version/s: 1.4.0 Web UI Inconsistent with History Server in Standalone Mode -- Key: SPARK-8512 URL: https://issues.apache.org/jira/browse/SPARK-8512 Project: Spark Issue Type: Bug Affects Versions: 1.4.0 Reporter: Jonathon Cai I find that when I examine the web UI, a couple bugs arise: 1. There is a discrepancy between the number denoting the duration of the application when I run the history server and the number given by the web UI (default address is master:8080). I checked more specific details, including task and stage durations (when clicking on the application), and these appear to be the same for both avenues. 2. Sometimes the web UI on master:8080 is unable to display more specific information for an application that has finished (when clicking on the application), even when there is a log file in the appropriate directory. But when the history server is opened, it is able to read this file and output information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8510) NumPy arrays and matrices as values in sequence files
[ https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Aberline updated SPARK-8510: -- Summary: NumPy arrays and matrices as values in sequence files (was: Support NumPy arrays and matrices as values in sequence files) NumPy arrays and matrices as values in sequence files - Key: SPARK-8510 URL: https://issues.apache.org/jira/browse/SPARK-8510 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Peter Aberline Priority: Minor I have extended the provided example code DoubleArrayWritable example to store NumPy double type arrays and matrices as arrays of doubles and nested arrays of doubles. Pandas DataFrames can be easily converted to NumPy matrices, so I've also added the ability to store the schema-less data from DataFrames and Series that contain double data. Other than my own use there seems to be demand for this functionality: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E I'll be issuing a PR for this shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
[ https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595205#comment-14595205 ] Bolke de Bruin commented on SPARK-5111: --- This patch does not work on a hadoop 2.6 + kerberos + yarn cluster. The SparkContext cannot be setup if the patch is nor earlier in the process (in SparkContext.init). However, just moving the patch makes the connection fail in HiveContext. Keeping it (ie. duplicating) in two places does not work as the class became stale if I remember correctly. Latest spark build (1.5.0-SNAPSHOT) from git does not go beyond 0.13 yet it seems even if something like spark.sql.hive.metastore.version0.14 spark.sql.hive.metastore.jars /usr/hdp/2.2.0.0-2041/hadoop/conf:/usr/hdp/2.2.0.0-2041/hadoop/lib/*:/usr/hdp/2.2.0.0-2041/hadoop/.//*:/usr/hdp/2.2.0.0-2041/hadoop-hdfs/./:/usr/hdp/2.2.0.0-2041/hadoop-hdfs/lib/*:/usr/hdp/2.2.0.0-2041/hadoop-hdfs/.//*:/usr/hdp/2.2.0.0-2041/hadoop-yarn/lib/*:/usr/hdp/2.2.0.0-2041/hadoop-yarn/.//*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/lib/*:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/.//*::/usr/share/java/mysql-connector-java-5.1.17.jar:/usr/share/java/mysql-connector-java-5.1.34-bin.jar:/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-mapreduce-client/*:/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*:/etc/tez/conf/:/usr/hdp/2.2.0.0-2041/tez/*:/usr/hdp/2.2.0.0-2041/tez/lib/*:/etc/tez/conf:/usr/hdp/2.2.0.0-2041/hive/lib/* is configured (an out of memory error is then thrown, could not get rid of it). HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 --- Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang Due to java.lang.NoSuchFieldError: SASL_PROPS error. Need to backport some hive-0.14 fix into spark, since there is no effort to upgrade hive to 0.14 support in spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7715) Update MLlib Programming Guide for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7715. -- Resolution: Fixed Fix Version/s: (was: 1.4.0) 1.4.1 1.5.0 Issue resolved by pull request 6897 [https://github.com/apache/spark/pull/6897] Update MLlib Programming Guide for 1.4 -- Key: SPARK-7715 URL: https://issues.apache.org/jira/browse/SPARK-7715 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.5.0, 1.4.1 Before the release, we need to update the MLlib Programming Guide. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs. * Check phrasing, especially in main sections (for outdated items such as In this release, ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595159#comment-14595159 ] Yu Ishikawa commented on SPARK-6813: I got it. As you mentioned, disabling those two checks affects {{if-else}} blocks. And according to the Google R style guide, an opening curly brace should never go on its own line, as you know. Alright, I think we should go with your suggestion. If {{lintr}} supports any linters for {{if}} clauses and {{function}} definitions in the future, we can discuss that again. What do you think? I have asked the creator of lintr by email about the details of the curly-brace linter. SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style-checking tool. https://github.com/jimhester/lintr seems promising We could have an R style guide based on the one from Google [1], and adjust some of its rules based on the conversation in Spark: 1. Line Length: maximum 100 characters 2. No limit on function name length (the API should be similar to other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7604) Python API for PCA and PCAModel
[ https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7604. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6315 [https://github.com/apache/spark/pull/6315] Python API for PCA and PCAModel --- Key: SPARK-7604 URL: https://issues.apache.org/jira/browse/SPARK-7604 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Fix For: 1.5.0 Python API for org.apache.spark.mllib.feature.PCA and org.apache.spark.mllib.feature.PCAModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings
[ https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595171#comment-14595171 ] Ben Sully commented on SPARK-7499: -- I just had a quick go and it seems quite possible:
{code}
name_to_col <- function(colname, data) {
  eval(parse(text = paste0(substitute(data), "$", colname)))
}
{code}
The colname and then the column could be extracted using lazyeval::all_dots like this:
{code}
select_.DataFrame <- function(.data, ..., .dots) {
  dots <- lazyeval::all_dots(..., .dots)
  colnames <- sapply(dots, function(x) x$expr)
  columns <- sapply(colnames, name_to_col, data = .data)
  do_something_with_columns()
}

select(df, name, age)
{code}
The method I've used is quite primitive, in that it just uses `substitute` to find out the name of the DataFrame object being passed, then uses `eval` to resolve the {tablename}${column} into the actual object which then gets passed to SparkR. It feels quite hacky though, by relying on the `$` method of DataFrames. This is just adding a method to dplyr which allows its select, mutate, filter, group_by and summarise functions to work with DataFrames. I think the part which allows it to work without strings or df$age is the lazyeval package. Investigate how to specify columns in SparkR without $ or strings - Key: SPARK-7499 URL: https://issues.apache.org/jira/browse/SPARK-7499 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Right now in SparkR we need to specify the columns used using `$` or strings. For example to run select we would do {code} df1 <- select(df, df$age > 10) {code} It would be good to infer the set of columns in a dataframe automatically and resolve symbols for column names. For example {code} df1 <- select(df, age > 10) {code} One way to do this is to build an environment mapping all the column names to column handles and then use `substitute(arg, env = columnNameEnv)` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output
[ https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8508. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6925 [https://github.com/apache/spark/pull/6925] Test case SQLQuerySuite.test script transform for stderr generates super long output -- Key: SPARK-8508 URL: https://issues.apache.org/jira/browse/SPARK-8508 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 This test case writes 100,000 lines of integer triples to stderr, and makes Jenkins build output unnecessarily large and hard to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings
[ https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595153#comment-14595153 ] Shivaram Venkataraman commented on SPARK-7499: -- [~sd2k] Thanks for taking a shot at this -- One question I have about this approach is whether we can avoid implementing new versions of all these functions, i.e., is there a common sub-routine we can create that, given a set of identifiers / columns, will resolve them to SparkR DataFrame columns? Also, just to help me understand this better, what is the specific function in dplyr we are using here? Investigate how to specify columns in SparkR without $ or strings - Key: SPARK-7499 URL: https://issues.apache.org/jira/browse/SPARK-7499 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Right now in SparkR we need to specify the columns used using `$` or strings. For example to run select we would do {code} df1 <- select(df, df$age > 10) {code} It would be good to infer the set of columns in a dataframe automatically and resolve symbols for column names. For example {code} df1 <- select(df, age > 10) {code} One way to do this is to build an environment mapping all the column names to column handles and then use `substitute(arg, env = columnNameEnv)` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes
[ https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7426: - Assignee: Mike Dusenberry spark.ml AttributeFactory.fromStructField should allow other NumericTypes - Key: SPARK-7426 URL: https://issues.apache.org/jira/browse/SPARK-7426 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Mike Dusenberry Priority: Minor It currently only supports DoubleType, but it should support others, at least for fromStructField (importing into ML attribute format, rather than exporting). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8509) Failed to JOIN in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8509: - Description: Hi, I am writing pyspark stream program. I have the training data set to compute the regression model. I want to use the stream data set to test the model. So, I join with RDD with the StreamRDD, but i got the exception. Following are my source code, and the exception I got. Any help is appreciated. Thanks Regards, Afancy {code} from __future__ import print_function import sys,os,datetime from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql.context import SQLContext from pyspark.resultiterable import ResultIterable from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD import numpy as np import statsmodels.api as sm def splitLine(line, delimiter='|'): values = line.split(delimiter) st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S') return (values[0],st.hour), values[2:] def reg_m(y, x): ones = np.ones(len(x[0])) X = sm.add_constant(np.column_stack((x[0], ones))) for ele in x[1:]: X = sm.add_constant(np.column_stack((ele, X))) results = sm.OLS(y, X).fit() return results def train(line): y,x = [],[] y, x = [],[[],[],[],[],[],[]] reading_tmp,temp_tmp = [],[] i = 0 for reading, temperature in line[1]: if i%4==0 and len(reading_tmp)==4: y.append(reading_tmp.pop()) x[0].append(reading_tmp.pop()) x[1].append(reading_tmp.pop()) x[2].append(reading_tmp.pop()) temp = float(temp_tmp[0]) del temp_tmp[:] x[3].append(temp-20.0 if temp20.0 else 0.0) x[4].append(16.0-temp if temp16.0 else 0.0) x[5].append(5.0-temp if temp5.0 else 0.0) reading_tmp.append(float(reading)) temp_tmp.append(float(temperature)) i = i + 1 return str(line[0]),reg_m(y, x).params.tolist() if __name__ == __main__: if len(sys.argv) != 4: print(Usage: regression.py checkpointDir trainingDataDir streamDataDir, file=sys.stderr) exit(-1) checkpoint, trainingInput, streamInput = sys.argv[1:] sc = SparkContext(local[2], appName=BenchmarkSparkStreaming) trainingLines = sc.textFile(trainingInput) modelRDD = trainingLines.map(lambda line: splitLine(line, |))\ .groupByKey()\ .map(lambda line: train(line))\ .cache() ssc = StreamingContext(sc, 2) ssc.checkpoint(checkpoint) lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, |)) testRDD = lines.groupByKeyAndWindow(4,2).map(lambda line:(str(line[0]), line[1])).transform(lambda rdd: rdd.leftOuterJoin(modelRDD)) testRDD.pprint(20) ssc.start() ssc.awaitTermination() 15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6 Traceback (most recent call last): File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py, line 90, in dumps return bytearray(self.serializer.dumps((func.func, func.deserializers))) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py, line 427, in dumps return cloudpickle.dumps(obj, 2) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 622, in dumps cp.dump(obj) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 107, in dump return Pickler.dump(self, obj) File /usr/lib/python2.7/pickle.py, line 224, in dump self.save(obj) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.7/pickle.py, line 548, in save_tuple save(element) 
File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 193, in save_function self.save_function_tuple(obj) File /opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py, line 236, in save_function_tuple save((code, closure, base_globals)) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.7/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.7/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.7/pickle.py, line 600, in save_list self._batch_appends(iter(obj)) File /usr/lib/python2.7/pickle.py, line 633, in _batch_appends save(x) File
[jira] [Updated] (SPARK-8509) Failed to JOIN in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8509: - Description: Hi, I am writing a pyspark streaming program. I have a training data set to compute the regression model, and I want to use the stream data set to test the model. So I join the RDD with the stream RDD, but I got the exception below. Following are my source code and the exception I got. Any help is appreciated. Thanks. Regards, Afancy
{code}
from __future__ import print_function
import sys, os, datetime
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.context import SQLContext
from pyspark.resultiterable import ResultIterable
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy as np
import statsmodels.api as sm

def splitLine(line, delimiter='|'):
    values = line.split(delimiter)
    st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
    return (values[0], st.hour), values[2:]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

def train(line):
    y, x = [], []
    y, x = [], [[], [], [], [], [], []]
    reading_tmp, temp_tmp = [], []
    i = 0
    for reading, temperature in line[1]:
        if i % 4 == 0 and len(reading_tmp) == 4:
            y.append(reading_tmp.pop())
            x[0].append(reading_tmp.pop())
            x[1].append(reading_tmp.pop())
            x[2].append(reading_tmp.pop())
            temp = float(temp_tmp[0])
            del temp_tmp[:]
            x[3].append(temp-20.0 if temp > 20.0 else 0.0)
            x[4].append(16.0-temp if temp < 16.0 else 0.0)
            x[5].append(5.0-temp if temp < 5.0 else 0.0)
        reading_tmp.append(float(reading))
        temp_tmp.append(float(temperature))
        i = i + 1
    return str(line[0]), reg_m(y, x).params.tolist()

if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: regression.py checkpointDir trainingDataDir streamDataDir", file=sys.stderr)
        exit(-1)
    checkpoint, trainingInput, streamInput = sys.argv[1:]
    sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")
    trainingLines = sc.textFile(trainingInput)
    modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
                            .groupByKey()\
                            .map(lambda line: train(line))\
                            .cache()

    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(checkpoint)
    lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, "|"))
    testRDD = lines.groupByKeyAndWindow(4, 2)\
                   .map(lambda line: (str(line[0]), line[1]))\
                   .transform(lambda rdd: rdd.leftOuterJoin(modelRDD))
    testRDD.pprint(20)
    ssc.start()
    ssc.awaitTermination()
{code}
{code}
15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
Traceback (most recent call last):
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py", line 90, in dumps
    return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 427, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 622, in dumps
    cp.dump(obj)
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 193, in save_function
    self.save_function_tuple(obj)
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
{code}
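A minimal workaround sketch, not a confirmed fix: the traceback comes from cloudpickle trying to serialize the `transform` closure, which captures {{modelRDD}}, and an RDD referenced from inside a shipped closure cannot be pickled. If the fitted parameters are small, one way around this is to collect them to the driver and broadcast them, then look them up per key instead of joining two RDDs. The names below reuse {{sc}}, {{modelRDD}} and {{lines}} from the reporter's script; everything else is illustrative.
{code}
# Hedged sketch: avoid capturing modelRDD in the DStream closure by
# broadcasting the collected model parameters instead.
modelMap = sc.broadcast(dict(modelRDD.collect()))  # {str((id, hour)): params}

def attachModel(kv):
    key, readings = kv
    # Look up the broadcast parameters rather than joining with an RDD,
    # so the streaming closure no longer references modelRDD.
    return key, (readings, modelMap.value.get(key))

testStream = lines.groupByKeyAndWindow(4, 2)\
                  .map(lambda kv: (str(kv[0]), kv[1]))\
                  .map(attachModel)
testStream.pprint(20)
{code}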
[jira] [Commented] (SPARK-8472) Python API for DCT
[ https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594977#comment-14594977 ] Yu Ishikawa commented on SPARK-8472: Please assign this issue to me. Python API for DCT -- Key: SPARK-8472 URL: https://issues.apache.org/jira/browse/SPARK-8472 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Priority: Minor We need to implement a wrapper for enabling the DCT feature transformer to be used from the Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
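For illustration, a rough sketch of what usage from Python could look like once the wrapper exists, assuming it mirrors the Scala DCT transformer's {{inputCol}}, {{outputCol}} and {{inverse}} params; the class name, parameter names, and the pre-existing {{sqlContext}} are assumptions, not the final API.
{code}
# Hypothetical usage of a Python DCT wrapper; names may differ in the final API.
from pyspark.ml.feature import DCT
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(Vectors.dense([0.0, 1.0, -2.0, 3.0]),)], ["vec"])
dct = DCT(inverse=False, inputCol="vec", outputCol="resultVec")
dct.transform(df).show()
{code}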
[jira] [Resolved] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8379. --- Resolution: Fixed Fix Version/s: 1.4.1 1.5.0 Issue resolved by pull request 6833 [https://github.com/apache/spark/pull/6833] LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn Assignee: jeanlyn Fix For: 1.5.0, 1.4.1 When inserting into a table using dynamic partitions with *spark.speculation=true*, skewed data in some partitions triggers speculative tasks, and the job throws an exception like: {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
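For context, a minimal sketch of the kind of job reported to hit this; the table and column names are made up, only the configuration pattern matters. With speculative execution on and a skewed partition, the original task and its speculative copy can both try to write the same dynamic-partition output file, which surfaces as the HDFS lease mismatch above.
{code}
# Illustrative sketch only; `target`, `source` and the columns are hypothetical.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().set("spark.speculation", "true")   # speculative execution on
sc = SparkContext(conf=conf)
hiveContext = HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

# A skewed partition makes one task slow enough to trigger a speculative copy,
# and both attempts write under the same dynamic-partition scratch path.
hiveContext.sql(
    "INSERT OVERWRITE TABLE target PARTITION (ds, type) "
    "SELECT value, ds, type FROM source")
{code}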
[jira] [Created] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output
Cheng Lian created SPARK-8508: - Summary: Test case SQLQuerySuite.test script transform for stderr generates super long output Key: SPARK-8508 URL: https://issues.apache.org/jira/browse/SPARK-8508 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor This test case writes 100,000 lines of integer triples to stderr, and makes Jenkins build output unnecessarily large and hard to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
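For reference, the pattern the test exercises looks roughly like the following sketch (expressed in PySpark terms against a Hive-enabled {{sqlContext}}, not the actual Scala test code): a Hive script transform whose command copies every input row to stderr, which is what floods the Jenkins log when the input is large.
{code}
# Rough illustration of the tested pattern, not the SQLQuerySuite source.
# Each row piped through `cat 1>&2` is echoed to stderr, so a large input
# table produces an equally large stderr stream in the build output.
sqlContext.sql(
    "SELECT TRANSFORM (a, b, c) USING 'cat 1>&2' AS (a, b, c) FROM testData")
{code}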
[jira] [Commented] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output
[ https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594985#comment-14594985 ] Apache Spark commented on SPARK-8508: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6925 Test case SQLQuerySuite.test script transform for stderr generates super long output -- Key: SPARK-8508 URL: https://issues.apache.org/jira/browse/SPARK-8508 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor This test case writes 100,000 lines of integer triples to stderr, and makes Jenkins build output unnecessarily large and hard to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output
[ https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8508: --- Assignee: Cheng Lian (was: Apache Spark) Test case SQLQuerySuite.test script transform for stderr generates super long output -- Key: SPARK-8508 URL: https://issues.apache.org/jira/browse/SPARK-8508 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor This test case writes 100,000 lines of integer triples to stderr, and makes Jenkins build output unnecessarily large and hard to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8508) Test case SQLQuerySuite.test script transform for stderr generates super long output
[ https://issues.apache.org/jira/browse/SPARK-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8508: --- Assignee: Apache Spark (was: Cheng Lian) Test case SQLQuerySuite.test script transform for stderr generates super long output -- Key: SPARK-8508 URL: https://issues.apache.org/jira/browse/SPARK-8508 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Minor This test case writes 100,000 lines of integer triples to stderr, and makes Jenkins build output unnecessarily large and hard to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8509) Failed to JOIN in pyspark
afancy created SPARK-8509: - Summary: Failed to JOIN in pyspark Key: SPARK-8509 URL: https://issues.apache.org/jira/browse/SPARK-8509 Project: Spark Issue Type: Bug Reporter: afancy Hi, I am writing a pyspark streaming program. I have a training data set to compute the regression model, and I want to use the stream data set to test the model. So I join the RDD with the stream RDD, but I got the exception below. Following are my source code and the exception I got. Any help is appreciated. Thanks. Regards, Afancy
{code}
from __future__ import print_function
import sys, os, datetime
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.context import SQLContext
from pyspark.resultiterable import ResultIterable
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
import numpy as np
import statsmodels.api as sm

def splitLine(line, delimiter='|'):
    values = line.split(delimiter)
    st = datetime.datetime.strptime(values[1], '%Y-%m-%d %H:%M:%S')
    return (values[0], st.hour), values[2:]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

def train(line):
    y, x = [], []
    y, x = [], [[], [], [], [], [], []]
    reading_tmp, temp_tmp = [], []
    i = 0
    for reading, temperature in line[1]:
        if i % 4 == 0 and len(reading_tmp) == 4:
            y.append(reading_tmp.pop())
            x[0].append(reading_tmp.pop())
            x[1].append(reading_tmp.pop())
            x[2].append(reading_tmp.pop())
            temp = float(temp_tmp[0])
            del temp_tmp[:]
            x[3].append(temp-20.0 if temp > 20.0 else 0.0)
            x[4].append(16.0-temp if temp < 16.0 else 0.0)
            x[5].append(5.0-temp if temp < 5.0 else 0.0)
        reading_tmp.append(float(reading))
        temp_tmp.append(float(temperature))
        i = i + 1
    return str(line[0]), reg_m(y, x).params.tolist()

if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: regression.py checkpointDir trainingDataDir streamDataDir", file=sys.stderr)
        exit(-1)
    checkpoint, trainingInput, streamInput = sys.argv[1:]
    sc = SparkContext("local[2]", appName="BenchmarkSparkStreaming")
    trainingLines = sc.textFile(trainingInput)
    modelRDD = trainingLines.map(lambda line: splitLine(line, "|"))\
                            .groupByKey()\
                            .map(lambda line: train(line))\
                            .cache()

    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(checkpoint)
    lines = ssc.textFileStream(streamInput).map(lambda line: splitLine(line, "|"))
    testRDD = lines.groupByKeyAndWindow(4, 2)\
                   .map(lambda line: (str(line[0]), line[1]))\
                   .transform(lambda rdd: rdd.leftOuterJoin(modelRDD))
    testRDD.pprint(20)
    ssc.start()
    ssc.awaitTermination()
{code}
{code}
15/06/18 12:25:37 INFO FileInputDStream: Duration for remembering RDDs set to 6 ms for org.apache.spark.streaming.dstream.FileInputDStream@15b81ee6
Traceback (most recent call last):
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/util.py", line 90, in dumps
    return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 427, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 622, in dumps
    cp.dump(obj)
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 193, in save_function
    self.save_function_tuple(obj)
  File "/opt/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
{code}