[jira] [Commented] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType
[ https://issues.apache.org/jira/browse/SPARK-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518924#comment-14518924 ] Apache Spark commented on SPARK-7229: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/5772 SpecificMutableRow should take integer type as internal representation for DateType --- Key: SPARK-7229 URL: https://issues.apache.org/jira/browse/SPARK-7229 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao
{code}
test("test DATE types in cache") {
  val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
  TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").cache().registerTempTable("mycached_date")
  val cachedRows = sql("select * from mycached_date").collect()
  assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
  assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
}
{code}
{panel}
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
 at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
 at org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
 at org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
 at org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
 at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
 at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
 at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
 at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
 at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
 at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
{panel}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
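To make the proposed change concrete, here is a minimal, self-contained Scala sketch; the type names are simplified stand-ins for Catalyst's internal MutableValue hierarchy (not the real API), and the DateType-to-Int mapping is the gist of the proposal:
{code}
// Simplified stand-ins for Catalyst's internal mutable-value classes.
sealed trait MutableValue
final class MutableInt extends MutableValue { var value: Int = 0 }
final class MutableAny extends MutableValue { var value: Any = null }

sealed trait DataType
case object IntegerType extends DataType
case object DateType extends DataType
case object StringType extends DataType

// The gist of the fix: back DateType fields with MutableInt (days since
// epoch) instead of MutableAny, so columnar code can call getInt safely.
def mutableValueFor(dt: DataType): MutableValue = dt match {
  case IntegerType | DateType => new MutableInt
  case _                      => new MutableAny
}
{code}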
[jira] [Assigned] (SPARK-7222) Added mathematical derivation in comment to LinearRegression with ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7222: --- Assignee: Apache Spark Added mathematical derivation in comment to LinearRegression with ElasticNet Key: SPARK-7222 URL: https://issues.apache.org/jira/browse/SPARK-7222 Project: Spark Issue Type: Documentation Components: ML Reporter: DB Tsai Assignee: Apache Spark Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code. TODO: Add a test that fails when the correction terms are not correctly computed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7159) Support multiclass logistic regression in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518952#comment-14518952 ] Selim Namsi commented on SPARK-7159: I'll work on it. Support multiclass logistic regression in spark.ml -- Key: SPARK-7159 URL: https://issues.apache.org/jira/browse/SPARK-7159 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley This should be implemented by checking the input DataFrame's label column for feature metadata specifying the number of classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7222) Added mathematical derivation in comment to LinearRegression with ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518769#comment-14518769 ] Apache Spark commented on SPARK-7222: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/5767 Added mathematical derivation in comment to LinearRegression with ElasticNet Key: SPARK-7222 URL: https://issues.apache.org/jira/browse/SPARK-7222 Project: Spark Issue Type: Documentation Components: ML Reporter: DB Tsai Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code. TODO: Add a test that fails when the correction terms are not correctly computed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6824: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-7228 Fill the docs for DataFrame API in SparkR - Key: SPARK-6824 URL: https://issues.apache.org/jira/browse/SPARK-6824 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7225) CombineLimits in Optimizer do not works
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7225: -- Summary: CombineLimits in Optimizer do not works (was: CombineLimits do not works) CombineLimits in Optimizer do not works --- Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
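For reference, a toy Scala sketch of the rewrite that a CombineLimits rule is expected to perform (a simplified plan ADT, not Catalyst's actual rule or API): adjacent Limit nodes collapse to the smaller limit.
{code}
sealed trait Plan
case class Limit(n: Int, child: Plan) extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan
case class Relation(name: String) extends Plan

// Collapse nested limits top-down, keeping the smaller of the two.
def combineLimits(p: Plan): Plan = p match {
  case Limit(outer, Limit(inner, child)) => combineLimits(Limit(math.min(outer, inner), child))
  case Limit(n, child)                   => Limit(n, combineLimits(child))
  case Project(cols, child)              => Project(cols, combineLimits(child))
  case r: Relation                       => r
}

// combineLimits(Limit(10, Limit(100, Project(Seq("key"), Relation("src")))))
// gives Limit(10, Project(Seq("key"), Relation("src")))
{code}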
[jira] [Updated] (SPARK-6815) Support accumulators in R
[ https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6815: - Target Version/s: 1.5.0 (was: 1.4.0) Support accumulators in R - Key: SPARK-6815 URL: https://issues.apache.org/jira/browse/SPARK-6815 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor SparkR doesn't support accumulators right now. It might be good to add support for this to get feature parity with PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType
Cheng Hao created SPARK-7229: Summary: SpecificMutableRow should take integer type as internal representation for DateType Key: SPARK-7229 URL: https://issues.apache.org/jira/browse/SPARK-7229 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao
{code}
test("test DATE types in cache") {
  val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
  TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").cache().registerTempTable("mycached_date")
  val cachedRows = sql("select * from mycached_date").collect()
  assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
  assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
}
{code}
{panel}
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
 at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
 at org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
 at org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
 at org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
 at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
 at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
 at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
 at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
 at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
 at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
{panel}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7225) CombineLimits do not works
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7225: -- Description: The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. CombineLimits do not works -- Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7223) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask
[ https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7223: --- Assignee: Apache Spark (was: Reynold Xin) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask - Key: SPARK-7223 URL: https://issues.apache.org/jira/browse/SPARK-7223 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Apache Spark Current naming is too confusing between askWithReply and sendWithReply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7202: --- Assignee: Apache Spark Add SparseMatrixPickler to SerDe Key: SPARK-7202 URL: https://issues.apache.org/jira/browse/SPARK-7202 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Apache Spark We need a SparseMatrixPickler similar to the DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7157: --- Summary: Add approximate stratified sampling to DataFrame (was: Add sampleByKey, sampleByKeyExact methods to DataFrame) Add approximate stratified sampling to DataFrame Key: SPARK-7157 URL: https://issues.apache.org/jira/browse/SPARK-7157 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Joseph K. Bradley Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518927#comment-14518927 ] Apache Spark commented on SPARK-6752: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/5773 Allow StreamingContext to be recreated from checkpoint and existing SparkContext Key: SPARK-6752 URL: https://issues.apache.org/jira/browse/SPARK-6752 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.1, 1.2.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 Currently, if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevents a StreamingContext from being recreated from checkpoints in managed environments where the SparkContext is pre-created. Proposed solution: introduce the following methods on StreamingContext:
1. {{new StreamingContext(checkpointDirectory, sparkContext)}} - Recreate the StreamingContext from the checkpoint using the provided SparkContext
2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}} - Recreate the StreamingContext from the checkpoint using the provided SparkContext and hadoop conf to read the checkpoint
3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)}} - If the checkpoint file exists, recreate the StreamingContext using the provided SparkContext (that is, call 1.), else create the StreamingContext using the provided createFunction
The corresponding Java and Python APIs have to be added as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
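For illustration, a hedged Scala sketch of how the proposed variant 3 would be used from a managed environment. The {{getOrCreate}} shape is taken from the ticket text above, not from a released Spark API, so this only compiles against a version that implements the proposal:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// In a managed environment the SparkContext already exists.
val sc = new SparkContext(new SparkConf().setAppName("managed-app"))

// Builds a fresh StreamingContext when no checkpoint is present.
def createStreamingContext(sc: SparkContext): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint("/checkpoints/managed-app")
  ssc
}

// Proposed API (variant 3 above): recover from the checkpoint if it exists,
// otherwise call the supplied factory, reusing `sc` either way.
val ssc = StreamingContext.getOrCreate(
  "/checkpoints/managed-app", sc, createStreamingContext _)
{code}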
[jira] [Updated] (SPARK-6825) Data sources implementation to support `sequenceFile`
[ https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6825: - Target Version/s: 1.5.0 (was: 1.4.0) Data sources implementation to support `sequenceFile` - Key: SPARK-6825 URL: https://issues.apache.org/jira/browse/SPARK-6825 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman SequenceFiles are a widely used input format and right now they are not supported in SparkR. It would be good to add support for SequenceFiles by implementing a new data source that can create a DataFrame from a SequenceFile. However as SequenceFiles can have arbitrary types, we probably need to map them to User-defined types in SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7133: --- Assignee: Wenchen Fan Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python -- Key: SPARK-7133 URL: https://issues.apache.org/jira/browse/SPARK-7133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Labels: starter Typing {code} df.col[1] {code} and {code} df.col['field'] {code} is so much easier than {code} df.col.getField('field') df.col.getItem(1) {code} This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
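A minimal self-contained Scala sketch of what the proposal amounts to on the Scala side; this is a toy Column class, not Spark's actual implementation, and the getField/getItem bodies are placeholders:
{code}
// Toy Column: apply() forwards to the existing accessor methods, giving the
// df.col(1) / df.col("field") sugar the ticket asks for.
class Column(val expr: String) {
  def getItem(ordinal: Int): Column  = new Column(s"$expr[$ordinal]")
  def getField(name: String): Column = new Column(s"$expr.$name")
  def apply(ordinal: Int): Column    = getItem(ordinal)
  def apply(field: String): Column   = getField(field)
}

// Usage: col("field") and col(1) instead of getField/getItem.
val col = new Column("col")
assert(col("field").expr == "col.field")
assert(col(1).expr == "col[1]")
{code}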
[jira] [Updated] (SPARK-7225) CombineLimits in Optimizer does not works
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7225: -- Summary: CombineLimits in Optimizer does not works (was: CombineLimits in Optimizer do not works) CombineLimits in Optimizer does not works - Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7223) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask
Reynold Xin created SPARK-7223: -- Summary: Rename RPC askWithReply -> askWithReply, sendWithReply -> ask Key: SPARK-7223 URL: https://issues.apache.org/jira/browse/SPARK-7223 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Current naming is too confusing between askWithReply and sendWithReply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7225) CombineLimits optimizer does not work
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518813#comment-14518813 ] Apache Spark commented on SPARK-7225: - User 'DoingDone9' has created a pull request for this issue: https://github.com/apache/spark/pull/5770 CombineLimits optimizer does not work - Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3808) PySpark fails to start in Windows
[ https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518795#comment-14518795 ] eminent commented on SPARK-3808: Yes, that's the cause. After updating the %PATH%, Spark launched successfully. Thanks so much for your help! PySpark fails to start in Windows - Key: SPARK-3808 URL: https://issues.apache.org/jira/browse/SPARK-3808 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 1.2.0 Environment: Windows Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Blocker Fix For: 1.2.0 When we execute bin\pyspark.cmd in Windows, it fails to start. We get the following messages.
{noformat}
C:\bin\pyspark.cmd
Running C:\\python.exe with PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python;
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
"=x" was unexpected at this time.
Traceback (most recent call last):
  File "C:\\bin\..\python\pyspark\shell.py", line 45, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "C:\\python\pyspark\context.py", line 103, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway
    raise Exception(error_msg)
Exception: Launching GatewayServer failed with exit code 255!
Warning: Expected GatewayServer to output a port, but found no output.
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518925#comment-14518925 ] Kalle Jepsen commented on SPARK-7035: - I've created a PR to fix the error message in https://github.com/apache/spark/pull/5771. I didn't deem it necessary to open a JIRA for a minor change like this and hope that was the right thing to do. Drop __getattr__ on pyspark.sql.DataFrame - Key: SPARK-7035 URL: https://issues.apache.org/jira/browse/SPARK-7035 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.4.0 Reporter: Kalle Jepsen I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed. There is no point in having the possibility to address the DataFrame's columns as {{df.column}}, other than the questionable goal of pleasing R developers. And it seems R people will be able to use Spark from their native API in the future. I see the following problems with {{\_\_getattr\_\_}} for column selection:
* It's un-pythonic: there should only be one obvious way to solve a problem, and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} method, which in my opinion is by far superior and a lot more intuitive.
* It leads to confusing exceptions: when we mistype a method name, the {{AttributeError}} will say 'No such column ... '.
* And most importantly: we cannot load DataFrames that have columns with the same name as any attribute on the DataFrame object. Imagine having a DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} will be ambiguous and lead to broken code.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6816: - Target Version/s: 1.5.0 (was: 1.4.0) Add SparkConf API to configure SparkR - Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7223) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask
[ https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7223: --- Assignee: Reynold Xin (was: Apache Spark) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask - Key: SPARK-7223 URL: https://issues.apache.org/jira/browse/SPARK-7223 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Current naming is too confusing between askWithReply and sendWithReply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7232) Add a Substitution batch for spark sql analyzer
[ https://issues.apache.org/jira/browse/SPARK-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7232: --- Assignee: (was: Apache Spark) Add a Substitution batch for spark sql analyzer --- Key: SPARK-7232 URL: https://issues.apache.org/jira/browse/SPARK-7232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Added a new batch named `Substitution` before the Resolution batch. The motivation is that there are cases where we want to do some substitution on the parsed logical plan before resolving it. Consider these two cases:
1. CTE: for a CTE we first build a raw logical plan:
'With Map(q1 -> 'Subquery q1
            'Project ['key]
             'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None
In the `With` logical plan there is a map storing (q1 -> subquery); we first want to take off the With node and substitute the q1 UnresolvedRelation with the subquery.
2. Another example is window functions: a user may define some named windows, and we also need to substitute the window name in the child with the concrete window definition. This should also be done in the Substitution batch.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
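To make case 1 concrete, here is a toy, self-contained Scala sketch of a CTE substitution pass over a simplified plan ADT; the names echo the ticket, but this is not Catalyst's actual rule or API:
{code}
sealed trait LogicalPlan
case class With(cteRelations: Map[String, LogicalPlan], child: LogicalPlan) extends LogicalPlan
case class UnresolvedRelation(name: String) extends LogicalPlan
case class Project(cols: Seq[String], child: LogicalPlan) extends LogicalPlan
case class Filter(cond: String, child: LogicalPlan) extends LogicalPlan

// Take off the With node and replace each UnresolvedRelation that names a
// CTE with its subquery, before any resolution runs.
def substituteCTE(plan: LogicalPlan, ctes: Map[String, LogicalPlan] = Map.empty): LogicalPlan =
  plan match {
    case With(relations, child)   => substituteCTE(child, ctes ++ relations)
    case UnresolvedRelation(name) => ctes.getOrElse(name, plan)
    case Project(cols, child)     => Project(cols, substituteCTE(child, ctes))
    case Filter(cond, child)      => Filter(cond, substituteCTE(child, ctes))
  }
{code}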
[jira] [Created] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception
Chen Song created SPARK-7234: Summary: When codegen on DateType defaultPrimitive will throw type mismatch exception Key: SPARK-7234 URL: https://issues.apache.org/jira/browse/SPARK-7234 Project: Spark Issue Type: Bug Components: SQL Reporter: Chen Song When codegen is on, the defaultPrimitive of DateType is null. This raises the error below for {{select COUNT(a) from table}} where column a is a DateType:
type mismatch;
 found   : Null(null)
 required: DateType.this.InternalType
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
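As an illustration of the fix direction, a toy Scala sketch with simplified stand-in types (not Catalyst's codegen API): codegen needs a typed default literal for each type, and DateType, being internally an Int, should default to 0 rather than null:
{code}
sealed trait DataType
case object IntegerType extends DataType
case object DateType extends DataType
case object StringType extends DataType

// Emit the default literal used by generated code. DateType is represented
// as an Int (days since epoch), so "null" is a type error there.
def defaultPrimitive(dt: DataType): String = dt match {
  case IntegerType => "0"
  case DateType    => "0" // the gist of the fix: an Int default, not null
  case StringType  => "null"
}
{code}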
[jira] [Assigned] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception
[ https://issues.apache.org/jira/browse/SPARK-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7234: --- Assignee: (was: Apache Spark) When codegen on DateType defaultPrimitive will throw type mismatch exception Key: SPARK-7234 URL: https://issues.apache.org/jira/browse/SPARK-7234 Project: Spark Issue Type: Bug Components: SQL Reporter: Chen Song When codegen is on, the defaultPrimitive of DateType is null. This raises the error below for {{select COUNT(a) from table}} where column a is a DateType:
type mismatch;
 found   : Null(null)
 required: DateType.this.InternalType
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception
[ https://issues.apache.org/jira/browse/SPARK-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519176#comment-14519176 ] Apache Spark commented on SPARK-7234: - User 'kaka1992' has created a pull request for this issue: https://github.com/apache/spark/pull/5778 When codegen on DateType defaultPrimitive will throw type mismatch exception Key: SPARK-7234 URL: https://issues.apache.org/jira/browse/SPARK-7234 Project: Spark Issue Type: Bug Components: SQL Reporter: Chen Song When codegen is on, the defaultPrimitive of DateType is null. This raises the error below for {{select COUNT(a) from table}} where column a is a DateType:
type mismatch;
 found   : Null(null)
 required: DateType.this.InternalType
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception
[ https://issues.apache.org/jira/browse/SPARK-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7234: --- Assignee: Apache Spark When codegen on DateType defaultPrimitive will throw type mismatch exception Key: SPARK-7234 URL: https://issues.apache.org/jira/browse/SPARK-7234 Project: Spark Issue Type: Bug Components: SQL Reporter: Chen Song Assignee: Apache Spark When codegen is on, the defaultPrimitive of DateType is null. This raises the error below for {{select COUNT(a) from table}} where column a is a DateType:
type mismatch;
 found   : Null(null)
 required: DateType.this.InternalType
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7233) ClosureCleaner#clean blocks concurrent job submitter threads
Oleksii Kostyliev created SPARK-7233: Summary: ClosureCleaner#clean blocks concurrent job submitter threads Key: SPARK-7233 URL: https://issues.apache.org/jira/browse/SPARK-7233 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1, 1.4.0 Reporter: Oleksii Kostyliev The {{org.apache.spark.util.ClosureCleaner#clean}} method contains logic to determine whether Spark is run in interpreter mode: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L120 While this behavior is indeed valuable in particular situations, it also causes concurrent submitter threads to be blocked on a native call to {{java.lang.Class#forName0}}, since apparently only one thread at a time can make the call. This becomes a major issue when you have multiple threads concurrently submitting short-lived jobs. This is one of the patterns in which we use Spark in production, and the number of parallel requests is expected to be quite high, up to a couple of thousand at a time. A typical stacktrace of a blocked thread looks like:
{code}
http-bio-8091-exec-14 [BLOCKED] [DAEMON]
java.lang.Class.forName0(String, boolean, ClassLoader, Class) Class.java (native)
java.lang.Class.forName(String) Class.java:260
org.apache.spark.util.ClosureCleaner$.clean(Object, boolean) ClosureCleaner.scala:122
org.apache.spark.SparkContext.clean(Object, boolean) SparkContext.scala:1623
org.apache.spark.rdd.RDD.reduce(Function2) RDD.scala:883
org.apache.spark.rdd.RDD.takeOrdered(int, Ordering) RDD.scala:1240
org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike, int, Comparator) JavaRDDLike.scala:586
org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(int, Comparator) JavaRDDLike.scala:46
...
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
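One possible mitigation, sketched below under the assumption that the interpreter check is side-effect free: compute it once in a lazy val rather than calling Class.forName on every clean() invocation. The REPL class name here is a hypothetical placeholder, and this is an illustration, not Spark's actual fix:
{code}
object ClosureCleanerSketch {
  // Probe for the REPL once; "org.apache.spark.repl.Main" is a placeholder
  // for whatever class the real check looks up.
  private lazy val inInterpreter: Boolean =
    try {
      Class.forName("org.apache.spark.repl.Main", false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
    }

  def clean(closure: AnyRef): Unit = {
    if (inInterpreter) {
      // REPL-specific handling of the enclosing interpreter objects.
    }
    // ... rest of the cleaning logic, now free of per-call Class.forName.
  }
}
{code}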
[jira] [Commented] (SPARK-6913) No suitable driver found loading JDBC dataframe using driver added through SparkContext.addJar
[ https://issues.apache.org/jira/browse/SPARK-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519124#comment-14519124 ] Vyacheslav Baranov commented on SPARK-6913: --- The problem is in java.sql.DriverManager, which doesn't see drivers loaded by ClassLoaders other than the bootstrap ClassLoader. The solution would be to create a proxy driver, included in the Spark assembly, that forwards all requests to the wrapped driver. I have a working fix for this issue and am going to make a pull request soon. No suitable driver found loading JDBC dataframe using driver added through SparkContext.addJar --- Key: SPARK-6913 URL: https://issues.apache.org/jira/browse/SPARK-6913 Project: Spark Issue Type: Bug Components: SQL Reporter: Evan Yu
val sc = new SparkContext(conf)
sc.addJar("J:\mysql-connector-java-5.1.35.jar")
val df = sqlContext.jdbc("jdbc:mysql://localhost:3000/test_db?user=abc&password=123", "table1")
df.show()
Following error:
2015-04-14 17:04:39,541 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, dev1.test.dc2.com): java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3000/test_db?user=abc&password=123
 at java.sql.DriverManager.getConnection(DriverManager.java:689)
 at java.sql.DriverManager.getConnection(DriverManager.java:270)
 at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
 at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
 at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:317)
 at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
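To make the proxy-driver idea concrete, here is a hedged Scala sketch of such a wrapper; this is my own illustration, not the class from the upcoming pull request. The proxy is registered from a class visible to DriverManager and delegates every java.sql.Driver method to the driver loaded from the user's jar:
{code}
import java.sql.{Connection, Driver, DriverManager, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger

// Delegates every Driver call to a driver instantiated via the user-jar
// classloader, while the proxy itself is visible to DriverManager.
class DriverProxy(wrapped: Driver) extends Driver {
  override def connect(url: String, info: Properties): Connection = wrapped.connect(url, info)
  override def acceptsURL(url: String): Boolean = wrapped.acceptsURL(url)
  override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    wrapped.getPropertyInfo(url, info)
  override def getMajorVersion: Int = wrapped.getMajorVersion
  override def getMinorVersion: Int = wrapped.getMinorVersion
  override def jdbcCompliant(): Boolean = wrapped.jdbcCompliant()
  override def getParentLogger: Logger = wrapped.getParentLogger
}

// Example registration: load the real driver with the jar's classloader,
// then register the proxy so DriverManager can find it.
def registerProxy(driverClass: String, loader: ClassLoader): Unit = {
  val real = Class.forName(driverClass, true, loader).newInstance().asInstanceOf[Driver]
  DriverManager.registerDriver(new DriverProxy(real))
}
{code}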
[jira] [Commented] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC
[ https://issues.apache.org/jira/browse/SPARK-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519158#comment-14519158 ] Apache Spark commented on SPARK-7196: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5777 decimal precision lost when loading DataFrame from JDBC --- Key: SPARK-7196 URL: https://issues.apache.org/jira/browse/SPARK-7196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Ken Geis I have a decimal database field that is defined as 10.2 (i.e. ##.##). When I load it into Spark via sqlContext.jdbc(..), the type of the corresponding field in the DataFrame is DecimalType, with precisionInfo None. Because of that loss of precision information, SPARK-4176 is triggered when I try to .saveAsTable(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
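A small hedged Scala sketch of the fix direction, with simplified stand-in types rather than Spark's actual JDBC code path: read precision and scale from the JDBC metadata instead of discarding them when building the DecimalType:
{code}
import java.sql.ResultSetMetaData

// Simplified stand-in for Spark's DecimalType: Some((precision, scale))
// plays the role of precisionInfo.
case class DecimalType(precisionInfo: Option[(Int, Int)])

def decimalTypeFor(md: ResultSetMetaData, column: Int): DecimalType = {
  val precision = md.getPrecision(column)
  val scale = md.getScale(column)
  // Keep the metadata when the driver reports it; fall back to unlimited.
  if (precision > 0) DecimalType(Some((precision, scale)))
  else DecimalType(None)
}
{code}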
[jira] [Assigned] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC
[ https://issues.apache.org/jira/browse/SPARK-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7196: --- Assignee: Apache Spark decimal precision lost when loading DataFrame from JDBC --- Key: SPARK-7196 URL: https://issues.apache.org/jira/browse/SPARK-7196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Ken Geis Assignee: Apache Spark I have a decimal database field that is defined as 10.2 (i.e. ##.##). When I load it into Spark via sqlContext.jdbc(..), the type of the corresponding field in the DataFrame is DecimalType, with precisionInfo None. Because of that loss of precision information, SPARK-4176 is triggered when I try to .saveAsTable(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC
[ https://issues.apache.org/jira/browse/SPARK-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7196: --- Assignee: (was: Apache Spark) decimal precision lost when loading DataFrame from JDBC --- Key: SPARK-7196 URL: https://issues.apache.org/jira/browse/SPARK-7196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Ken Geis I have a decimal database field that is defined as 10.2 (i.e. ##.##). When I load it into Spark via sqlContext.jdbc(..), the type of the corresponding field in the DataFrame is DecimalType, with precisionInfo None. Because of that loss of precision information, SPARK-4176 is triggered when I try to .saveAsTable(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7233) ClosureCleaner#clean blocks concurrent job submitter threads
[ https://issues.apache.org/jira/browse/SPARK-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519169#comment-14519169 ] Oleksii Kostyliev commented on SPARK-7233: -- To illustrate the issue, I performed a test against local Spark. Attached is a screenshot of the Threads view in the YourKit profiler. The test was generating only 20 concurrent requests. As you can see, job submitter threads mainly spend their time being blocked by each other. ClosureCleaner#clean blocks concurrent job submitter threads Key: SPARK-7233 URL: https://issues.apache.org/jira/browse/SPARK-7233 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1, 1.4.0 Reporter: Oleksii Kostyliev Attachments: blocked_threads_closurecleaner.png The {{org.apache.spark.util.ClosureCleaner#clean}} method contains logic to determine whether Spark is run in interpreter mode: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L120 While this behavior is indeed valuable in particular situations, it also causes concurrent submitter threads to be blocked on a native call to {{java.lang.Class#forName0}}, since apparently only one thread at a time can make the call. This becomes a major issue when you have multiple threads concurrently submitting short-lived jobs. This is one of the patterns in which we use Spark in production, and the number of parallel requests is expected to be quite high, up to a couple of thousand at a time. A typical stacktrace of a blocked thread looks like:
{code}
http-bio-8091-exec-14 [BLOCKED] [DAEMON]
java.lang.Class.forName0(String, boolean, ClassLoader, Class) Class.java (native)
java.lang.Class.forName(String) Class.java:260
org.apache.spark.util.ClosureCleaner$.clean(Object, boolean) ClosureCleaner.scala:122
org.apache.spark.SparkContext.clean(Object, boolean) SparkContext.scala:1623
org.apache.spark.rdd.RDD.reduce(Function2) RDD.scala:883
org.apache.spark.rdd.RDD.takeOrdered(int, Ordering) RDD.scala:1240
org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike, int, Comparator) JavaRDDLike.scala:586
org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(int, Comparator) JavaRDDLike.scala:46
...
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7233) ClosureCleaner#clean blocks concurrent job submitter threads
[ https://issues.apache.org/jira/browse/SPARK-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleksii Kostyliev updated SPARK-7233: - Attachment: blocked_threads_closurecleaner.png ClosureCleaner#clean blocks concurrent job submitter threads Key: SPARK-7233 URL: https://issues.apache.org/jira/browse/SPARK-7233 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1, 1.4.0 Reporter: Oleksii Kostyliev Attachments: blocked_threads_closurecleaner.png The {{org.apache.spark.util.ClosureCleaner#clean}} method contains logic to determine whether Spark is run in interpreter mode: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L120 While this behavior is indeed valuable in particular situations, it also causes concurrent submitter threads to be blocked on a native call to {{java.lang.Class#forName0}}, since apparently only one thread at a time can make the call. This becomes a major issue when you have multiple threads concurrently submitting short-lived jobs. This is one of the patterns in which we use Spark in production, and the number of parallel requests is expected to be quite high, up to a couple of thousand at a time. A typical stacktrace of a blocked thread looks like:
{code}
http-bio-8091-exec-14 [BLOCKED] [DAEMON]
java.lang.Class.forName0(String, boolean, ClassLoader, Class) Class.java (native)
java.lang.Class.forName(String) Class.java:260
org.apache.spark.util.ClosureCleaner$.clean(Object, boolean) ClosureCleaner.scala:122
org.apache.spark.SparkContext.clean(Object, boolean) SparkContext.scala:1623
org.apache.spark.rdd.RDD.reduce(Function2) RDD.scala:883
org.apache.spark.rdd.RDD.takeOrdered(int, Ordering) RDD.scala:1240
org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike, int, Comparator) JavaRDDLike.scala:586
org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(int, Comparator) JavaRDDLike.scala:46
...
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7077) Binary processing hash table for aggregation
[ https://issues.apache.org/jira/browse/SPARK-7077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7077. Resolution: Fixed Fix Version/s: 1.4.0 Binary processing hash table for aggregation Key: SPARK-7077 URL: https://issues.apache.org/jira/browse/SPARK-7077 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Josh Rosen Fix For: 1.4.0 Let's start with a hash table for aggregations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6838) Explore using Reference Classes instead of S4 objects
[ https://issues.apache.org/jira/browse/SPARK-6838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6838: - Target Version/s: 1.5.0 (was: 1.4.0) Explore using Reference Classes instead of S4 objects - Key: SPARK-6838 URL: https://issues.apache.org/jira/browse/SPARK-6838 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor The current RDD and PipelinedRDD are represented as S4 objects. R has a new OO system: Reference Classes (RC or R5). It is a more message-passing-style OO system, and instances are mutable objects. It is not an important issue, and it should require only trivial work. It could also remove the kind-of awkward @ operator in S4. R6 is also worth checking out; it feels closer to an ordinary object-oriented language. https://github.com/wch/R6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7157) Add approximate stratified sampling to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7157: --- Description: def sampleBy(c Add approximate stratified sampling to DataFrame Key: SPARK-7157 URL: https://issues.apache.org/jira/browse/SPARK-7157 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Joseph K. Bradley Priority: Minor def sampleBy(c -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519021#comment-14519021 ] Apache Spark commented on SPARK-7202: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/5775 Add SparseMatrixPickler to SerDe Key: SPARK-7202 URL: https://issues.apache.org/jira/browse/SPARK-7202 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar We need a SparseMatrixPickler similar to the DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7227) Support fillna / dropna in R DataFrame
Reynold Xin created SPARK-7227: -- Summary: Support fillna / dropna in R DataFrame Key: SPARK-7227 URL: https://issues.apache.org/jira/browse/SPARK-7227 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7232) Add a Substitution batch for spark sql analyzer
Fei Wang created SPARK-7232: --- Summary: Add a Substitution batch for spark sql analyzer Key: SPARK-7232 URL: https://issues.apache.org/jira/browse/SPARK-7232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Added a new batch named `Substitution` before the Resolution batch. The motivation is that there are cases where we want to do some substitution on the parsed logical plan before resolving it. Consider these two cases:
1. CTE: for a CTE we first build a raw logical plan:
'With Map(q1 -> 'Subquery q1
            'Project ['key]
             'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None
In the `With` logical plan there is a map storing (q1 -> subquery); we first want to take off the With node and substitute the q1 UnresolvedRelation with the subquery.
2. Another example is window functions: a user may define some named windows, and we also need to substitute the window name in the child with the concrete window definition. This should also be done in the Substitution batch.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7225) CombineLimits optimizer does not work
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7225: --- Assignee: (was: Apache Spark) CombineLimits optimizer does not work - Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-7205. Resolution: Fixed Assignee: Burak Yavuz Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Critical Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7222) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7222: --- Description: Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code so the model is compressed for storage. We may try compressing based on prediction time later. TODO: Add a test that fails when the correction terms are not correctly computed. (was: Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code. TODO: Add a test that fails when the correction terms are not correctly computed.) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet - Key: SPARK-7222 URL: https://issues.apache.org/jira/browse/SPARK-7222 Project: Spark Issue Type: Improvement Components: ML Reporter: DB Tsai Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code so the model is compressed for storage. We may try compressing based on prediction time later. TODO: Add a test that fails when the correction terms are not correctly computed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
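For readers without the PR at hand, a hedged sketch of the kind of scaling derivation the description refers to; the exact form in the PR may differ:
{code}
% With feature means \bar{x}_j, standard deviations \sigma_j, and label
% statistics \bar{y}, \sigma_y, the standardized least-squares loss is
L(w) = \frac{1}{2n} \sum_{i=1}^{n}
       \Big( \sum_j w_j \frac{x_{ij} - \bar{x}_j}{\sigma_j}
             - \frac{y_i - \bar{y}}{\sigma_y} \Big)^2 .
% Expanding the inner sum separates a per-row part from a constant part:
\sum_j w_j \frac{x_{ij} - \bar{x}_j}{\sigma_j}
  = \sum_j \frac{w_j}{\sigma_j} x_{ij}
    - \underbrace{\sum_j \frac{w_j \bar{x}_j}{\sigma_j}}_{\text{correction term}} ,
% so an aggregator can accumulate \sum_j (w_j/\sigma_j) x_{ij} per row and
% apply the constant correction once per pass; these constants are the
% "correction terms" the TODO's test should exercise.
{code}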
[jira] [Updated] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7213: --- Component/s: YARN Exception while copying Hadoop config files due to permission issues Key: SPARK-7213 URL: https://issues.apache.org/jira/browse/SPARK-7213 Project: Spark Issue Type: Bug Components: YARN Reporter: Nishkam Ravi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518807#comment-14518807 ] Imran Rashid commented on SPARK-5945: - [~kayousterhout] can you please clarify -- did you want to just hardcode to 4, or did you want to reuse {{spark.task.maxFailures}} for stage failures as well? Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since it's doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and it will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like it's not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
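For illustration only, a toy Scala sketch of the bounded stage-retry idea under discussion; whether the bound should be hardcoded to 4 or reuse {{spark.task.maxFailures}} is exactly the open question above, and this is not Spark's scheduler code:
{code}
import scala.collection.mutable

// Hypothetical guard: count fetch-triggered stage retries and abort the job
// once a stage exceeds the bound.
val maxStageFailures = 4
val stageFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)

def onFetchFailed(stageId: Int): Unit = {
  stageFailures(stageId) += 1
  if (stageFailures(stageId) >= maxStageFailures)
    sys.error(s"Stage $stageId aborted after $maxStageFailures fetch failures")
}
{code}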
[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6803: - Target Version/s: 1.5.0 (was: 1.4.0) [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519005#comment-14519005 ] Nicolas PHUNG commented on SPARK-3601: -- For Avro's GenericData.Array, I use the following snippet from [Flink|https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/Serializers.java]:
{code}
// Avoid issue with avro array serialization https://issues.apache.org/jira/browse/FLINK-1391
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.serializers.CollectionSerializer;

import java.io.Serializable;
import java.util.Collection;

public class Serializers {

  /**
   * Special serializer for Java collections enforcing certain instance types.
   * Avro is serializing collections with a GenericData.Array type. Kryo is not able to handle
   * this type, so we use ArrayLists.
   */
  public static class SpecificInstanceCollectionSerializer<T extends java.util.ArrayList<?>>
      extends CollectionSerializer implements Serializable {

    private Class<T> type;

    public SpecificInstanceCollectionSerializer(Class<T> type) {
      this.type = type;
    }

    @Override
    protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
      return kryo.newInstance(this.type);
    }

    @Override
    protected Collection createCopy(Kryo kryo, Collection original) {
      return kryo.newInstance(this.type);
    }
  }
}
{code}
And I have registered it with Kryo using the following Scala code:
{code}
kryo.register(classOf[GenericData.Array[_]],
  new SpecificInstanceCollectionSerializer(classOf[java.util.ArrayList[_]]))
{code}
I hope this helps. Kryo NPE for output operations on Avro complex Objects even after registering. -- Key: SPARK-3601 URL: https://issues.apache.org/jira/browse/SPARK-3601 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: local, standalone cluster Reporter: mohan gaddam The Kryo serializer works well when Avro objects hold simple data, but when the same Avro object has complex data (like unions/arrays), Kryo fails during output operations; map operations work fine. Note that I have registered all the Avro-generated classes with Kryo. I'm using Java as the programming language. 
When a complex message is used, it throws an NPE; the stack trace is as follows: ==
ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
value (xyz.Datum)
data (xyz.ResMsg)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
In the above exception, Datum and ResMsg are project-specific classes generated by Avro using the below avdl snippet: ==
record KeyValueObject {
  union{boolean, int, long, float, double, bytes, string} name;
  union {boolean, int, long, float, double, bytes, string, array<union{boolean, int, long, float, double, bytes, string, KeyValueObject}>,
[jira] [Updated] (SPARK-6833) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
[ https://issues.apache.org/jira/browse/SPARK-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6833: - Target Version/s: 1.5.0 (was: 1.4.0) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run. --- Key: SPARK-6833 URL: https://issues.apache.org/jira/browse/SPARK-6833 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Similar to how extra Python files or packages can be specified (in zip / egg formats), it would be good to support the ability to add extra R files to the executors' working directory. One thing that needs to be investigated is whether this will just work out of the box using the spark-submit flag {{--files}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6813: - Target Version/s: 1.5.0 (was: 1.4.0) SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style-checking tool; https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of its rules following discussion in Spark: 1. Line length: maximum 100 characters 2. No limit on function name length (the API should be similar to other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7222) Added mathematical derivation in comment to LinearRegression with ElasticNet
DB Tsai created SPARK-7222: -- Summary: Added mathematical derivation in comment to LinearRegression with ElasticNet Key: SPARK-7222 URL: https://issues.apache.org/jira/browse/SPARK-7222 Project: Spark Issue Type: Documentation Components: ML Reporter: DB Tsai Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code. TODO: Add test that fail the test when the correction terms are not correctly computed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7225) CombineLimits optimizer does not work
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7225: --- Assignee: Apache Spark CombineLimits optimizer does not work - Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei Assignee: Apache Spark The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
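For reference, a limit-combining rule in the Catalyst style collapses two stacked limits into a single limit of the smaller value. The sketch below assumes Catalyst-like {{Limit}}, {{If}}, and {{LessThan}} nodes and the {{Rule}}/{{transform}} API; it illustrates the intended rewrite and is not Spark's actual rule.
{code}
import org.apache.spark.sql.catalyst.expressions.{If, LessThan}
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object CombineLimits extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Limit(outer, Limit(inner, child)) =>
      // Two stacked limits keep at most min(inner, outer) rows.
      Limit(If(LessThan(inner, outer), inner, outer), child)
  }
}
{code}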
[jira] [Resolved] (SPARK-7080) Binary processing based aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7080. Resolution: Fixed Fix Version/s: 1.4.0 Binary processing based aggregate operator -- Key: SPARK-7080 URL: https://issues.apache.org/jira/browse/SPARK-7080 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Josh Rosen Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6809: - Priority: Major (was: Critical) Make numPartitions optional in pairRDD APIs --- Key: SPARK-6809 URL: https://issues.apache.org/jira/browse/SPARK-6809 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7224) Mock repositories for testing with --packages
Burak Yavuz created SPARK-7224: -- Summary: Mock repositories for testing with --packages Key: SPARK-7224 URL: https://issues.apache.org/jira/browse/SPARK-7224 Project: Spark Issue Type: Test Components: Spark Submit Reporter: Burak Yavuz Priority: Critical Create mock repositories (folders with jars and poms in Maven format) for testing --packages without the need for an internet connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6820: - Priority: Critical (was: Major) Convert NAs to null type in SparkR DataFrames - Key: SPARK-6820 URL: https://issues.apache.org/jira/browse/SPARK-6820 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman Priority: Critical While converting RDD or local R DataFrame to a SparkR DataFrame we need to handle missing values or NAs. We should convert NAs to SparkSQL's null type to handle the conversion correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6799) Add dataframe examples for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6799: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-7228 Add dataframe examples for SparkR - Key: SPARK-6799 URL: https://issues.apache.org/jira/browse/SPARK-6799 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should add more data frame usage examples for SparkR . This can be similar to the python examples at https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7202: --- Assignee: (was: Apache Spark) Add SparseMatrixPickler to SerDe Key: SPARK-7202 URL: https://issues.apache.org/jira/browse/SPARK-7202 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar We need a SparseMatrixPickler similar to the existing DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs
[ https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6809: - Target Version/s: 1.5.0 (was: 1.4.0) Make numPartitions optional in pairRDD APIs --- Key: SPARK-6809 URL: https://issues.apache.org/jira/browse/SPARK-6809 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7225) CombineLimits do not works
Zhongshuai Pei created SPARK-7225: - Summary: CombineLimits do not works Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6826) `hashCode` support for arbitrary R objects
[ https://issues.apache.org/jira/browse/SPARK-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6826: - Target Version/s: 1.5.0 (was: 1.4.0) `hashCode` support for arbitrary R objects -- Key: SPARK-6826 URL: https://issues.apache.org/jira/browse/SPARK-6826 Project: Spark Issue Type: Bug Components: SparkR Reporter: Shivaram Venkataraman From the SparkR JIRA: digest::digest looks interesting, but it seems to be more heavyweight than our requirements call for. One relatively easy way to do this is to serialize the given R object into a string (serialize(object, ascii=T)) and then just call the string hashCode function on this. FWIW it looks like digest follows a similar strategy where the md5sum / shasum etc. are calculated on serialized objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
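To make the serialize-then-hash idea concrete, here is a rough Scala transliteration; the helper name and the Base64 step are illustrative assumptions (SparkR itself would do this in R via serialize(object, ascii=T)).
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.Base64

def serializedHashCode(obj: java.io.Serializable): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  // Reuse Java's String.hashCode on an ASCII rendering of the serialized bytes.
  Base64.getEncoder.encodeToString(bytes.toByteArray).hashCode
}
{code}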
[jira] [Commented] (SPARK-7032) SparkSQL incorrect results when using UNION/EXCEPT with GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518944#comment-14518944 ] Reynold Xin commented on SPARK-7032: cc [~cloud_fan] would you have time to take a look at this? SparkSQL incorrect results when using UNION/EXCEPT with GROUP BY clause --- Key: SPARK-7032 URL: https://issues.apache.org/jira/browse/SPARK-7032 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.2, 1.3.1 Reporter: Lior Chaga When using a UNION/EXCEPT clause with a GROUP BY clause in Spark SQL, the results do not match what is expected. In the following example, only one record should be in the first table and not in the second (since, when grouping by the key field, the counter for key=1 is 10 in both tables). Each of the clauses works properly when run on its own.
{code}
//import com.addthis.metrics.reporter.config.ReporterConfig;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.Row;

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class SimpleApp {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("Simple Application")
                .setMaster("local[1]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<MyObject> firstList = new ArrayList<MyObject>(2);
        firstList.add(new MyObject(1, 10));
        firstList.add(new MyObject(2, 10));

        List<MyObject> secondList = new ArrayList<MyObject>(3);
        secondList.add(new MyObject(1, 4));
        secondList.add(new MyObject(1, 6));
        secondList.add(new MyObject(2, 8));

        JavaRDD<MyObject> firstRdd = sc.parallelize(firstList);
        JavaRDD<MyObject> secondRdd = sc.parallelize(secondList);

        JavaSQLContext sqlc = new JavaSQLContext(sc);
        sqlc.applySchema(firstRdd, MyObject.class).registerTempTable("table1");
        sqlc.sqlContext().cacheTable("table1");
        sqlc.applySchema(secondRdd, MyObject.class).registerTempTable("table2");
        sqlc.sqlContext().cacheTable("table2");

        List<Row> firstMinusSecond = sqlc.sql(
                "SELECT key, counter FROM table1 " +
                "EXCEPT " +
                "SELECT key, SUM(counter) FROM table2 " +
                "GROUP BY key").collect();
        System.out.println("num of rows in first but not in second = [" + firstMinusSecond.size() + "]");

        sc.close();
        System.exit(0);
    }

    public static class MyObject implements Serializable {
        public MyObject(Integer key, Integer counter) {
            this.key = key;
            this.counter = counter;
        }

        private Integer key;
        private Integer counter;

        public Integer getKey() { return key; }
        public void setKey(Integer key) { this.key = key; }
        public Integer getCounter() { return counter; }
        public void setCounter(Integer counter) { this.counter = counter; }
    }
}
{code}
Same example, give or take, with DataFrames: without groupBy it works fine; with groupBy I get 2 rows instead of 1:
{code}
SparkConf conf = new SparkConf().setAppName("Simple Application")
        .setMaster("local[1]");
JavaSparkContext sc = new JavaSparkContext(conf);

List<MyObject> firstList = new ArrayList<MyObject>(2);
firstList.add(new MyObject(1, 10));
firstList.add(new MyObject(2, 10));

List<MyObject> secondList = new ArrayList<MyObject>(3);
secondList.add(new MyObject(1, 10));
secondList.add(new MyObject(2, 8));

JavaRDD<MyObject> firstRdd = sc.parallelize(firstList);
JavaRDD<MyObject> secondRdd = sc.parallelize(secondList);

SQLContext sqlc = new SQLContext(sc);
DataFrame firstDataFrame = sqlc.createDataFrame(firstRdd, MyObject.class);
DataFrame secondDataFrame = sqlc.createDataFrame(secondRdd, MyObject.class);

Row[] collect = firstDataFrame.except(secondDataFrame).collect();
System.out.println("num of rows in first but not in second = [" + collect.length + "]");

DataFrame secondAggregated = secondDataFrame.groupBy("key").sum("counter");
Row[] collectAgg = firstDataFrame.except(secondAggregated).collect();
System.out.println("num of rows in first but not in second = [" + collectAgg.length + "]"); // should be 1 row, but there are 2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
Shivaram Venkataraman created SPARK-7230: Summary: Make RDD API private in SparkR for Spark 1.4 Key: SPARK-7230 URL: https://issues.apache.org/jira/browse/SPARK-7230 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This ticket proposes making the RDD API in SparkR private for the 1.4 release. The motivation for doing so is discussed in a larger design document aimed at a more top-down design of the SparkR APIs. A first cut that discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI The main points in that document that relate to this ticket are: - The RDD API requires knowledge of the distributed system and is pretty low level. This is not very suitable for a number of R users who are used to more high-level packages that work out of the box. - The RDD implementation in SparkR is not fully robust right now: we are missing features like spilling for aggregation, handling partitions which don't fit in memory etc. There are further limitations like lack of hashCode for non-native types etc. which might affect user experience. The only change we will make for now is to not export the RDD functions as public methods in the SparkR package, and I will create another ticket to discuss the public API for 1.5 in more detail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7224) Mock repositories for testing with --packages
[ https://issues.apache.org/jira/browse/SPARK-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7224: --- Assignee: Burak Yavuz Mock repositories for testing with --packages - Key: SPARK-7224 URL: https://issues.apache.org/jira/browse/SPARK-7224 Project: Spark Issue Type: Test Components: Spark Submit Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Critical Create mock repositories (folders with jars and poms in Maven format) for testing --packages without the need for an internet connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7076) Binary processing compact tuple representation
[ https://issues.apache.org/jira/browse/SPARK-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7076. Resolution: Fixed Fix Version/s: 1.4.0 Binary processing compact tuple representation -- Key: SPARK-7076 URL: https://issues.apache.org/jira/browse/SPARK-7076 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Josh Rosen Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7232) Add a Substitution batch for spark sql analyzer
[ https://issues.apache.org/jira/browse/SPARK-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519090#comment-14519090 ] Apache Spark commented on SPARK-7232: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/5776 Add a Substitution batch for spark sql analyzer --- Key: SPARK-7232 URL: https://issues.apache.org/jira/browse/SPARK-7232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Added a new batch named `Substitution` before the Resolution batch. The motivation is that there are kinds of cases where we want to do some substitution on the parsed logical plan before resolving it. Consider these two cases: 1. CTE: for a CTE we first build a raw logical plan
{quote}
'With Map(q1 -> 'Subquery q1
                 'Project ['key]
                  'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None
{quote}
The `With` logical plan here stores a map of (q1 -> subquery); we first want to take off the With node and substitute the UnresolvedRelation for q1 with the subquery. 2. Window functions: a user may define some named windows, and we also need to substitute the window name in the child plan with the concrete window definition. This should also be done in the Substitution batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
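A sketch of the CTE half of the proposed batch, written against Catalyst-style nodes; the shapes of {{With}} (a child plan plus a map of CTE names to subqueries) and {{UnresolvedRelation}} are assumptions for illustration, not the exact classes.
{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, With}
import org.apache.spark.sql.catalyst.rules.Rule

object CTESubstitution extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case With(child, cteRelations) =>
      // Strip the With node and splice each named subquery in for the
      // matching UnresolvedRelation, before resolution runs.
      child transform {
        case u @ UnresolvedRelation(Seq(name), _) =>
          cteRelations.getOrElse(name, u)
      }
  }
}
{code}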
[jira] [Created] (SPARK-7226) Support math functions in R DataFrame
Reynold Xin created SPARK-7226: -- Summary: Support math functions in R DataFrame Key: SPARK-7226 URL: https://issues.apache.org/jira/browse/SPARK-7226 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
[ https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7204. Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Call sites in UI are not accurate for DataFrame operations -- Key: SPARK-7204 URL: https://issues.apache.org/jira/browse/SPARK-7204 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Fix For: 1.3.2, 1.4.0 Spark core computes callsites by climbing up the stack until we reach the stack frame at the boundary of user code and spark code. The way we compute whether a given frame is internal (Spark) or user code does not work correctly with the new dataframe API. Once the scope work goes in, we'll have a nicer way to express units of operator scope, but until then there is a simple fix where we just make sure the SQL internal classes are also skipped as we climb up the stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7232) Add a Substitution batch for spark sql analyzer
[ https://issues.apache.org/jira/browse/SPARK-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7232: --- Assignee: Apache Spark Add a Substitution batch for spark sql analyzer --- Key: SPARK-7232 URL: https://issues.apache.org/jira/browse/SPARK-7232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Fei Wang Assignee: Apache Spark Added a new batch named `Substitution` before the Resolution batch. The motivation is that there are kinds of cases where we want to do some substitution on the parsed logical plan before resolving it. Consider these two cases: 1. CTE: for a CTE we first build a raw logical plan
{quote}
'With Map(q1 -> 'Subquery q1
                 'Project ['key]
                  'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None
{quote}
The `With` logical plan here stores a map of (q1 -> subquery); we first want to take off the With node and substitute the UnresolvedRelation for q1 with the subquery. 2. Window functions: a user may define some named windows, and we also need to substitute the window name in the child plan with the concrete window definition. This should also be done in the Substitution batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7223) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask
[ https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518778#comment-14518778 ] Apache Spark commented on SPARK-7223: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5768 Rename RPC askWithReply -> askWithReply, sendWithReply -> ask - Key: SPARK-7223 URL: https://issues.apache.org/jira/browse/SPARK-7223 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Current naming is too confusing between askWithReply and sendWithReply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType
[ https://issues.apache.org/jira/browse/SPARK-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7229: --- Assignee: Apache Spark SpecificMutableRow should take integer type as internal representation for DateType --- Key: SPARK-7229 URL: https://issues.apache.org/jira/browse/SPARK-7229 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Apache Spark
{code}
test("test DATE types in cache") {
  val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
  TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").cache().registerTempTable("mycached_date")
  val cachedRows = sql("select * from mycached_date").collect()
  assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
  assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
}
{code}
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
at org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
at org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
at org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5890) Add FeatureDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518862#comment-14518862 ] Xusen Yin commented on SPARK-5890: -- I am starting to work on it. Add FeatureDiscretizer -- Key: SPARK-5890 URL: https://issues.apache.org/jira/browse/SPARK-5890 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng A `FeatureDiscretizer` takes a column with continuous features and outputs a column with binned categorical features.
{code}
val fd = new FeatureDiscretizer()
  .setInputCol("age")
  .setNumBins(32)
  .setOutputCol("ageBins")
{code}
This should be an automatic feature discretizer, which uses a simple algorithm like approximate quantiles to discretize features. It should set the ML attribute correctly in the output column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
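As a rough illustration of the quantile idea, here is a toy split computation over a collected sample; a real FeatureDiscretizer would use approximate quantiles over the distributed column rather than an exact sort, and the function name is an assumption.
{code}
// Toy sketch: numBins - 1 interior split points from a sorted sample.
def quantileSplits(sample: Array[Double], numBins: Int): Array[Double] = {
  require(sample.nonEmpty && numBins > 1)
  val sorted = sample.sorted
  (1 until numBins)
    .map(i => sorted(math.min(i * sorted.length / numBins, sorted.length - 1)))
    .toArray
}
{code}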
[jira] [Updated] (SPARK-7222) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7222: --- Summary: Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet (was: Added mathematical derivation in comment to LinearRegression with ElasticNet) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet - Key: SPARK-7222 URL: https://issues.apache.org/jira/browse/SPARK-7222 Project: Spark Issue Type: Documentation Components: ML Reporter: DB Tsai Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code. TODO: Add test that fail the test when the correction terms are not correctly computed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7215. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Burak Yavuz Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Assignee: Burak Yavuz Priority: Critical Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6814) Support sorting for any data type in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6814: - Target Version/s: 1.5.0 (was: 1.4.0) Support sorting for any data type in SparkR --- Key: SPARK-6814 URL: https://issues.apache.org/jira/browse/SPARK-6814 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical I get various "return status == 0 is false" and "unimplemented type" errors trying to get data out of any rdd with top() or collect(). The errors are not consistent. I think spark is installed properly because some operations do work. I apologize if I'm missing something easy or not providing the right diagnostic info -- I'm new to SparkR, and this seems to be the only resource for SparkR issues. Some logs:
{code}
Browse[1]> top(estep.rdd, 1L)
Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
Execution halted
15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
Execution halted
at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, localhost): org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> <Anonymous> -> func -> FUN -> FUN -> order
Execution halted
edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType
[ https://issues.apache.org/jira/browse/SPARK-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7229: --- Assignee: (was: Apache Spark) SpecificMutableRow should take integer type as internal representation for DateType --- Key: SPARK-7229 URL: https://issues.apache.org/jira/browse/SPARK-7229 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao
{code}
test("test DATE types in cache") {
  val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
  TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").cache().registerTempTable("mycached_date")
  val cachedRows = sql("select * from mycached_date").collect()
  assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
  assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
}
{code}
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
at org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
at org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
at org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7225) CombineLimits do not works
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7225: -- Description: The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. was: The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not
CombineLimits do not works -- Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7228) SparkR public API for 1.4 release
Shivaram Venkataraman created SPARK-7228: Summary: SparkR public API for 1.4 release Key: SPARK-7228 URL: https://issues.apache.org/jira/browse/SPARK-7228 Project: Spark Issue Type: Umbrella Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This is an umbrella ticket to track the public APIs and documentation to be released as a part of SparkR in the 1.4 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication
[ https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6832: - Target Version/s: 1.5.0 (was: 1.4.0) Handle partial reads in SparkR JVM to worker communication -- Key: SPARK-6832 URL: https://issues.apache.org/jira/browse/SPARK-6832 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor After we move to using a socket between the R worker and the JVM, it's possible that readBin() in R will return partial results (for example, when interrupted by a signal). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7188) Support math functions in DataFrames in Python
[ https://issues.apache.org/jira/browse/SPARK-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7188. Resolution: Fixed Fix Version/s: 1.4.0 Support math functions in DataFrames in Python -- Key: SPARK-7188 URL: https://issues.apache.org/jira/browse/SPARK-7188 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Burak Yavuz Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7225) CombineLimits optimizer does not work
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7225: -- Summary: CombineLimits optimizer does not work (was: CombineLimits optimizer does not works) CombineLimits optimizer does not work - Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7225) CombineLimits optimizer does not works
[ https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-7225: -- Summary: CombineLimits optimizer does not works (was: CombineLimits in Optimizer does not works) CombineLimits optimizer does not works -- Key: SPARK-7225 URL: https://issues.apache.org/jira/browse/SPARK-7225 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei The optimized logical plan of {{select key from (select key from src limit 100) t2 limit 10}} looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}
It did not combine the limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7222) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7222: --- Issue Type: Improvement (was: Documentation) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet - Key: SPARK-7222 URL: https://issues.apache.org/jira/browse/SPARK-7222 Project: Spark Issue Type: Improvement Components: ML Reporter: DB Tsai Added detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Also refactored the code. TODO: Add test that fail the test when the correction terms are not correctly computed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7217) Add configuration to disable stopping of SparkContext when StreamingContext.stop()
[ https://issues.apache.org/jira/browse/SPARK-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519593#comment-14519593 ] Sean Owen commented on SPARK-7217: -- FWIW I'd expect the current behavior since things like {{InputStream.close()}} would always close the underlying stream, if one exists, in the JDK. I assume you're not proposing changing that. How about a new optional param to control whether to stop the underlying stream? Or make the implementation of SparkContext for a specific app un-stoppable instead? Add configuration to disable stopping of SparkContext when StreamingContext.stop() -- Key: SPARK-7217 URL: https://issues.apache.org/jira/browse/SPARK-7217 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side-effect. This JIRA is to add a configuration in SparkConf that sets the default StreamingContext stop behavior. It should be such that the existing behavior does not change for existing users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
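For concreteness, a per-call opt-out already exists on StreamingContext, and this JIRA proposes a conf key to change the default; the key name below is tentative, since it is only proposed here.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("notebook-app")
  .set("spark.streaming.stopSparkContextByDefault", "false")  // proposed key, not final

val ssc = new StreamingContext(conf, Seconds(1))
// ... set up streams ...
ssc.stop(stopSparkContext = false)  // leaves the shared SparkContext running
{code}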
[jira] [Updated] (SPARK-7236) AkkaUtils askWithReply sleeps indefinitely when a timeout exception is thrown
[ https://issues.apache.org/jira/browse/SPARK-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-7236: Attachment: SparkLongSleepAfterTimeout.scala Attaching some code to reproduce this issue. AkkaUtils askWithReply sleeps indefinitely when a timeout exception is thrown - Key: SPARK-7236 URL: https://issues.apache.org/jira/browse/SPARK-7236 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Bryan Cutler Priority: Trivial Labels: quickfix Attachments: SparkLongSleepAfterTimeout.scala When {{AkkaUtils.askWithReply}} gets a TimeoutException, the default parameters {{maxAttempts = 1}} and {{retryInterval = Int.Max}} lead to the thread sleeping for a good while. I noticed this issue when testing for SPARK-6980 and using this function without invoking Spark jobs, so perhaps it acts differently in another context. If this function is on its final attempt to ask and it fails, it should return immediately. Also, perhaps a better default {{retryInterval}} would be {{0}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
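A minimal sketch of the suggested fix, with illustrative names rather than AkkaUtils' actual code: sleep only when another attempt will follow, so the final failure propagates immediately.
{code}
def askWithRetry[T](attempt: () => T, maxAttempts: Int, retryIntervalMs: Long): T = {
  require(maxAttempts >= 1)
  var lastError: Throwable = null
  for (i <- 1 to maxAttempts) {
    try {
      return attempt()
    } catch {
      case e: Exception =>
        lastError = e
        // No sleep after the last attempt; rethrow right away instead.
        if (i < maxAttempts) Thread.sleep(retryIntervalMs)
    }
  }
  throw lastError
}
{code}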
[jira] [Commented] (SPARK-6989) Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519730#comment-14519730 ] Michael Allman commented on SPARK-6989: --- Thank you for looking into this. I've been away on vacation for the past week. I've set aside our Scala 2.11 deployment in favor of the 2.10 deployment. We'll probably try again with Spark 1.4. Cheers. Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors -- Key: SPARK-6989 URL: https://issues.apache.org/jira/browse/SPARK-6989 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Environment: Java 1.8.0_40 on Ubuntu 14.04.1 Reporter: Michael Allman Assignee: Prashant Sharma Attachments: spark_repl_2.11_errors.txt When starting the Spark 1.3 spark-shell compiled for Scala 2.11, I get a random assortment of compiler errors. I will attach a transcript. One thing I've noticed is that they seem to be less frequent when I increase the driver heap size to 5 GB or so. By comparison, the Spark 1.1 spark-shell on Scala 2.10 has been rock solid with a 512 MB heap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519650#comment-14519650 ] Kay Ousterhout commented on SPARK-5945: --- I wanted to hardcode to 4 (totally agree with the sentiment you expressed earlier in this thread, that it doesn't make sense / is very confusing to re-use a config parameter for two different things). Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin While investigating SPARK-5928, I noticed some very strange behavior in the way Spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job-killing mechanism. But by killing the stage instead, the max-retry logic is skipped (it looks to me like there is no limit on retries of a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that the stage will succeed if we retry it without that block manager. In that case, it wouldn't make any sense to retry the task, since it's doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and it will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like it's not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient: theoretically one stage could accumulate many retries because of failures in different stages further downstream, so we might need to track the cause of each retry as well to still get the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior with retries when trying to run long pipelines. E.g., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried, and the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. 
Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
{code}
val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  // need to make sure the array doesn't compress to something small
  scala.util.Random.nextBytes(arr)
  arr
}
rdd.map { x => (1, x) }.groupByKey().count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519600#comment-14519600 ] Sean Owen commented on SPARK-7189: -- I thought that was the point, but maybe I misunderstand: you have to err on the side of re-processing a file even if it doesn't look like it changed. Right? History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server will check every log file by its modification time. It will reload a file if the file's modification time is later than or equal to the latest modification time it has remembered, so it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
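Sketch of the comparison under discussion; the field and helper names are assumptions, not the history server's code. A strictly-greater check avoids re-reading a file whose timestamp merely equals the last one seen, which is exactly the trade-off raised above: it saves needless re-reads but could miss an update that lands within the filesystem's timestamp granularity.
{code}
import org.apache.hadoop.fs.FileStatus

var lastSeenModTime = 0L  // assumed field tracking the newest time processed

def maybeReplay(status: FileStatus): Unit = {
  if (status.getModificationTime > lastSeenModTime) {  // > rather than >=
    // replayLog(status)  // hypothetical replay hook
    lastSeenModTime = status.getModificationTime
  }
}
{code}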
[jira] [Resolved] (SPARK-7209) Adding new Manning book Spark in Action to the official Spark Webpage
[ https://issues.apache.org/jira/browse/SPARK-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7209. -- Resolution: Fixed Looks like Patrick just added it, yes. Adding new Manning book Spark in Action to the official Spark Webpage --- Key: SPARK-7209 URL: https://issues.apache.org/jira/browse/SPARK-7209 Project: Spark Issue Type: Task Components: Documentation Reporter: Aleksandar Dragosavljevic Priority: Minor Attachments: Spark in Action.jpg Original Estimate: 1h Remaining Estimate: 1h Manning Publications is developing a book, Spark in Action, written by Marko Bonaci and Petar Zecevic (http://www.manning.com/bonaci), and it would be great if the book could be added to the list of books at the official Spark Webpage (https://spark.apache.org/documentation.html). This book teaches readers to use Spark for stream and batch data processing. It starts with an introduction to the Spark architecture and ecosystem, followed by a taste of Spark's command line interface. Readers then discover the most fundamental concepts and abstractions of Spark, particularly Resilient Distributed Datasets (RDDs) and the basic data transformations that RDDs provide. The first part of the book also introduces you to writing Spark applications using the core APIs. Next, you learn about different Spark components: how to work with structured data using Spark SQL, how to process near-real-time data with Spark Streaming, how to apply machine learning algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped data using Spark GraphX, and get a clear introduction to Spark clustering. The book is already available to the public as a part of our Manning Early Access Program (MEAP), where we deliver chapters to the public as soon as they are written. We believe it will offer significant support to Spark users and the community. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7223) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask
[ https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7223. Resolution: Fixed Fix Version/s: 1.4.0 Rename RPC askWithReply -> askWithReply, sendWithReply -> ask - Key: SPARK-7223 URL: https://issues.apache.org/jira/browse/SPARK-7223 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 Current naming is too confusing between askWithReply and sendWithReply. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7194) Vectors factory method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7194: - Component/s: MLlib Priority: Minor (was: Major) Affects Version/s: 1.3.1 Go ahead and set priority and component, and maybe affects version, for improvements. You can write {{Vectors.dense(array).toSparse}} - that may be simpler still and doesn't need a new method? Or this could also be a little simpler with {{array.zipWithIndex.filter(_._1 != 0.0).map(_.swap)}} Vectors factory method for sparse vectors should accept the output of zipWithIndex -- Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: Juliet Hougland Priority: Minor Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to:
{code}
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] =
    array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
{code}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
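For concreteness, a runnable sketch of the two alternatives suggested in the comment above (the input array is a made-up example):
{code}
import org.apache.spark.mllib.linalg.Vectors

val array = Array(0.0, 0.0, 3.2, 0.0, 1.5)

// Alternative 1: build a dense vector, then convert.
val sv1 = Vectors.dense(array).toSparse

// Alternative 2: zipWithIndex plus .swap avoids the hand-written tuple flip.
val sv2 = Vectors.sparse(
  array.length,
  array.zipWithIndex.filter(_._1 != 0.0).map(_.swap))
{code}
Both produce a sparse vector with entries at indices 2 and 4, so neither strictly needs the proposed new factory method.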
[jira] [Closed] (SPARK-7238) Upgrade protobuf-java (com.google.protobuf) version from 2.4.1 to 2.5.0
[ https://issues.apache.org/jira/browse/SPARK-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Favio Vázquez closed SPARK-7238. Resolution: Won't Fix Upgrade protobuf-java (com.google.protobuf) version from 2.4.1 to 2.5.0 --- Key: SPARK-7238 URL: https://issues.apache.org/jira/browse/SPARK-7238 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 1.3.1 Environment: Ubuntu 14.04. Apache Mesos in cluster mode with HDFS from cloudera 2.5.0-cdh5.3.3. Reporter: Favio Vázquez Priority: Blocker Labels: 2.5.0-cdh5.3.3, CDH5, HDFS, Mesos Fix For: 1.3.1, 1.3.0 This upgrade is needed when building Spark for CDH5 2.5.0-cdh5.3.3 due to incompatibilities between the protobuf version used by com.google.protobuf and the one used in Hadoop. The default version of protobuf is set to 2.4.1 in the global properties, and this is stated in the pom.xml file:
{code}
<!-- In theory we need not directly depend on protobuf since Spark does not directly use it.
     However, when building with Hadoop/YARN 2.2 Maven doesn't correctly bump the protobuf
     version up from the one Mesos gives. For now we include this variable to explicitly bump
     the version when building with YARN. It would be nice to figure out why Maven can't
     resolve this correctly (like SBT does). -->
{code}
So this upgrade will only affect the com.google.protobuf version of java-protobuf. Tested for the Cloudera distribution 2.5.0-cdh5.3.3 using Mesos 0.22.0 in cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
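For anyone hitting the same mismatch, the version can usually be overridden at build time rather than by patching the pom. A sketch, assuming the {{protobuf.version}} and {{hadoop.version}} properties present in Spark's root pom of this era:
{code}
# Build against the CDH Hadoop above while bumping protobuf; the flags
# assume Spark's root pom defines hadoop.version and protobuf.version.
mvn -Dhadoop.version=2.5.0-cdh5.3.3 -Dprotobuf.version=2.5.0 -DskipTests clean package
{code}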
[jira] [Created] (SPARK-7248) Random number generators for DataFrames
Xiangrui Meng created SPARK-7248: Summary: Random number generators for DataFrames Key: SPARK-7248 URL: https://issues.apache.org/jira/browse/SPARK-7248 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Assignee: Burak Yavuz This is an umbrella JIRA for random number generators for DataFrames. The initial set of RNGs would be `rand` and `randn`, which take a seed.
{code}
df.select("*", rand(11L).as("rand"))
{code}
Where those methods should live is TBD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
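A short sketch of the proposed usage against an existing DataFrame {{df}} (where the functions should live was still TBD at filing time; they ultimately landed in {{org.apache.spark.sql.functions}}):
{code}
import org.apache.spark.sql.functions.{rand, randn}

// Append a uniform column and a standard-normal column, each with a fixed
// seed so the generated values are reproducible across runs.
val withUniform = df.select(df("*"), rand(11L).as("rand"))
val withGaussian = withUniform.withColumn("noise", randn(42L))
{code}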
[jira] [Created] (SPARK-7247) Add Pandas' shift method to the Dataframe API
Olivier Girardot created SPARK-7247: --- Summary: Add Pandas' shift method to the Dataframe API Key: SPARK-7247 URL: https://issues.apache.org/jira/browse/SPARK-7247 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 1.3.1 Reporter: Olivier Girardot Priority: Minor Spark's DataFrame provides several of the capabilities of Pandas and R DataFrames, but in a distributed fashion, which makes it almost easy to rewrite Pandas code for Spark. Almost, but there is a feature that's difficult to work around right now: the shift method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html I'm working with some data scientists who use this feature a lot in order to check for row equality after a sort by some keys. Example (in pandas):
{code}
df['delta'] = (df.START_DATE.shift(-1) - df.END_DATE).astype('timedelta64[D]')
{code}
I think this would be troublesome to add; I don't even know if this change would be do-able. But as a user, it would be useful for me. Olivier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
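A hypothetical workaround sketch using Spark SQL window functions, which were being added around this time; {{df}} and the date columns follow the pandas example above, and {{SORT_KEY}} is an invented ordering column:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// Emulate pandas' shift(-1) with lead() over an explicit ordering.
// Note: a Window with no partitionBy pulls all rows into one partition.
val w = Window.orderBy("SORT_KEY")
val withNext = df.withColumn("NEXT_START", lead("START_DATE", 1).over(w))
// The per-row delta is then NEXT_START minus END_DATE.
{code}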
[jira] [Assigned] (SPARK-7250) computeInverse for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7250: --- Assignee: (was: Apache Spark) computeInverse for RowMatrix Key: SPARK-7250 URL: https://issues.apache.org/jira/browse/SPARK-7250 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Stephanie Rivera -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7250) computeInverse for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520053#comment-14520053 ] Apache Spark commented on SPARK-7250: - User 'SpyderRiverA' has created a pull request for this issue: https://github.com/apache/spark/pull/5785 computeInverse for RowMatrix Key: SPARK-7250 URL: https://issues.apache.org/jira/browse/SPARK-7250 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Stephanie Rivera -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7250) computeInverse for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7250: --- Assignee: Apache Spark computeInverse for RowMatrix Key: SPARK-7250 URL: https://issues.apache.org/jira/browse/SPARK-7250 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Stephanie Rivera Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520058#comment-14520058 ] Patrick Wendell commented on SPARK-7230: I think this is a good idea. We should expose a narrower, higher-level API here and then look at user feedback to understand whether we want to support something lower level. From my experience with PySpark, it was a huge effort (probably more than 5X the original contribution) to actually implement everything in the lowest-level Spark APIs. And for the R community I don't think those low-level ETL APIs are that useful. So I'd be inclined to keep it simple at the beginning and then add complexity if we see new user demand. Make RDD API private in SparkR for Spark 1.4 Key: SPARK-7230 URL: https://issues.apache.org/jira/browse/SPARK-7230 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This ticket proposes making the RDD API in SparkR private for the 1.4 release. The motivation for doing so is discussed in a larger design document aimed at a more top-down design of the SparkR APIs. A first cut that discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI The main points in that document that relate to this ticket are: - The RDD API requires knowledge of the distributed system and is pretty low level. This is not very suitable for a number of R users who are used to more high-level packages that work out of the box. - The RDD implementation in SparkR is not fully robust right now: we are missing features like spilling for aggregation, handling partitions which don't fit in memory, etc. There are further limitations like lack of hashCode for non-native types, etc., which might affect user experience. The only change we will make for now is to not export the RDD functions as public methods in the SparkR package, and I will create another ticket to discuss the public API for 1.5 in more detail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7252) Add support for creating new Hive and HBase delegation tokens
Hari Shreedharan created SPARK-7252: --- Summary: Add support for creating new Hive and HBase delegation tokens Key: SPARK-7252 URL: https://issues.apache.org/jira/browse/SPARK-7252 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.1 Reporter: Hari Shreedharan In SPARK-5342, support is being added for long running apps to be able to write to HDFS, but this does not work for Hive and HBase. We need to add the same support for these too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7237) Many user provided closures are not actually cleaned
[ https://issues.apache.org/jira/browse/SPARK-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7237: --- Assignee: Andrew Or (was: Apache Spark) Many user provided closures are not actually cleaned Key: SPARK-7237 URL: https://issues.apache.org/jira/browse/SPARK-7237 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or It appears that many operations throughout Spark do not actually clean the closures provided by the user. Simple reproduction:
{code}
def test(): Unit = {
  sc.parallelize(1 to 10).mapPartitions { iter => return; iter }.collect()
}
{code}
Clearly, the inner closure is not serializable, but when we serialize it we should expect the ClosureCleaner to fail fast and complain loudly about return statements. Instead, we get a mysterious stack trace:
{code}
java.io.NotSerializableException: java.lang.Object
Serialization stack:
- object not serializable (class: java.lang.Object, value: java.lang.Object@6db4b914)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: nonLocalReturnKey1$1, type: class java.lang.Object)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, function1)
- field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$4, type: interface scala.Function1)
- object (class org.apache.spark.rdd.RDD$$anonfun$14, function3)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:314)
{code}
What might have caused this? If you look at the code for mapPartitions, you'll notice that we never explicitly clean the closure passed in by the user. Instead, we only wrap it in another closure and clean the outer one:
{code}
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
{code}
This is not sufficient, however, because the user provided closure is actually a field of the outer closure, which doesn't get cleaned.
If we rewrite the above by cleaning the inner closure preemptively:
{code}
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = {
  val cleanedFunc = clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedFunc(iter),
    preservesPartitioning)
}
{code}
Then we get the exception that we would expect by running the test() example above:
{code}
org.apache.spark.SparkException: Return statements aren't allowed in Spark closures
at org.apache.spark.util.ReturnStatementFinder$$anon$1.visitTypeInsn(ClosureCleaner.scala:357)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:215)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1759)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:640)
{code}
This needs to be done in a few places throughout Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7237) Many user provided closures are not actually cleaned
[ https://issues.apache.org/jira/browse/SPARK-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-7237: - Description: It appears that many operations throughout Spark do not actually clean the closures provided by the user. Simple reproduction:
{code}
def test(): Unit = {
  sc.parallelize(1 to 10).mapPartitions { iter => return; iter }.collect()
}
{code}
Clearly, the inner closure is not serializable, but when we serialize it we should expect the ClosureCleaner to fail fast and complain loudly about return statements. Instead, we get a mysterious stack trace:
{code}
java.io.NotSerializableException: java.lang.Object
Serialization stack:
- object not serializable (class: java.lang.Object, value: java.lang.Object@6db4b914)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: nonLocalReturnKey1$1, type: class java.lang.Object)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, function1)
- field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$4, type: interface scala.Function1)
- object (class org.apache.spark.rdd.RDD$$anonfun$14, function3)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:314)
{code}
What might have caused this? If you look at the code for mapPartitions, you'll notice that we never explicitly clean the closure passed in by the user. Instead, we only wrap it in another closure and clean only the outer one:
{code}
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
{code}
This is not sufficient, however, because the user provided closure is actually a field of the outer closure, and this inner closure doesn't get cleaned. If we rewrite the above by cleaning the inner closure preemptively, as we have done in other places:
{code}
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = {
  val cleanedFunc = clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedFunc(iter),
    preservesPartitioning)
}
{code}
Then we get the exception that we would expect by running the test() example above:
{code}
org.apache.spark.SparkException: Return statements aren't allowed in Spark closures
at org.apache.spark.util.ReturnStatementFinder$$anon$1.visitTypeInsn(ClosureCleaner.scala:357)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:215)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1759)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:640)
{code}
It seems to me that we simply forgot to do this in a few places (e.g. mapPartitions, keyBy, aggregateByKey), because in other similar places we do this correctly (e.g. groupBy, combineByKey, zipPartitions).