[jira] [Assigned] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6941: --- Assignee: Yijie Shen (was: Apache Spark) Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yijie Shen Priority: Blocker We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6941: --- Assignee: Apache Spark (was: Yijie Shen) Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8951) support CJK characters in collect()
[ https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621834#comment-14621834 ] Jaehong Choi commented on SPARK-8951: - Thanks for your advice. You're right. This is about supporting Unicode indeed. I'll open a PR for this issue. I didn't think about the null terminator much; I am going to figure it out as well. support CJK characters in collect() --- Key: SPARK-8951 URL: https://issues.apache.org/jira/browse/SPARK-8951 Project: Spark Issue Type: Bug Components: SparkR Reporter: Jaehong Choi Priority: Minor Attachments: SerDe.scala.diff Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters. I found out that SerDe in the R API only supports the ASCII format for strings right now, as commented in the source code. So I changed SerDe.scala a little to support CJK, as in the attached file. I did not care about efficiency; I just wanted to see if it works.
{noformat}
people.json
{"name": "가나"}
{"name": "테스트123", "age": 30}
{"name": "Justin", "age": 19}

> df <- read.df(sqlContext, "./people.json", "json")
> head(df)
Error in rawToChar(string) : embedded nul in string : '\0 \x98'
{noformat}
{code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
// NOTE: Only works for ASCII right now
def writeString(out: DataOutputStream, value: String): Unit = {
  val len = value.length
  out.writeInt(len + 1) // For the \0
  out.writeBytes(value)
  out.writeByte(0)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
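A minimal sketch of the likely direction of the fix (an assumption on my part, not the attached diff itself): write the UTF-8 byte length rather than the character count, since the two differ for CJK strings.
{code}
import java.io.DataOutputStream

def writeString(out: DataOutputStream, value: String): Unit = {
  val utf8 = value.getBytes("UTF-8")
  val len = utf8.length
  out.writeInt(len + 1)   // byte length in UTF-8, plus the trailing \0
  out.write(utf8, 0, len) // raw UTF-8 bytes, not writeBytes (which drops the high byte of each char)
  out.writeByte(0)
}
{code}
DataOutputStream.writeBytes discards the high byte of every char, which is exactly why non-ASCII strings arrive corrupted on the R side.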
[jira] [Commented] (SPARK-8923) Add @since tags to mllib.fpm
[ https://issues.apache.org/jira/browse/SPARK-8923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621821#comment-14621821 ] Apache Spark commented on SPARK-8923: - User 'rahulpalamuttam' has created a pull request for this issue: https://github.com/apache/spark/pull/7341 Add @since tags to mllib.fpm Key: SPARK-8923 URL: https://issues.apache.org/jira/browse/SPARK-8923 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 0.5h Remaining Estimate: 0.5h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8923) Add @since tags to mllib.fpm
[ https://issues.apache.org/jira/browse/SPARK-8923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8923: --- Assignee: (was: Apache Spark) Add @since tags to mllib.fpm Key: SPARK-8923 URL: https://issues.apache.org/jira/browse/SPARK-8923 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 0.5h Remaining Estimate: 0.5h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621832#comment-14621832 ] Shay Rojansky commented on SPARK-7736: -- Neelesh, not sure I understood what you're saying exactly... I agree with Esben that at the end of the day, if a Spark application fails (by throwing an exception) and does so on all Yarn application attempts, then the Yarn status of that application definitely should be FAILED... Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8973) Spark Executor usage Cpu 100+%
Xu Chen created SPARK-8973: -- Summary: Spark Executor usage Cpu 100+% Key: SPARK-8973 URL: https://issues.apache.org/jira/browse/SPARK-8973 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Xu Chen While using the Spark SQL CLI to count a CACHE TABLE, the top command showed several Spark executor processes using 100+% CPU. When I checked one of them with jstack, I found this thread:
{code:java}
"Executor task launch worker-1" daemon prio=10 tid=0x7fc9983eb000 nid=0x2f3 runnable [0x7fc9893f9000]
   java.lang.Thread.State: RUNNABLE
    at scala.collection.mutable.HashMap.update(HashMap.scala:80)
    at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.gatherCompressibilityStats(compressionSchemes.scala:233)
    at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.gatherCompressibilityStats(CompressibleColumnBuilder.scala:72)
    at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:80)
    at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
    at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
    at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
    at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
    at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1187)
    at org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply$mcV$sp(DiskStore.scala:81)
    at org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
    at org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
    at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:82)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:788)
    - locked 0x0007a9471e30 (a org.apache.spark.storage.BlockInfo)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable
[ https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621827#comment-14621827 ] Apache Spark commented on SPARK-6941: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/7342 Provide a better error message to explain that tables created from RDDs are immutable - Key: SPARK-6941 URL: https://issues.apache.org/jira/browse/SPARK-6941 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yijie Shen Priority: Blocker We should explicitly let users know that tables created from RDDs are immutable and new rows cannot be inserted into them. We can add a better error message and also explain it in the programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
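For illustration, a hedged sketch of the user-facing behavior this issue asks for; the exact message wording is for the PR to decide, so the error text in the comment is an assumption:
{code}
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, "a"))).toDF("key", "value")
df.registerTempTable("people")
// Expected: a clear, deliberate error rather than a confusing analysis failure,
// e.g. "Cannot insert into an RDD-based table" (illustrative wording only).
sqlContext.sql("INSERT INTO TABLE people SELECT * FROM people")
{code}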
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621830#comment-14621830 ] zhengruifeng commented on SPARK-7008: - Yes, LBFGS provides a faster convergence rate. An implementation of Factorization Machine (LibFM) -- Key: SPARK-7008 URL: https://issues.apache.org/jira/browse/SPARK-7008 Project: Spark Issue Type: New Feature Components: MLlib Reporter: zhengruifeng Labels: features Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png An implementation of Factorization Machines based on Scala and Spark MLlib. FM is a machine learning algorithm for multi-linear regression and is widely used for recommendation. FM has performed well in recent years' recommendation competitions. Ref: http://libfm.org/ http://doi.acm.org/10.1145/2168752.2168771 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620486#comment-14620486 ] Patrick Baier edited comment on SPARK-7751 at 7/10/15 7:19 AM: --- I built this short bash script to search for the version of methods:
{code:borderStyle=solid}
# $1 = source file to search
# $2 = string to search for
versions=$(git tag)
for v in $versions
do
  echo "Checking version $v"
  versionedFile=$(git show "$v:$1")
  matches=$(echo "$versionedFile" | grep -c "$2")
  if [ "$matches" -gt 0 ]
  then
    echo "Introduced in version $v"
    exit 0
  fi
done
echo "search string $2 not found!"
{code}
Note: You must be in the Spark home directory to run it. Example usage: $1=mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala, $2="override protected def createModel"

was (Author: pbaier): I built this short batch script here to search for the version of methods:
{code:borderStyle=solid}
# $1 = source file to search
# $2 = string to search for
versions=$(git tag)
for v in $versions
do
  echo "Checking version $v"
  versionedFile=$(git show "$v:$1")
  matches=$(echo "$versionedFile" | grep -c "$2")
  if [ "$matches" -gt 0 ]
  then
    echo "Introduced in version $v"
    exit 0
  fi
done
echo "search string $2 not found!"
{code}
Note: You must be in the Spark home directory to run it. Example usage: $1=mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala, $2="override protected def createModel"

Add @since to stable and experimental methods in MLlib -- Key: SPARK-7751 URL: https://issues.apache.org/jira/browse/SPARK-7751 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Labels: starter This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have @since tags for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib: * an example PR for Scala: https://github.com/apache/spark/pull/6101 * an example PR for Python: https://github.com/apache/spark/pull/6295 We need to dig into the git commit history to figure out the Spark version in which a method was first introduced. Take `NaiveBayes.setModelType` as an example. We can grep for `def setModelType` at different version git tags.
{code}
meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
  def setModelType(modelType: String): NaiveBayes = {
{code}
If there are better ways, please let us know. We cannot add all @since tags in a single PR, which would be hard to review. So we made some subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
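Once the introducing version is known, the tag itself is a one-liner in the Scaladoc; a self-contained toy example (the class and method here are illustrative, not MLlib's actual code):
{code}
/** A toy class showing where the @since tag goes. */
class NaiveBayesLike {
  /**
   * Sets the model type.
   * @since 1.4.0
   */
  def setModelType(modelType: String): this.type = this
}
{code}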
[jira] [Resolved] (SPARK-6051) Add an option for DirectKafkaInputDStream to commit the offsets into ZK
[ https://issues.apache.org/jira/browse/SPARK-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6051. -- Resolution: Won't Fix Add an option for DirectKafkaInputDStream to commit the offsets into ZK --- Key: SPARK-6051 URL: https://issues.apache.org/jira/browse/SPARK-6051 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Currently in DirectKafkaInputDStream, offsets are managed by Spark Streaming itself without ZK or Kafka involved, which makes several third-party offset monitoring tools unable to monitor the status of the Kafka consumer. So here is an option to commit the offsets to ZK when each job finishes. The commit is implemented asynchronously, so the main processing flow is not blocked; already tested with the KafkaOffsetMonitor tool. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
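Since this was closed as Won't Fix, the usual advice is to commit offsets yourself from the direct stream. A hedged sketch of that workaround: {{directStream}} is assumed to come from {{KafkaUtils.createDirectStream}}, and {{zkCommit}} is a hypothetical stand-in for your own ZooKeeper client code.
{code}
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // zkCommit is hypothetical: persist topic/partition/untilOffset to ZK so
  // tools like KafkaOffsetMonitor can track consumer progress.
  ranges.foreach(r => zkCommit(r.topic, r.partition, r.untilOffset))
}
{code}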
[jira] [Resolved] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3219. -- Resolution: Won't Fix K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns Labels: clustering The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8972: --- Assignee: Apache Spark Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Apache Spark Priority: Critical
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key % 100 with rollup").show(100)

// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
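For contrast, a sketch of what one would expect under standard ROLLUP semantics for this data: one subtotal row per distinct key plus a single grand-total row. The exact GROUPING__ID encoding below is an assumption based on Hive's convention, not captured output:
{code}
+---+----+------------+
|cnt| _c1|GROUPING__ID|
+---+----+------------+
|  1|   1|           0|
|  1|   2|           0|
|  1|   3|           0|
|  1|   4|           0|
|  1|   5|           0|
|  5|null|           1|
+---+----+------------+
{code}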
[jira] [Resolved] (SPARK-8973) Spark Executor usage Cpu 100+%
[ https://issues.apache.org/jira/browse/SPARK-8973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8973. -- Resolution: Not A Problem Target Version/s: (was: 1.4.0) Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Just having a busy executor is not a problem. You'd have to state a clearer problem. Spark Executor usage Cpu 100+% --- Key: SPARK-8973 URL: https://issues.apache.org/jira/browse/SPARK-8973 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Xu Chen While using the Spark SQL CLI to count a CACHE TABLE, the top command showed several Spark executor processes using 100+% CPU. When I checked one of them with jstack, I found this thread:
{code:java}
"Executor task launch worker-1" daemon prio=10 tid=0x7fc9983eb000 nid=0x2f3 runnable [0x7fc9893f9000]
   java.lang.Thread.State: RUNNABLE
    at scala.collection.mutable.HashMap.update(HashMap.scala:80)
    at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.gatherCompressibilityStats(compressionSchemes.scala:233)
    at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.gatherCompressibilityStats(CompressibleColumnBuilder.scala:72)
    at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:80)
    at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
    at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
    at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
    at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
    at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1187)
    at org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply$mcV$sp(DiskStore.scala:81)
    at org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
    at org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
    at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:82)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:788)
    - locked 0x0007a9471e30 (a org.apache.spark.storage.BlockInfo)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621929#comment-14621929 ] Apache Spark commented on SPARK-8972: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7343 Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Critical
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key % 100 with rollup").show(100)

// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8968) dynamic partitioning in spark sql performance issue due to the high GC overhead
[ https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Wang updated SPARK-8968: Summary: dynamic partitioning in spark sql performance issue due to the high GC overhead (was: shuffled by the partition columns when dynamic partitioning to optimize the memory overhead) dynamic partitioning in spark sql performance issue due to the high GC overhead --- Key: SPARK-8968 URL: https://issues.apache.org/jira/browse/SPARK-8968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Fei Wang Dynamic partitioning currently shows poor performance on big data due to GC/memory overhead: each task opens one writer per partition to write the data, which produces many small files and heavy GC. We can shuffle the data by the partition columns so that each partition ends up with only one partition file, which also reduces the GC overhead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
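A minimal sketch of the proposed direction, written against the later DataFrame API for brevity (repartitioning by columns was not available on DataFrames in 1.4, and the column names are assumptions), so treat this as illustrative rather than the actual patch:
{code}
// Shuffle rows by the partition columns first, so each task holds exactly
// one dynamic partition and opens a single writer for it.
df.repartition(df("year"), df("month"))
  .write
  .partitionBy("year", "month")
  .parquet("/warehouse/events")
{code}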
[jira] [Commented] (SPARK-7018) Refactor dev/run-tests-jenkins into Python
[ https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621885#comment-14621885 ] Josh Rosen commented on SPARK-7018: --- Hey [~boyork], any updates on this refactoring? If you're interested, I'm available to chat about whether we should break this up into a series of smaller incremental subtasks (e.g. leaving the `dev/tests` or some of the linting integration scripts as bash for now). The Jenkins script has become moderately complicated, so we may need to think about whether to do any re-architecting as part of this refactoring. Refactor dev/run-tests-jenkins into Python -- Key: SPARK-7018 URL: https://issues.apache.org/jira/browse/SPARK-7018 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Brennon York This issue is to specifically track the progress of refactoring the {{dev/run-tests-jenkins}} script into Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
[ https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621900#comment-14621900 ] Ma Xiaoyu commented on SPARK-6882: -- Can you try adding it to the classpath in spark-env.sh and make sure it stays before the other jars? Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] Key: SPARK-6882 URL: https://issues.apache.org/jira/browse/SPARK-6882 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0, 1.4.0 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled * Apache Hive 0.13.1 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 Reporter: Andrew Lee When Kerberos is enabled, I get the following exceptions.
{code}
2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
{code}
I tried it in * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 with * Apache Hive 0.13.1 * Apache Hadoop 2.4.1 Build command
{code}
mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install
{code}
When starting Spark ThriftServer in {{yarn-client}} mode, the command to start thriftserver looks like this
{code}
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client
{code}
{{hostname}} points to the current hostname of the machine I'm using. Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
{code}
2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
    at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
    at org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
    at org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
    at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
    at java.lang.Thread.run(Thread.java:744)
{code}
I'm wondering if this is due to the same problem described in HIVE-8154 HIVE-7620 due to an older code base for the Spark ThriftServer? Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run against a Kerberos cluster (Apache 2.4.1). My hive-site.xml looks like the following for spark/conf. The kerberos keytab and tgt are configured correctly; I'm able to connect to the metastore, but the subsequent steps fail due to the exception.
{code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://somehostname:9083</value>
</property>
<!-- HIVE SERVER 2 -->
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth</value>
  <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
<!-- HIVE METASTORE -->
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
[jira] [Commented] (SPARK-8968) shuffled by the partition columns when dynamic partitioning to optimize the memory overhead
[ https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621923#comment-14621923 ] Sean Owen commented on SPARK-8968: -- [~scwf] can you reword this? I can't make out what you're describing in the title or description. shuffled by the partition columns when dynamic partitioning to optimize the memory overhead -- Key: SPARK-8968 URL: https://issues.apache.org/jira/browse/SPARK-8968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Fei Wang Dynamic partitioning currently shows poor performance on big data due to GC/memory overhead: each task opens one writer per partition to write the data, which produces many small files and heavy GC. We can shuffle the data by the partition columns so that each partition ends up with only one partition file, which also reduces the GC overhead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8972: --- Assignee: (was: Apache Spark) Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Critical
{code:java}
import sqlContext.implicits._
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("foo")
sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key % 100 with rollup").show(100)

// output
+---+---+------------+
|cnt|_c1|GROUPING__ID|
+---+---+------------+
|  1|  4|           0|
|  1|  4|           1|
|  1|  5|           0|
|  1|  5|           1|
|  1|  1|           0|
|  1|  1|           1|
|  1|  2|           0|
|  1|  2|           1|
|  1|  3|           0|
|  1|  3|           1|
+---+---+------------+
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8839) Thrift Server will throw `java.util.NoSuchElementException: key not found` exception when many clients connect to it
[ https://issues.apache.org/jira/browse/SPARK-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8839: - Assignee: SaintBacchus Thrift Server will throw `java.util.NoSuchElementException: key not found` exception when many clients connect to it - Key: SPARK-8839 URL: https://issues.apache.org/jira/browse/SPARK-8839 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0 If there are about 150+ JDBC clients connecting to the Thrift Server, some clients will throw an exception such as:
{code:title=Exception message|borderStyle=solid}
java.sql.SQLException: java.util.NoSuchElementException: key not found: 90d93e56-7f6d-45bf-b340-e3ee09dd60fc
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:155)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8968) dynamic partitioning in spark sql performance issue due to the high GC overhead
[ https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621935#comment-14621935 ] Fei Wang commented on SPARK-8968: - Changed, how about this? dynamic partitioning in spark sql performance issue due to the high GC overhead --- Key: SPARK-8968 URL: https://issues.apache.org/jira/browse/SPARK-8968 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Fei Wang Dynamic partitioning currently shows poor performance on big data due to GC/memory overhead: each task opens one writer per partition to write the data, which produces many small files and heavy GC. We can shuffle the data by the partition columns so that each partition ends up with only one partition file, which also reduces the GC overhead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8940) Don't overwrite given schema if it is not null for createDataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8940: - Assignee: Liang-Chi Hsieh Don't overwrite given schema if it is not null for createDataFrame in SparkR Key: SPARK-8940 URL: https://issues.apache.org/jira/browse/SPARK-8940 Project: Spark Issue Type: Bug Components: SparkR Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.5.0 The given schema parameter will be overwritten in createDataFrame now. If it is not null, we shouldn't overwrite it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8830) levenshtein directly on top of UTF8String
[ https://issues.apache.org/jira/browse/SPARK-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8830: - Assignee: Tarek Auel levenshtein directly on top of UTF8String - Key: SPARK-8830 URL: https://issues.apache.org/jira/browse/SPARK-8830 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Tarek Auel Fix For: 1.5.0 We currently rely on commons-lang's levenshtein implementation. Ideally, we should have our own implementation to: 1. Reduce external dependency 2. Work directly against UTF8String so we don't need to convert to/from java.lang.String back and forth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
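For reference, a minimal two-row Levenshtein sketch over Unicode code points in plain Scala; the actual task in this ticket is to do the same directly on UTF8String bytes without materializing a java.lang.String, which this sketch does not attempt:
{code}
def levenshtein(a: String, b: String): Int = {
  val s = a.codePoints.toArray // compare code points, not UTF-16 chars
  val t = b.codePoints.toArray
  var prev = Array.tabulate(t.length + 1)(identity) // distances from the empty prefix
  var curr = new Array[Int](t.length + 1)
  for (i <- 1 to s.length) {
    curr(0) = i
    for (j <- 1 to t.length) {
      val cost = if (s(i - 1) == t(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    val tmp = prev; prev = curr; curr = tmp
  }
  prev(t.length)
}

levenshtein("kitten", "sitting") // 3
{code}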
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622041#comment-14622041 ] Walter Petersen commented on SPARK-3155: Hi all, I'm new here. Please tell me: - Is the proposed implementation based on a well-known research paper? If so, which one? - Is this issue still relevant? Is someone currently implementing the feature? Thanks Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leaves with the same parent, compare the total error on the validation set made by the leaves’ predictions with the error made by the parent’s predictions. Remove the leaves if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
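A minimal, self-contained sketch of step (4) of the naive algorithm above, assuming each node's validation error (the error of that node's own prediction on the held-out set) has already been computed in step (3); the Tree types are illustrative, not MLlib's internals:
{code}
sealed trait Tree { def valError: Double } // validation error of this node's own prediction
case class Leaf(valError: Double) extends Tree
case class Branch(valError: Double, left: Tree, right: Tree) extends Tree

// Bottom-up pruning: collapse a branch into a leaf whenever the parent's own
// prediction is at least as good as its (already pruned) children combined.
def prune(t: Tree): Tree = t match {
  case l: Leaf => l
  case Branch(e, l, r) =>
    val (pl, pr) = (prune(l), prune(r))
    if (e <= pl.valError + pr.valError) Leaf(e) else Branch(e, pl, pr)
}
{code}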
[jira] [Created] (SPARK-8974) The spark-dynamic-executor-allocation may not be supported
KaiXinXIaoLei created SPARK-8974: Summary: The spark-dynamic-executor-allocation may not be supported Key: SPARK-8974 URL: https://issues.apache.org/jira/browse/SPARK-8974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Fix For: 1.4.1 In yarn-client mode with spark.dynamicAllocation.enabled set to true, when the ApplicationMaster is dead or disconnected and tasks are submitted before the new ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws an exception, so the dynamic allocation feature does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8974) The spark-dynamic-executor-allocation may not be supported
[ https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-8974: - Description: In yarn-client mode with spark.dynamicAllocation.enabled set to true, when the ApplicationMaster is dead or disconnected and tasks are submitted before the new ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws an exception. When the ApplicationMaster is running and no tasks are running, the number of executors is not zero. So the dynamic allocation feature does not work. (was: In yarn-client mode with spark.dynamicAllocation.enabled set to true, when the ApplicationMaster is dead or disconnected and tasks are submitted before the new ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws an exception, so the dynamic allocation feature does not work.) The spark-dynamic-executor-allocation may not be supported -- Key: SPARK-8974 URL: https://issues.apache.org/jira/browse/SPARK-8974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Fix For: 1.4.1 In yarn-client mode with spark.dynamicAllocation.enabled set to true, when the ApplicationMaster is dead or disconnected and tasks are submitted before the new ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws an exception. When the ApplicationMaster is running and no tasks are running, the number of executors is not zero. So the dynamic allocation feature does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8843) DStream transform function receives null instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622071#comment-14622071 ] Vincent Debergue commented on SPARK-8843: - Fine for me, I'll use the emptyRDD instead. Thanks for looking into that. DStream transform function receives null instead of RDD --- Key: SPARK-8843 URL: https://issues.apache.org/jira/browse/SPARK-8843 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Vincent Debergue When using the {{transform}} function on a {{DStream}} with empty values, we can get a {{NullPointerException}}. You can reproduce the issue with this piece of code in the spark-shell:
{code}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming._
import scala.reflect.ClassTag

class EmptyDStream[T: ClassTag](ssc: StreamingContext) extends InputDStream[T](ssc) {
  override def compute(t: Time) = None
  override def start() = {}
  override def stop() = {}
}

val ssc = new StreamingContext(sc, Seconds(2))
val in = new EmptyDStream[Int](ssc)
val out = in.transform { rdd =>
  rdd.map(_ + 1) // rdd is in fact null here
}
out.print()
ssc.start() // NullPointerException
{code}
This bug is very likely to come from the usage of {{orNull}} on the Scala {{Option}}: https://github.com/apache/spark/blob/branch-1.4/streaming/src/main/scala/org/apache/spark/streaming/dstream/TransformedDStream.scala#L40 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
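A sketch of the workaround agreed on above: have the input stream return an empty RDD instead of {{None}}, so {{transform}} never sees a null.
{code}
// Inside the EmptyDStream above, replace `override def compute(t: Time) = None` with:
override def compute(t: Time) = Some(ssc.sparkContext.emptyRDD[T])
{code}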
[jira] [Updated] (SPARK-8820) Add a configuration to set the checkpoint directory for convenience.
[ https://issues.apache.org/jira/browse/SPARK-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8820: - Affects Version/s: (was: 1.5.0) Priority: Minor (was: Major) Add a configuration to set the checkpoint directory for convenience. Key: SPARK-8820 URL: https://issues.apache.org/jira/browse/SPARK-8820 Project: Spark Issue Type: Improvement Components: Streaming Reporter: SaintBacchus Priority: Minor Add a configuration named *spark.streaming.checkpointDir* to set the checkpoint directory. It will be overwritten if the user also calls *StreamingContext#checkpoint*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7977) Disallow println
[ https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7977. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7093 [https://github.com/apache/spark/pull/7093] Disallow println Key: SPARK-7977 URL: https://issues.apache.org/jira/browse/SPARK-7977 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin Labels: starter Fix For: 1.5.0 Very often we see pull requests that added println from debugging, but the author forgot to remove it before code review. We can use the regex checker to disallow println. For legitimate use of println, we can then disable the rule where they are used. Add to scalastyle-config.xml file:
{code}
<check customId="println" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
  <parameters><parameter name="regex">^println$</parameter></parameters>
  <customMessage><![CDATA[Are you sure you want to println? If yes, wrap the code block with
// scalastyle:off println
println(...)
// scalastyle:on println]]></customMessage>
</check>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
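For legitimate println use, the suppression then looks like this in source, as the custom message itself suggests:
{code}
// scalastyle:off println
println("usage: spark-submit [options]") // deliberate console output
// scalastyle:on println
{code}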
[jira] [Commented] (SPARK-8960) Style cleanup of spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622147#comment-14622147 ] Sean Owen commented on SPARK-8960: -- [~shivaram] [~nchammas] are you generally in favor of this idea or not so sure about it? I hadn't heard an objection to it, and it may free everyone up to work on the code more rapidly. Do you want to push on the separate repo idea? I'd support that. Style cleanup of spark_ec2.py - Key: SPARK-8960 URL: https://issues.apache.org/jira/browse/SPARK-8960 Project: Spark Issue Type: Task Components: EC2 Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial The spark_ec2.py script could use some cleanup I think. There are simple style issues like mixing single and double quotes, but also some rather un-Pythonic constructs (e.g. https://github.com/apache/spark/pull/6336#commitcomment-12088624 that sparked this JIRA). Whenever I read it, I always find something that is too minor for a pull request/JIRA, but I'd fix it if it were my code. Perhaps we can address such issues in this JIRA. The intention is not to introduce any behavioral changes. It's hard to verify this without testing, so perhaps we should also add some kind of test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8815) illegal java package names in jar
[ https://issues.apache.org/jira/browse/SPARK-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8815. -- Resolution: Not A Problem illegal java package names in jar - Key: SPARK-8815 URL: https://issues.apache.org/jira/browse/SPARK-8815 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Sam Halliday Priority: Minor In ENSIME we were unable to index the spark jars and we investigated further... you have classes that look like this: org.spark-project.guava.annotations.VisibleForTesting Hyphens are not legal in package names according to the Java language spec, so I'm amazed that this can actually be read at runtime... certainly no compiler I know would allow it. What I suspect is happening is that you're using a build plugin that internalises some of your dependencies and it is using your groupId but not validating it... and then blindly using that name in the ASM manipulation. You might want to report this upstream to your build plugin. For your next release, I recommend using an explicit name that is not your groupId, i.e. convert hyphens to underscores, as Gosling recommends. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8825) Spark build fails
[ https://issues.apache.org/jira/browse/SPARK-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8825. -- Resolution: Duplicate Spark build fails - Key: SPARK-8825 URL: https://issues.apache.org/jira/browse/SPARK-8825 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Environment: Linux, Ubuntu 14.04 Reporter: Nicholas Brown Building Spark (mvn install -DskipTests=true) is failing in the Spark Project Core module. The following error is being given:
{noformat}
[ERROR] while compiling: /home/nick/spark-1.4.0/core/src/main/scala/org/apache/spark/util/random/package.scala
        during phase: jvm
     library version: version 2.10.4
    compiler version: version 2.10.4
  reconstructed args: -deprecation -bootclasspath /opt/jdk1.8.0_25/jre/lib/resources.jar:/opt/jdk1.8.0_25/jre/lib/rt.jar:/opt/jdk1.8.0_25/jre/lib/sunrsasign.jar:/opt/jdk1.8.0_25/jre/lib/jsse.jar:/opt/jdk1.8.0_25/jre/lib/jce.jar:/opt/jdk1.8.0_25/jre/lib/charsets.jar:/opt/jdk1.8.0_25/jre/lib/jfr.jar:/opt/jdk1.8.0_25/jre/classes:/home/nick/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar -feature -classpath
{noformat}
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1466#comment-1466 ] Daniel Darabos commented on SPARK-4879: --- I wonder if this issue is serious enough to note in the documentation. What do you think about adding a big fat warning for speculative execution until it is fixed? Enabling speculative execution may lead to missing output files? Or perhaps add verification pass that checks if all the outputs are present and raises an exception if not. Silently dropping output files is a horrible bug. We've been debugging a somewhat mythological data corruption issue for about a month, and now we realize that this issue (SPARK-4879) is a very plausible explanation. We have never been able to reproduce it, but we have a log file, and it shows a speculative task for a {{saveAsNewAPIHadoopFile}} stage. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}: {code} // Rig a job such that all but one of the tasks complete instantly // and one task runs for 20 seconds on its first attempt and instantly // on its second attempt: val numTasks = 100 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) = if (ctx.partitionId == 0) { // If this is the one task that should run really slow if (ctx.attemptId == 0) { // If this is the first attempt, run slow Thread.sleep(20 * 1000) } } iter }.map(x = (x, x)).saveAsTextFile(/test4) {code} When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...] 
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at console:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at console:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} One interesting thing to note about this stack trace: if we look at
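A minimal sketch of the verification pass suggested in the comment above, assuming a Hadoop-style output directory of part-files; the helper name and the check are illustrative, not Spark's actual code:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: after a save, confirm that every expected part-file
// exists, and raise an exception instead of silently accepting missing output.
def verifyOutput(path: String, expectedPartitions: Int): Unit = {
  val fs = FileSystem.get(new Configuration())
  val partFiles = fs.listStatus(new Path(path))
    .map(_.getPath.getName)
    .filter(_.startsWith("part-"))
  if (partFiles.length != expectedPartitions) {
    throw new java.io.IOException(
      s"Expected $expectedPartitions output partitions at $path, found ${partFiles.length}")
  }
}
{code}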
[jira] [Updated] (SPARK-8974) The spark-dynamic-executor-allocation may not be supported
[ https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8974: - Target Version/s: (was: 1.4.0) Fix Version/s: (was: 1.4.1) [~KaiXinXIaoLei] I'd ask again that you read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Your JIRA doesn't make sense: it can't be Fixed for 1.4.1 already, since there is no change here. It can't Target Version 1.4.0, which is already released. The spark-dynamic-executor-allocation may not be supported -- Key: SPARK-8974 URL: https://issues.apache.org/jira/browse/SPARK-8974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei In yarn-client mode with spark.dynamicAllocation.enabled set to true, if tasks are submitted while the ApplicationMaster is dead or disconnected, before a new ApplicationMaster starts, the spark-dynamic-executor-allocation thread will throw an exception. When an ApplicationMaster is running and no tasks are running, the number of executors is not zero, so the dynamicAllocation feature is not supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1466#comment-1466 ] Daniel Darabos edited comment on SPARK-4879 at 7/10/15 12:25 PM: - I wonder if this issue is serious enough to note in the documentation. What do you think about adding a big fat warning for speculative execution until it is fixed? Enabling speculative execution may lead to missing output files? Or perhaps add a verification pass that checks if all the outputs are present and raises an exception if not. Silently dropping output files is a horrible bug. We've been debugging a somewhat mythological data corruption issue for about a month, and now we realize that this issue (SPARK-4879) is a very plausible explanation. We have never been able to reproduce it, but we have a log file, and it shows a speculative task for a {{saveAsNewAPIHadoopFile}} stage. was (Author: darabos): I wonder if this issue is serious enough to note in the documentation. What do you think about adding a big fat warning for speculative execution until it is fixed? Enabling speculative execution may lead to missing output files? Or perhaps add verification pass that checks if all the outputs are present and raises an exception if not. Silently dropping output files is a horrible bug. We've been debugging a somewhat mythological data corruption issue for about a month, and now we realize that this issue (SPARK-4879) is a very plausible explanation. We have never been able to reproduce it, but we have a log file, and it shows a speculative task for a {{saveAsNewAPIHadoopFile}} stage. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}:
{code}
// Rig a job such that all but one of the tasks complete instantly
// and one task runs for 20 seconds on its first attempt and instantly
// on its second attempt:
val numTasks = 100
sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
  if (ctx.partitionId == 0) { // If this is the one task that should run really slow
    if (ctx.attemptId == 0) { // If this is the first attempt, run slow
      Thread.sleep(20 * 1000)
    }
  }
  iter
}.map(x => (x, x)).saveAsTextFile("/test4")
{code}
When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...] 
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at console:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at console:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
[jira] [Assigned] (SPARK-8995) Cast date strings with date, date and time and just time information to DateType and TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8995: --- Assignee: (was: Apache Spark) Cast date strings with date, date and time and just time information to DateType and TimestampType -- Key: SPARK-8995 URL: https://issues.apache.org/jira/browse/SPARK-8995 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tarek Auel Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot cast strings like '13:18:08' to a valid date and extract the hours later. It's not possible to parse strings that contain both date and time information, like '2015-03-18 12:25:49', to a date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7735) Raise Exception on non-zero exit from pyspark pipe commands
[ https://issues.apache.org/jira/browse/SPARK-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-7735: -- Assignee: (was: Davies Liu) Raise Exception on non-zero exit from pyspark pipe commands --- Key: SPARK-7735 URL: https://issues.apache.org/jira/browse/SPARK-7735 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.3.1 Reporter: Scott Taylor Priority: Minor Labels: newbie, patch Fix For: 1.5.0 In PySpark, errors are ignored when using the {{rdd.pipe}} function. This differs from the Scala behaviour, where an abnormal exit of the piped command raises an error. I have submitted a pull request on GitHub which I believe will bring the PySpark behaviour closer to the Scala behaviour. A simple case where this bug may be problematic is using a network bash utility to perform computations on an RDD. Currently, network errors are ignored and blank results returned, when it would be more desirable to raise an exception so that Spark can retry the failed task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
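For contrast, a minimal sketch of the Scala behaviour referenced above; {{false}} is a stock Unix command that always exits with a non-zero status, so evaluating the piped RDD is expected to surface a task failure rather than return empty output:
{code}
// Scala API: a non-zero exit status from the piped command fails the task
// when the RDD is evaluated, instead of being silently ignored.
val piped = sc.parallelize(Seq("a", "b"), 2).pipe("false")
piped.collect() // expected to throw, since `false` exits with status 1
{code}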
[jira] [Assigned] (SPARK-8995) Cast date strings with date, date and time and just time information to DateType and TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8995: --- Assignee: Apache Spark Cast date strings with date, date and time and just time information to DateType and TimestampType -- Key: SPARK-8995 URL: https://issues.apache.org/jira/browse/SPARK-8995 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tarek Auel Assignee: Apache Spark Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot cast strings like '13:18:08' to a valid date and extract the hours later. It's not possible to parse strings that contain both date and time information, like '2015-03-18 12:25:49', to a date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8995) Cast date strings with date, date and time and just time information to DateType and TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarek Auel updated SPARK-8995: -- Description: Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot cast strings like '13:18:08' to a valid date and extract the hours later. It's not possible to parse strings that contain both date and time information, like '2015-03-18 12:25:49', to a date. (was: Tests of https://github.com/apache/spark/pull/6981 fails, because we can not cast strings like '13:18:08' to a valid date and extract the hours later. It's not possible to parse strings that contains date and time information to date, like '2015-03-18 12:25:49') Cast date strings with date, date and time and just time information to DateType and TimestampType -- Key: SPARK-8995 URL: https://issues.apache.org/jira/browse/SPARK-8995 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tarek Auel Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot cast strings like '13:18:08' to a valid date and extract the hours later. It's not possible to parse strings that contain both date and time information, like '2015-03-18 12:25:49', to a date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan
Xiangrui Meng created SPARK-8998: Summary: Collect enough frequent prefixes before projection in PrefixSpan Key: SPARK-8998 URL: https://issues.apache.org/jira/browse/SPARK-8998 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Zhang JiaJin The implementation in SPARK-6487 might have scalability issues when the number of frequent items is very small. In this case, we can generate candidate sets of higher orders using Apriori-like algorithms and count them, until we collect enough prefixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
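A rough sketch of the Apriori-like step described above, with illustrative names only (not Spark's implementation): extend each order-k frequent prefix by every frequent item to form the order-(k+1) candidates, and repeat until enough prefixes have been collected:
{code}
// Generate the next round of candidate prefixes from the current ones.
def candidates(prefixes: Set[List[Int]], freqItems: Set[Int]): Set[List[Int]] =
  for (p <- prefixes; item <- freqItems) yield p :+ item
{code}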
[jira] [Assigned] (SPARK-8974) The spark-dynamic-executor-allocation may not be supported
[ https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8974: --- Assignee: (was: Apache Spark) The spark-dynamic-executor-allocation may not be supported -- Key: SPARK-8974 URL: https://issues.apache.org/jira/browse/SPARK-8974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Fix For: 1.5.0 In yarn-client mode with spark.dynamicAllocation.enabled set to true, if tasks are submitted while the ApplicationMaster is dead or disconnected, before a new ApplicationMaster starts, the spark-dynamic-executor-allocation thread will throw an exception. When an ApplicationMaster is running and no tasks are running, the number of executors is not zero, so the dynamicAllocation feature is not supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8598) Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
[ https://issues.apache.org/jira/browse/SPARK-8598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8598. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6994 [https://github.com/apache/spark/pull/6994] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs --- Key: SPARK-8598 URL: https://issues.apache.org/jira/browse/SPARK-8598 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jose Cambronero Assignee: Jose Cambronero Priority: Minor Fix For: 1.5.0 We have implemented a 1-sample, two-sided version of the Kolmogorov-Smirnov test, which tests the null hypothesis that the sample comes from a given continuous distribution. We provide various functions to access the functionality: namely, a function that takes an RDD[Double] of the data and a lambda to calculate the CDF; a function that takes an RDD[Double] and an {{Iterator[(Double, Double, Double)] => Iterator[Double]}}, which uses mapPartitions to provide an optimized way to perform the calculation when the CDF calculation requires a non-serializable object (e.g. the Apache Commons Math real distributions); and finally a function that takes an RDD[Double] and a String name of the theoretical distribution to be used. The appropriate result class has been added, as well as tests in HypothesisTestSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
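A small usage sketch of the string-name variant described above, assuming the MLlib {{Statistics}} entry point and a standard normal reference distribution:
{code}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Test the null hypothesis that the sample is drawn from N(0, 1).
val data: RDD[Double] = sc.parallelize(Seq(0.1, -0.4, 1.2, 0.3, -0.9))
val result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
println(result) // KS statistic and p-value
{code}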
[jira] [Created] (SPARK-8997) Improve LocalPrefixSpan performance
Xiangrui Meng created SPARK-8997: Summary: Improve LocalPrefixSpan performance Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang We can improve the performance by: 1. run should output an Iterator instead of an Array. 2. The local count shouldn't use groupBy, which creates too many arrays; we can use PrimitiveKeyOpenHashMap instead. 3. We can use a list to avoid materializing frequent sequences. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
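A sketch of point 2 above, using a plain mutable map as a stand-in for {{PrimitiveKeyOpenHashMap}}: a single pass of in-place counting instead of a {{groupBy}} that materializes an intermediate collection per key:
{code}
import scala.collection.mutable

// Count how many sequences contain each item, without groupBy.
def countItems(sequences: Iterator[Array[Int]]): mutable.Map[Int, Long] = {
  val counts = mutable.Map.empty[Int, Long].withDefaultValue(0L)
  for (seq <- sequences; item <- seq.distinct) {
    counts(item) += 1L
  }
  counts
}
{code}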
[jira] [Commented] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623218#comment-14623218 ] Xiangrui Meng commented on SPARK-6487: -- Please check linked JIRAs for follow-up work. Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin Assignee: Zhang JiaJin Priority: Critical Fix For: 1.5.0 [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In past work, we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data is too large, the running time becomes too long and cannot meet the service requirements. We are ready to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. The related papers: PrefixSpan: Pei, Jian, et al. Mining sequential patterns by pattern-growth: The prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 16.11 (2004): 1424-1440. Parallel Algorithm: Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed sequential patterns. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005. Distributed Algorithm: Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan algorithm based on MapReduce. Information Technology in Medicine and Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012. Pattern mining and sequential mining knowledge: Han, Jiawei, et al. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8999) Support non-temporal sequence in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8999: - Description: In SPARK-6487, we assume that all items are ordered. However, we should support non-temporal sequences in PrefixSpan. This should be done before 1.5 because it changes PrefixSpan APIs. (was: In SPARK-6487, we assume that all items are ordered. However, we should support non-temporal sequences in PrefixSpan.) Support non-temporal sequence in PrefixSpan --- Key: SPARK-8999 URL: https://issues.apache.org/jira/browse/SPARK-8999 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Critical In SPARK-6487, we assume that all items are ordered. However, we should support non-temporal sequences in PrefixSpan. This should be done before 1.5 because it changes PrefixSpan APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8999) Support non-temporal sequence in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8999: - Description: In SPARK-6487, we assume that all items are ordered. However, we should support non-temporal sequences in PrefixSpan. Support non-temporal sequence in PrefixSpan --- Key: SPARK-8999 URL: https://issues.apache.org/jira/browse/SPARK-8999 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Critical In SPARK-6487, we assume that all items are ordered. However, we should support non-temporal sequences in PrefixSpan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8999) Support non-temporal sequence in PrefixSpan
Xiangrui Meng created SPARK-8999: Summary: Support non-temporal sequence in PrefixSpan Key: SPARK-8999 URL: https://issues.apache.org/jira/browse/SPARK-8999 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8962) Disallow Class.forName
[ https://issues.apache.org/jira/browse/SPARK-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8962: --- Assignee: (was: Apache Spark) Disallow Class.forName -- Key: SPARK-8962 URL: https://issues.apache.org/jira/browse/SPARK-8962 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen We should add a regex rule to Scalastyle which prohibits the use of {{Class.forName}}. We should not use {{Class.forName}} directly because it loads classes from the system's default class loader rather than the appropriate context class loader. We should call {{Utils.classForName}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
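A minimal sketch of the preferred pattern, along the lines of what a {{Utils.classForName}}-style helper does (illustrative, not Spark's exact code):
{code}
// Resolve classes through the thread's context class loader rather than the
// system default, so classes shipped with the application remain visible.
def classForName(className: String): Class[_] =
  Class.forName(className, true, Thread.currentThread().getContextClassLoader)
{code}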
[jira] [Commented] (SPARK-8962) Disallow Class.forName
[ https://issues.apache.org/jira/browse/SPARK-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623067#comment-14623067 ] Apache Spark commented on SPARK-8962: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7350 Disallow Class.forName -- Key: SPARK-8962 URL: https://issues.apache.org/jira/browse/SPARK-8962 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen We should add a regex rule to Scalastyle which prohibits the use of {{Class.forName}}. We should not use {{Class.forName}} directly because it loads classes from the system's default class loader rather than the appropriate context class loader. We should call {{Utils.classForName}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8834) Throttle DStreams dynamically through back-pressure information
[ https://issues.apache.org/jira/browse/SPARK-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] François Garillot closed SPARK-8834. Resolution: Duplicate Throttle DStreams dynamically through back-pressure information --- Key: SPARK-8834 URL: https://issues.apache.org/jira/browse/SPARK-8834 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: François Garillot This aims to have Spark Streaming be more resilient to high-throughput situations through back-pressure signaling and dynamic throttling. The design doc can be found here: https://issues.apache.org/jira/browse/SPARK-8834 An (outdated) [PoC implementation|https://github.com/typesafehub/spark/pull/13] exists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6487. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7258 [https://github.com/apache/spark/pull/7258] Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin Assignee: Zhang JiaJin Priority: Critical Fix For: 1.5.0 [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In past work, we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data is too large, the running time becomes too long and cannot meet the service requirements. We are ready to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. The related papers: PrefixSpan: Pei, Jian, et al. Mining sequential patterns by pattern-growth: The prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 16.11 (2004): 1424-1440. Parallel Algorithm: Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed sequential patterns. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005. Distributed Algorithm: Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan algorithm based on MapReduce. Information Technology in Medicine and Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012. Pattern mining and sequential mining knowledge: Han, Jiawei, et al. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8994) Tiny cleanups to Params, Pipeline
[ https://issues.apache.org/jira/browse/SPARK-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8994. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7349 [https://github.com/apache/spark/pull/7349] Tiny cleanups to Params, Pipeline - Key: SPARK-8994 URL: https://issues.apache.org/jira/browse/SPARK-8994 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.5.0 Small cleanups per remaining comments in [https://github.com/apache/spark/pull/5820] which resolved [SPARK-5956] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623226#comment-14623226 ] Jesper Lundgren edited comment on SPARK-8941 at 7/11/15 4:31 AM: - Maybe it is better to close this issue and open a new one for the API change and the documentation issues. I'll try to review some of the issues we had with the standalone cluster and see if I should create JIRA tickets for some of them. For example, when using supervised mode in an HA cluster, there is no well-documented procedure to force stop and disable restart of a driver (in case the driver exits with the wrong exit code). I know of the kill command, {{bin/spark-class org.apache.spark.deploy.Client kill}}, but in my experience it does not always work. was (Author: koudelka): Maybe it is better to close this issue and open a new one for the API change and the documentation issues. I'll probably try to review some of the issues we had with the standalone cluster and see if I should create JIRA tickets for some of them. For example, when using supervised mode in an HA cluster, there is no well-documented procedure to force stop and disable restart of a driver (in case the driver exits with the wrong exit code). I know of the kill command, {{bin/spark-class org.apache.spark.deploy.Client kill}}, but in my experience it does not always work. Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes, e.g.: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. Update: start-slave.sh only expects the master list in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623226#comment-14623226 ] Jesper Lundgren edited comment on SPARK-8941 at 7/11/15 4:31 AM: - Maybe it is better to close this issue and open a new one for the API change and the documentation issues. I'll try to review some of the issues we have had with the standalone cluster to see if I should create JIRA tickets for some of them. For example, when using supervised mode in an HA cluster, there is no well-documented procedure to force stop and disable restart of a driver (in case the driver exits with the wrong exit code). I know of the kill command, {{bin/spark-class org.apache.spark.deploy.Client kill}}, but in my experience it does not always work. was (Author: koudelka): Maybe it is better to close this issue and open a new one for the API change and the documentation issues. I'll try to review some of the issues we had with the standalone cluster and see if I should create JIRA tickets for some of them. For example, when using supervised mode in an HA cluster, there is no well-documented procedure to force stop and disable restart of a driver (in case the driver exits with the wrong exit code). I know of the kill command, {{bin/spark-class org.apache.spark.deploy.Client kill}}, but in my experience it does not always work. Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes, e.g.: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. Update: start-slave.sh only expects the master list in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
[ https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623230#comment-14623230 ] Andrew Lee commented on SPARK-6882: --- I don't think updating spark-env.sh {{SPARK_CLASSPATH}} will be a good idea since this conflicts with {{--driver-class-path}} in yarn-client mode. But if this is the current work around, I can specify it with a different directory with SPARK_CONF_DIR just to get it up and running. Regarding Bin's approach, I believe you will need to enable {{spark.yarn.user.classpath.first}} according to SPARK-939, but I think it should be picking up user JAR y default now, isn't? Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] Key: SPARK-6882 URL: https://issues.apache.org/jira/browse/SPARK-6882 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0, 1.4.0 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled * Apache Hive 0.13.1 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 Reporter: Andrew Lee When Kerberos is enabled, I get the following exceptions. {code} 2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] {code} I tried it in * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 with * Apache Hive 0.13.1 * Apache Hadoop 2.4.1 Build command {code} mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install {code} When starting Spark ThriftServer in {{yarn-client}} mode, the command to start thriftserver looks like this {code} ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client {code} {{hostname}} points to the current hostname of the machine I'm using. Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1) {code} 2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56) at org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118) at org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133) at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43) at java.lang.Thread.run(Thread.java:744) {code} I'm wondering if this is due to the same problem described in HIVE-8154 HIVE-7620 due to an older code base for the Spark ThriftServer? Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run against a Kerberos cluster (Apache 2.4.1). My hive-site.xml looks like the following for spark/conf. The kerberos keytab and tgt are configured correctly, I'm able to connect to metastore, but the subsequent steps failed due to the exception. 
{code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://somehostname:9083</value>
</property>
<!-- HIVE SERVER 2 -->
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth</value>
  <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>
<property>
[jira] [Comment Edited] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
[ https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623230#comment-14623230 ] Andrew Lee edited comment on SPARK-6882 at 7/11/15 4:35 AM: I don't think updating spark-env.sh {{SPARK_CLASSPATH}} will be a good idea since this conflicts with {{--driver-class-path}} in yarn-client mode. But if this is the current work around, I can specify it with a different directory with SPARK_CONF_DIR just to get it up and running. Regarding Bin's approach, I believe you will need to enable {{spark.yarn.user.classpath.first}} according to SPARK-939, but I think it should be picking up user JAR by default now, isn't? was (Author: alee526): I don't think updating spark-env.sh {{SPARK_CLASSPATH}} will be a good idea since this conflicts with {{--driver-class-path}} in yarn-client mode. But if this is the current work around, I can specify it with a different directory with SPARK_CONF_DIR just to get it up and running. Regarding Bin's approach, I believe you will need to enable {{spark.yarn.user.classpath.first}} according to SPARK-939, but I think it should be picking up user JAR y default now, isn't? Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] Key: SPARK-6882 URL: https://issues.apache.org/jira/browse/SPARK-6882 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0, 1.4.0 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled * Apache Hive 0.13.1 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 Reporter: Andrew Lee When Kerberos is enabled, I get the following exceptions. {code} 2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] {code} I tried it in * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 with * Apache Hive 0.13.1 * Apache Hadoop 2.4.1 Build command {code} mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install {code} When starting Spark ThriftServer in {{yarn-client}} mode, the command to start thriftserver looks like this {code} ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client {code} {{hostname}} points to the current hostname of the machine I'm using. 
Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1) {code} 2015-03-13 18:26:05,363 ERROR org.apache.hive.service.cli.thrift.ThriftCLIService (ThriftBinaryCLIService.java:run(93)) - Error: java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth] at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56) at org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118) at org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133) at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43) at java.lang.Thread.run(Thread.java:744) {code} I'm wondering if this is due to the same problem described in HIVE-8154 HIVE-7620 due to an older code base for the Spark ThriftServer? Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run against a Kerberos cluster (Apache 2.4.1). My hive-site.xml looks like the following for spark/conf. The kerberos keytab and tgt are configured correctly, I'm able to connect to metastore, but the subsequent steps failed due to the exception. {code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
[jira] [Created] (SPARK-9000) Support generic item type in PrefixSpan
Xiangrui Meng created SPARK-9000: Summary: Support generic item type in PrefixSpan Key: SPARK-9000 URL: https://issues.apache.org/jira/browse/SPARK-9000 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Critical In SPARK-6487, we only support Int type. It requires users to encode other types into integer to use PrefixSpan. We should be able to do this inside PrefixSpan, similar to FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
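A rough sketch of what such an encoding could look like, with illustrative names only (not Spark's API): index arbitrary item types to Ints, run the Int-based algorithm, then map the results back:
{code}
// Build an Int index over the distinct items, encode, and keep the
// reverse mapping for decoding frequent sequences afterwards.
val sequences: Seq[Array[String]] = Seq(Array("a", "b"), Array("b", "c"))
val itemToInt: Map[String, Int] = sequences.flatten.distinct.zipWithIndex.toMap
val intToItem: Map[Int, String] = itemToInt.map(_.swap)
val encoded: Seq[Array[Int]] = sequences.map(_.map(itemToInt))
// ... run the Int-based PrefixSpan on `encoded`, then decode via `intToItem` ...
{code}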
[jira] [Updated] (SPARK-9000) Support generic item type in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9000: - Description: In SPARK-6487, we only support Int type. It requires users to encode other types into integer to use PrefixSpan. We should be able to do this inside PrefixSpan, similar to FPGrowth. This should be done before 1.5 since it changes APIs. (was: In SPARK-6487, we only support Int type. It requires users to encode other types into integer to use PrefixSpan. We should be able to do this inside PrefixSpan, similar to FPGrowth.) Support generic item type in PrefixSpan --- Key: SPARK-9000 URL: https://issues.apache.org/jira/browse/SPARK-9000 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Priority: Critical In SPARK-6487, we only support Int type. It requires users to encode other types into integer to use PrefixSpan. We should be able to do this inside PrefixSpan, similar to FPGrowth. This should be done before 1.5 since it changes APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8835) Provide pluggable Congestion Strategies to deal with Streaming load
[ https://issues.apache.org/jira/browse/SPARK-8835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] François Garillot updated SPARK-8835: - Description: Second part of [SPARK-7398|https://issues.apache.org/jira/browse/SPARK-7398] (which has an over-arching, high-level design doc). An (outdated) [PoC implementation|https://github.com/huitseeker/spark/tree/ReactiveStreamingBackPressureControl/] exists. was: Second part of [SPARK-7398|https://issues.apache.org/jira/browse/SPARK-7398] (which has an over-arching, high-level design doc). An (outdated) [PoC implementation|https://github.com/typesafehub/spark/pull/13] exists. Provide pluggable Congestion Strategies to deal with Streaming load --- Key: SPARK-8835 URL: https://issues.apache.org/jira/browse/SPARK-8835 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: François Garillot Second part of [SPARK-7398|https://issues.apache.org/jira/browse/SPARK-7398] (which has an over-arching, high-level design doc). An (outdated) [PoC implementation|https://github.com/huitseeker/spark/tree/ReactiveStreamingBackPressureControl/] exists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8986) GaussianMixture should take smoothing param
[ https://issues.apache.org/jira/browse/SPARK-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623071#comment-14623071 ] Joseph K. Bradley commented on SPARK-8986: -- Thanks for looking around some! I was not really thinking of anything fancy. I was hoping existing libraries would do something like add a small constant to the diagonal of the covariance matrix of each Gaussian. If there is no standard to follow, we could just do that. It'd be interesting to investigate fancier approaches in another JIRA. GaussianMixture should take smoothing param --- Key: SPARK-8986 URL: https://issues.apache.org/jira/browse/SPARK-8986 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Original Estimate: 144h Remaining Estimate: 144h Gaussian mixture models should take a smoothing parameter which makes the algorithm robust against degenerate data or bad initializations. Whoever takes this JIRA should look at other libraries (sklearn, R packages, Weka, etc.) to see how they do smoothing and what their API looks like. Please summarize your findings here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
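A minimal sketch of that simple approach, using Breeze (which MLlib uses internally); the epsilon value is an arbitrary placeholder:
{code}
import breeze.linalg.DenseMatrix

// Keep each Gaussian's covariance well-conditioned by adding a small
// constant to its diagonal.
def smooth(cov: DenseMatrix[Double], eps: Double = 1e-6): DenseMatrix[Double] =
  cov + (DenseMatrix.eye[Double](cov.rows) * eps)
{code}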
[jira] [Created] (SPARK-8996) Add Python API for Kolmogorov-Smirnov Test
Xiangrui Meng created SPARK-8996: Summary: Add Python API for Kolmogorov-Smirnov Test Key: SPARK-8996 URL: https://issues.apache.org/jira/browse/SPARK-8996 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Add Python API for the Kolmogorov-Smirnov test implemented in SPARK-8598. It should be similar to ChiSqTest in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623226#comment-14623226 ] Jesper Lundgren commented on SPARK-8941: Maybe it is better to close this issue and open a new one for the API change and the documentation issues. I'll probably try to review some of the issues we had with the standalone cluster and see if I should create JIRA tickets for some of them. For example, when using supervised mode in an HA cluster, there is no well-documented procedure to force stop and disable restart of a driver (in case the driver exits with the wrong exit code). I know of the kill command, {{bin/spark-class org.apache.spark.deploy.Client kill}}, but in my experience it does not always work. Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes, e.g.: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. Update: start-slave.sh only expects the master list in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8962) Disallow Class.forName
[ https://issues.apache.org/jira/browse/SPARK-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8962: --- Assignee: Apache Spark Disallow Class.forName -- Key: SPARK-8962 URL: https://issues.apache.org/jira/browse/SPARK-8962 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Josh Rosen Assignee: Apache Spark We should add a regex rule to Scalastyle which prohibits the use of {{Class.forName}}. We should not use {{Class.forName}} directly because it loads classes from the system's default class loader rather than the appropriate context class loader. We should call {{Utils.classForName}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623095#comment-14623095 ] Joseph K. Bradley commented on SPARK-6684: -- I have heard this may be an issue for some users who use many iterations. Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
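As a sketch of the general technique (not GradientBoostedTrees' actual code), an iterative job can truncate its lineage by checkpointing every few iterations:
{code}
// Periodically checkpoint an iteratively updated RDD so its lineage
// does not grow without bound; the interval and path are placeholders.
sc.setCheckpointDir("/tmp/spark-checkpoints")
var data = sc.parallelize(1 to 1000000).map(_.toDouble)
for (iter <- 1 to 100) {
  data = data.map(_ * 1.0001).persist()
  if (iter % 10 == 0) {
    data.checkpoint() // materialized on the next action over `data`
  }
}
{code}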
[jira] [Created] (SPARK-8977) Define the RateEstimator interface, and implement the ReceiverRateController
Iulian Dragos created SPARK-8977: Summary: Define the RateEstimator interface, and implement the ReceiverRateController Key: SPARK-8977 URL: https://issues.apache.org/jira/browse/SPARK-8977 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Fix For: 1.5.0 Full [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing] Implement a rate controller for receiver-based InputDStreams that estimates a maximum rate and sends it to each receiver supervisor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
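A sketch of what such a contract could look like (illustrative only; the linked design doc is authoritative): given batch-completion statistics, produce an updated records-per-second bound, or None if there is no update:
{code}
// Illustrative contract for a rate estimator fed with batch-completion data.
trait RateEstimator extends Serializable {
  def compute(
      time: Long,            // batch completion time
      elements: Long,        // records processed in the batch
      processingDelay: Long, // milliseconds spent processing
      schedulingDelay: Long  // milliseconds spent waiting to be scheduled
    ): Option[Double]
}
{code}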
[jira] [Resolved] (SPARK-8013) Get JDBC server working with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8013. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6903 [https://github.com/apache/spark/pull/6903] Get JDBC server working with Scala 2.11 --- Key: SPARK-8013 URL: https://issues.apache.org/jira/browse/SPARK-8013 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Patrick Wendell Assignee: Iulian Dragos Priority: Critical Fix For: 1.5.0 It's worth some investigation here, but I believe the simplest solution is to see if we can get Scala to shade its use of JLine to avoid JLine conflicts between Hive and the Spark REPL. It's also possible that there is a simpler internal solution to the conflict (I haven't looked at it in a long time), so doing some investigation of that would be good. IIRC, there is use of JLine in our own REPL code, in addition to in Hive and also in the Scala 2.11 REPL. Back when we created the 2.11 build I couldn't harmonize all the versions in a nice way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8980) Setup cluster with spark-ec2 scripts as non-root user
Mathieu DESPRIEE created SPARK-8980: --- Summary: Setup cluster with spark-ec2 scripts as non-root user Key: SPARK-8980 URL: https://issues.apache.org/jira/browse/SPARK-8980 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.4.0 Reporter: Mathieu DESPRIEE Priority: Minor The spark-ec2 scripts install everything as root, which is not a best practice. Suggestion: use a sudoer instead (ec2-user, available in the AMI, is one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6154. -- Resolution: Fixed Assignee: Iulian Dragos Fix Version/s: 1.5.0 I think this is resolved by SPARK-8013, effectively? https://github.com/apache/spark/pull/6903 Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Assignee: Iulian Dragos Fix For: 1.5.0 Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8976) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)
[ https://issues.apache.org/jira/browse/SPARK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622466#comment-14622466 ] Olivier Delalleau commented on SPARK-8976: -- NB: I fixed the issue by replacing line 149 of worker.py with: sock_file = sock.makefile("rwb", 65536) but I'm not sure it's a good fix (and I don't know if it's compatible with Python 2) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed) Key: SPARK-8976 URL: https://issues.apache.org/jira/browse/SPARK-8976 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Windows 7 Reporter: Olivier Delalleau See Github report: https://github.com/apache/spark/pull/5173#issuecomment-113410652 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
[ https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7944: - Assignee: Iulian Dragos Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path Key: SPARK-7944 URL: https://issues.apache.org/jira/browse/SPARK-7944 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.1, 1.4.0 Environment: scala 2.11 Reporter: Alexander Nakos Assignee: Iulian Dragos Priority: Critical Fix For: 1.5.0 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt When I run the spark-shell with the --jars argument and supply a path to a single jar file, none of the classes in the jar are available in the REPL. I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the contents of the jar are available in the 1.3.1_2.10 REPL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8980) Setup cluster with spark-ec2 scripts as non-root user
[ https://issues.apache.org/jira/browse/SPARK-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622482#comment-14622482 ] Sean Owen commented on SPARK-8980: -- Wasn't the conclusion from the thread that this isn't going to happen? Setup cluster with spark-ec2 scripts as non-root user - Key: SPARK-8980 URL: https://issues.apache.org/jira/browse/SPARK-8980 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.4.0 Reporter: Mathieu DESPRIEE Priority: Minor The spark-ec2 scripts install everything as root, which is not a best practice. Suggestion: use a sudoer instead (ec2-user, available in the AMI, is one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
[ https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7944. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6903 [https://github.com/apache/spark/pull/6903] Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path Key: SPARK-7944 URL: https://issues.apache.org/jira/browse/SPARK-7944 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.1, 1.4.0 Environment: scala 2.11 Reporter: Alexander Nakos Priority: Critical Fix For: 1.5.0 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt When I run the spark-shell with the --jars argument and supply a path to a single jar file, none of the classes in the jar are available in the REPL. I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the contents of the jar are available in the 1.3.1_2.10 REPL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622510#comment-14622510 ] Iulian Dragos commented on SPARK-5281: -- Thanks for pointing them out. Glad it wasn't too bad :) Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Reporter: sarsol Assignee: Iulian Dragos Priority: Critical Fix For: 1.4.0 The application crashes on this line, {{rdd.registerTempTable("temp")}}, in version 1.2 when using sbt or the Eclipse Scala IDE. Stacktrace: {code} Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8974) The spark-dynamic-executor-allocation may be not supported
[ https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622101#comment-14622101 ] Sean Owen commented on SPARK-8974: -- Why do you say this means it's not supported? It sounds like it works, but are you saying there is a problem in error recovery? Executor allocation should fail in this case, but it should succeed if the AM recovers. The spark-dynamic-executor-allocation may be not supported -- Key: SPARK-8974 URL: https://issues.apache.org/jira/browse/SPARK-8974 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: KaiXinXIaoLei Fix For: 1.4.1 In yarn-client mode with the config option spark.dynamicAllocation.enabled set to true, if the ApplicationMaster is dead or disconnected and tasks are submitted before a new ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws an exception. Afterwards, even when the ApplicationMaster is running and no tasks are running, the number of executors is not zero, so the dynamic allocation feature is effectively not supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7977) Disallow println
[ https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7977: - Assignee: Jon Alter Disallow println Key: SPARK-7977 URL: https://issues.apache.org/jira/browse/SPARK-7977 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin Assignee: Jon Alter Labels: starter Fix For: 1.5.0 Very often we see pull requests that added println for debugging, but the author forgot to remove it before code review. We can use the regex checker to disallow println. For legitimate uses of println, we can then disable the rule where it is used. Add to the scalastyle-config.xml file:
{code}
<check customId="println" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
  <parameters><parameter name="regex">^println$</parameter></parameters>
  <customMessage><![CDATA[Are you sure you want to println? If yes, wrap the code block with
    // scalastyle:off println
    println(...)
    // scalastyle:on println]]></customMessage>
</check>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8815) illegal java package names in jar
[ https://issues.apache.org/jira/browse/SPARK-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622211#comment-14622211 ] Sam Halliday commented on SPARK-8815: - Interesting. BTW, I see you're at ScalaX in December. I'll see you there! I gave a talk last year about high performance mathematics (i.e. netlib-java), but this year I'll be talking about generic programming. illegal java package names in jar - Key: SPARK-8815 URL: https://issues.apache.org/jira/browse/SPARK-8815 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Sam Halliday Priority: Minor In ENSIME we were unable to index the Spark jars and we investigated further... you have classes that look like this: org.spark-project.guava.annotations.VisibleForTesting Hyphens are not legal in package names according to the Java language spec, so I'm amazed that this can actually be read at runtime... certainly no compiler I know of would allow it. What I suspect is happening is that you're using a build plugin that internalises some of your dependencies and it is using your groupId but not validating it... and then blindly using that name in the ASM manipulation. You might want to report this upstream to your build plugin. For your next release, I recommend using an explicit name that is not your groupId, i.e. convert hyphens to underscores as Gosling recommends. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8975) Implement a mechanism to send a new rate from the driver to the block generator
Iulian Dragos created SPARK-8975: Summary: Implement a mechanism to send a new rate from the driver to the block generator Key: SPARK-8975 URL: https://issues.apache.org/jira/browse/SPARK-8975 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Full design doc [here|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]
- Add a new message, {{RateUpdate(newRate: Long)}}, that ReceiverSupervisor handles in its endpoint
- Add a new method to ReceiverTracker, {{def sendRateUpdate(streamId: Int, newRate: Long): Unit}}; this method sends an asynchronous RateUpdate message to the receiver supervisor corresponding to streamId
- Update the rate in the corresponding block generator (a sketch of the flow follows this message)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
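A minimal, self-contained Scala sketch of the flow proposed in SPARK-8975 above. Only {{RateUpdate}} and {{sendRateUpdate}} come from the ticket; the stub class names are hypothetical, and the real implementation would deliver the message asynchronously over an RPC endpoint rather than via a direct call:
{code}
// Hypothetical stubs illustrating the proposed driver-to-receiver rate-update path.
case class RateUpdate(newRate: Long)

class BlockGeneratorStub {
  @volatile private var rate: Long = Long.MaxValue
  // In the real block generator this value would feed the rate limiter.
  def updateRate(newRate: Long): Unit = { rate = newRate }
  def currentRate: Long = rate
}

class ReceiverSupervisorStub(blockGenerator: BlockGeneratorStub) {
  // Stand-in for the supervisor's RPC endpoint handler.
  def receive(message: Any): Unit = message match {
    case RateUpdate(newRate) => blockGenerator.updateRate(newRate)
    case _ => // ignore unrelated messages
  }
}

class ReceiverTrackerStub(supervisors: Map[Int, ReceiverSupervisorStub]) {
  // Fire-and-forget in the real design; a plain method call here.
  def sendRateUpdate(streamId: Int, newRate: Long): Unit =
    supervisors.get(streamId).foreach(_.receive(RateUpdate(newRate)))
}
{code}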
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622342#comment-14622342 ] RJ Nowling commented on SPARK-3644: --- [~joshrosen] Thanks for pointing to the new JIRA! :) REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen Assignee: Imran Rashid Fix For: 1.4.0 This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8976) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)
Olivier Delalleau created SPARK-8976: Summary: Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed) Key: SPARK-8976 URL: https://issues.apache.org/jira/browse/SPARK-8976 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Windows 7 Reporter: Olivier Delalleau See Github report: https://github.com/apache/spark/pull/5173#issuecomment-113410652 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh
Ben Zimmer created SPARK-8982: - Summary: Worker hostnames not showing in Master web ui when launched with start-slaves.sh Key: SPARK-8982 URL: https://issues.apache.org/jira/browse/SPARK-8982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Ben Zimmer Priority: Minor If a --host argument is not provided to Worker, WorkerArguments uses Utils.localHostName to find the host name. SPARK-6440 changed the functionality of Utils.localHostName to retrieve the local IP address instead of host name. Since start-slave.sh does not provide the --host argument, clusters started with start-slaves.sh now show IP addresses instead of hostnames in the Master web UI. This is inconvenient when starting and debugging small clusters. A simple fix would be to find the local machine's hostname in start-slave.sh and pass it as the --host argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh
[ https://issues.apache.org/jira/browse/SPARK-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622688#comment-14622688 ] Apache Spark commented on SPARK-8982: - User 'bdzimmer' has created a pull request for this issue: https://github.com/apache/spark/pull/7345 Worker hostnames not showing in Master web ui when launched with start-slaves.sh Key: SPARK-8982 URL: https://issues.apache.org/jira/browse/SPARK-8982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Ben Zimmer Priority: Minor If a --host argument is not provided to Worker, WorkerArguments uses Utils.localHostName to find the host name. SPARK-6440 changed the functionality of Utils.localHostName to retrieve the local IP address instead of host name. Since start-slave.sh does not provide the --host argument, clusters started with start-slaves.sh now show IP addresses instead of hostnames in the Master web UI. This is inconvenient when starting and debugging small clusters. A simple fix would be to find the local machine's hostname in start-slave.sh and pass it as the --host argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh
[ https://issues.apache.org/jira/browse/SPARK-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8982: --- Assignee: Apache Spark Worker hostnames not showing in Master web ui when launched with start-slaves.sh Key: SPARK-8982 URL: https://issues.apache.org/jira/browse/SPARK-8982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Ben Zimmer Assignee: Apache Spark Priority: Minor If a --host argument is not provided to Worker, WorkerArguments uses Utils.localHostName to find the host name. SPARK-6440 changed the functionality of Utils.localHostName to retrieve the local IP address instead of host name. Since start-slave.sh does not provide the --host argument, clusters started with start-slaves.sh now show IP addresses instead of hostnames in the Master web UI. This is inconvenient when starting and debugging small clusters. A simple fix would be to find the local machine's hostname in start-slave.sh and pass it as the --host argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6447) Add quick-links to StagePage to jump to Accumulator/Task tables
[ https://issues.apache.org/jira/browse/SPARK-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams resolved SPARK-6447. -- Resolution: Duplicate Add quick-links to StagePage to jump to Accumulator/Task tables --- Key: SPARK-6447 URL: https://issues.apache.org/jira/browse/SPARK-6447 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Ryan Williams Priority: Minor When there are many executors it is tedious to have to scroll down the page to find the start of the Accumulators / Tasks tables. We should add links at the top of the page that jump to a URL fragment bound to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8986) GaussianMixture should take smoothing param
Joseph K. Bradley created SPARK-8986: Summary: GaussianMixture should take smoothing param Key: SPARK-8986 URL: https://issues.apache.org/jira/browse/SPARK-8986 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Gaussian mixture models should take a smoothing parameter that makes the algorithm robust against degenerate data or bad initializations. Whoever takes this JIRA should look at other libraries (sklearn, R packages, Weka, etc.) to see how they do smoothing and what their API looks like. Please summarize your findings here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
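For intuition about what such a parameter typically does: many implementations smooth by adding a small ridge term to each component's covariance estimate, which keeps the matrix positive definite when a cluster degenerates. A hedged Scala sketch using Breeze (the function and parameter names are illustrative, not a proposed API):
{code}
import breeze.linalg.{DenseMatrix, DenseVector, diag}

// Illustrative only: add lambda to the covariance diagonal so the matrix
// stays invertible even when a cluster collapses onto too few points.
def smoothCovariance(cov: DenseMatrix[Double], lambda: Double): DenseMatrix[Double] = {
  require(cov.rows == cov.cols, "covariance must be square")
  cov + diag(DenseVector.fill(cov.rows)(lambda))
}
{code}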
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622704#comment-14622704 ] Matt Massie commented on SPARK-7263: [~rxin] What are your thoughts? I'd like to keep moving this forward. Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options (an illustrative configuration sketch follows this message):
spark.shuffle.parquet.compression - sets the Parquet compression codec
spark.shuffle.parquet.blocksize - sets the Parquet block size
spark.shuffle.parquet.pagesize - sets the Parquet page size
spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off
Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. Interesting future asides: o There is no need to define a data serializer (although Spark requires it) o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
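To make the configuration surface concrete, here is a hedged sketch of enabling the prototype; the option names are taken from the description above, while the specific values are illustrative only:
{code}
import org.apache.spark.SparkConf

// Option names come from the prototype description; values are examples.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "parquet")
  .set("spark.shuffle.parquet.compression", "snappy")
  .set("spark.shuffle.parquet.blocksize", (128 * 1024 * 1024).toString)
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
  .set("spark.shuffle.parquet.enabledictionary", "true")
{code}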
[jira] [Created] (SPARK-8983) ML Tuning Cross-Validation Improvements
Feynman Liang created SPARK-8983: Summary: ML Tuning Cross-Validation Improvements Key: SPARK-8983 URL: https://issues.apache.org/jira/browse/SPARK-8983 Project: Spark Issue Type: Umbrella Components: ML Reporter: Feynman Liang This is an umbrella for grouping together various improvements to pipeline tuning features, centralizing developer communication, and encouraging code reuse and common interfaces. We currently only support k-fold CV in {{CrossValidator}}, while competing packages (e.g. [R caret|http://topepo.github.io/caret/splitting.html]) are much more feature-rich, supporting balanced class labels, hold-out for time-series data, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-1301) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page
[ https://issues.apache.org/jira/browse/SPARK-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1301: --- Assignee: (was: Apache Spark) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page --- Key: SPARK-1301 URL: https://issues.apache.org/jira/browse/SPARK-1301 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Matei Zaharia Priority: Minor Labels: Starter This table is useful but it takes up a lot of space on larger clusters, hiding the more commonly accessed stage page. We could also move the table below if collapsing it is difficult. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8069: - Description: Consider adding support for cutoffs similar to http://cran.r-project.org/web/packages/randomForest/randomForest.pdf (Joseph) I just wrote a [little design doc | https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing] for this. was:Consider adding support for cutoffs similar to http://cran.r-project.org/web/packages/randomForest/randomForest.pdf Add support for cutoff to RandomForestClassifier Key: SPARK-8069 URL: https://issues.apache.org/jira/browse/SPARK-8069 Project: Spark Issue Type: Improvement Components: ML Reporter: holdenk Priority: Minor Original Estimate: 240h Remaining Estimate: 240h Consider adding support for cutoffs similar to http://cran.r-project.org/web/packages/randomForest/randomForest.pdf (Joseph) I just wrote a [little design doc | https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing] for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8985) Create a test harness to improve Spark's combinatorial test coverage of non-default configuration
Josh Rosen created SPARK-8985: - Summary: Create a test harness to improve Spark's combinatorial test coverage of non-default configuration Key: SPARK-8985 URL: https://issues.apache.org/jira/browse/SPARK-8985 Project: Spark Issue Type: Bug Components: Tests Reporter: Josh Rosen Large numbers of Spark bugs could be caught by running a trivial set of end-to-end tests with a non-standard SparkConf configuration. This ticket exists to assemble a list of such bugs and the configurations which would have caught them. I think that we should build a separate Jenkins harness which runs end-to-end tests across a huge configuration matrix in order to detect these issues. If the test configuration matrix grows to be too large to be tested daily, then we can explore combinatorial testing approaches to test fewer configurations while still achieving a high level of combinatorial coverage. **Bugs listed in order of the test configurations which would have caught them:** * spark.python.worker.reuse=false: ** SPARK-8976 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
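As a sketch of what enumerating such a configuration matrix could look like in Scala (the harness structure and the second option are assumptions; only spark.python.worker.reuse comes from the list above):
{code}
// Build the cross product of a small configuration matrix.
val options: Seq[(String, Seq[String])] = Seq(
  "spark.python.worker.reuse" -> Seq("true", "false"),
  "spark.serializer" -> Seq(
    "org.apache.spark.serializer.JavaSerializer",
    "org.apache.spark.serializer.KryoSerializer"))

// Fold each option into every partial configuration built so far.
val configs: Seq[Map[String, String]] =
  options.foldLeft(Seq(Map.empty[String, String])) { case (acc, (key, values)) =>
    for (partial <- acc; value <- values) yield partial + (key -> value)
  }

// Each resulting map would drive one end-to-end smoke-test run.
configs.foreach(c => println(c.mkString(", ")))
{code}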
[jira] [Commented] (SPARK-7210) Test matrix decompositions for speed vs. numerical stability for Gaussians
[ https://issues.apache.org/jira/browse/SPARK-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622737#comment-14622737 ] Joseph K. Bradley commented on SPARK-7210: -- More thoughts from Reza: We should consider degenerate cases, and to say we handle them correctly, we can compare with R as a reasonable gold standard. E.g., how does it handle normal PDFs when the covariance matrix is not full rank? Relatedly, we should add a smoothing parameter to GaussianMixture. That might actually be higher priority than this JIRA. I'll make a JIRA for that and link it from the umbrella. Test matrix decompositions for speed vs. numerical stability for Gaussians -- Key: SPARK-7210 URL: https://issues.apache.org/jira/browse/SPARK-7210 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor We currently use SVD for inverting the Gaussian's covariance matrix and computing the determinant. SVD is numerically stable but slow. We could experiment with Cholesky, etc. to figure out a better option, or a better option for certain settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8976) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)
[ https://issues.apache.org/jira/browse/SPARK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622623#comment-14622623 ] Josh Rosen commented on SPARK-8976: --- I think that this problem is Windows-specific. The code near line 149 of worker.py will typically not be executed on non-Windows machines as long as {{spark.python.worker.reuse=true}} (the default). I think the right fix is adding a regression test which tries running simple PySpark jobs with {{spark.python.worker.reuse=false}}, then fixing the underlying bug by passing "rwb" instead of "a+". If we get a regression test working on Jenkins, then we'll be able to verify that the fix is safe for Python 2 and 3 because Jenkins tests both of those Python versions. Would you like to submit a pull request for this? I'd do it myself but I'm a bit swamped with other work right now. Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed) Key: SPARK-8976 URL: https://issues.apache.org/jira/browse/SPARK-8976 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Windows 7 Reporter: Olivier Delalleau See Github report: https://github.com/apache/spark/pull/5173#issuecomment-113410652 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622666#comment-14622666 ] Paweł Kopiczko commented on SPARK-8981: --- Sure. I believe that when executor is spawned it has access to `appName` and `applicationId` properties of `SparkContext` instance. I'd like it to put these values in MDC https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String, java.lang.Object). Having those values in MDC and setting `%X{appName}` and `%X{applicationId}` in log4j's PatternLayout would allow filtering out specific application logs from a single file. Does that make sense? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. Also allows configuring a rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622666#comment-14622666 ] Paweł Kopiczko edited comment on SPARK-8981 at 7/10/15 6:00 PM: Sure. I believe that when executor is spawned it has access to appName and applicationId properties of `SparkContext` instance. I'd like it to put these values in MDC https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String, java.lang.Object). Having those values in MDC and setting %X{appName} and %X{applicationId} in log4j's PatternLayout would allow filtering out specific application logs from a single file. Does that make sense? was (Author: kopiczko): Sure. I believe that when executor is spawned it has access to `appName` and `applicationId` properties of `SparkContext` instance. I'd like it to put these values in MDC https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String, java.lang.Object). Having those values in MDC and setting `%X{appName}` and `%X{applicationId}` in log4j's PatternLayout would allow filtering out specific application logs from a single file. Does that make sense? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. Also allows configuring a rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622666#comment-14622666 ] Paweł Kopiczko edited comment on SPARK-8981 at 7/10/15 6:04 PM: Sure. I believe that when executor is spawned it has access to {{appName}} and {{applicationId}} properties of `SparkContext` instance. I'd like it to put these values in MDC https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String, java.lang.Object). Having those values in MDC and setting %X\{appName\} and %X\{applicationId\} in log4j's PatternLayout would allow filtering out specific application logs from a single file. Does that make sense? was (Author: kopiczko): Sure. I believe that when executor is spawned it has access to appName and applicationId properties of `SparkContext` instance. I'd like it to put these values in MDC https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String, java.lang.Object). Having those values in MDC and setting %X{appName} and %X{applicationId} in log4j's PatternLayout would allow filtering out specific application logs from a single file. Does that make sense? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. Also allows configuring a rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
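A minimal Scala sketch of the proposal in SPARK-8981 above, assuming some executor startup hook with access to both values; the hook and the pattern shown are illustrative, not the actual wiring:
{code}
import org.apache.log4j.MDC

// Called once per executor JVM before any application logging happens;
// how appName and applicationId reach this hook is left open here.
def registerLogContext(appName: String, applicationId: String): Unit = {
  MDC.put("appName", appName)
  MDC.put("applicationId", applicationId)
}

// A PatternLayout could then reference the keys (illustrative):
// log4j.appender.file.layout.ConversionPattern=%d %X{applicationId} %X{appName} %p %m%n
{code}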
[jira] [Assigned] (SPARK-1301) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page
[ https://issues.apache.org/jira/browse/SPARK-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1301: --- Assignee: Apache Spark Add UI elements to collapse Aggregated Metrics by Executor pane on stage page --- Key: SPARK-1301 URL: https://issues.apache.org/jira/browse/SPARK-1301 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Matei Zaharia Assignee: Apache Spark Priority: Minor Labels: Starter This table is useful but it takes up a lot of space on larger clusters, hiding the more commonly accessed stage page. We could also move the table below if collapsing it is difficult. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8069: - Assignee: holdenk Add support for cutoff to RandomForestClassifier Key: SPARK-8069 URL: https://issues.apache.org/jira/browse/SPARK-8069 Project: Spark Issue Type: Improvement Components: ML Reporter: holdenk Assignee: holdenk Priority: Minor Original Estimate: 240h Remaining Estimate: 240h Consider adding support for cutoffs similar to http://cran.r-project.org/web/packages/randomForest/randomForest.pdf (Joseph) I just wrote a [little design doc | https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing] for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8984) Developer documentation for ML Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622745#comment-14622745 ] Joseph K. Bradley commented on SPARK-8984: -- Linked attribute doc JIRA since attributes will be important (and fairly complex) for developers. Developer documentation for ML Pipelines Key: SPARK-8984 URL: https://issues.apache.org/jira/browse/SPARK-8984 Project: Spark Issue Type: Umbrella Components: Documentation, ML Reporter: Feynman Liang Priority: Minor This issue will track work on developer-specific documentation for the ML Pipelines API. The goal is to provide documentation for how to write custom estimators/transformers, various concepts (e.g. Params, attributes) and the rationale behind design decisions. We do not aim to duplicate the [ML programming guide|http://spark.apache.org/docs/latest/ml-guide.html]. Rather, the target audience is developers and contributors to ML pipelines. Documentation is currently read-only on [Google Docs|https://docs.google.com/document/d/1rRc2o8AIH2Y4U8A7P3yopbT-fAXV1u2UO3wBAQ-vQYM/edit?usp=sharing]. Please ask if you would like to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622529#comment-14622529 ] Joseph K. Bradley commented on SPARK-3155: -- I don't know if there is a nice paper explaining the implementation, but I do know it's quite standard based on hearsay, so I suspect there are papers or docs explaining it. The issue is still very relevant. No one is implementing the feature as far as I know. However, do be aware of [SPARK-7131], which has an open PR (to be merged soon, I hope). Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyway. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows:
(1) Train a depth-K tree using a training set.
(2) Compute the optimal prediction at each node (including internal nodes) based on the training set.
(3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2)).
(4) For each pair of leaves with the same parent, compare the total error on the validation set made by the leaves’ predictions with the error made by the parent’s predictions. Remove the leaves if the parent has lower error. (A toy sketch of this step follows this message.)
A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
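As a toy Scala illustration of step (4) above (this is not MLlib code; the node representation and the per-node validation-error bookkeeping are assumed to exist already):
{code}
// Each node carries the validation error its own prediction would incur.
sealed trait Node { def valError: Double }
case class Leaf(valError: Double) extends Node
case class Internal(valError: Double, left: Node, right: Node) extends Node

// Bottom-up pruning: collapse a pair of leaves whenever their combined
// validation error is no better than the parent's own prediction.
def prune(node: Node): Node = node match {
  case leaf: Leaf => leaf
  case Internal(err, left, right) =>
    (prune(left), prune(right)) match {
      case (Leaf(leftErr), Leaf(rightErr)) if leftErr + rightErr >= err => Leaf(err)
      case (prunedLeft, prunedRight) => Internal(err, prunedLeft, prunedRight)
    }
}
{code}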