[jira] [Commented] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156125#comment-14156125 ] Milan Straka commented on SPARK-3731: - I will get to it later today and attach a dataset and program which exhibit this behaviour locally. I believe I will find it because I saw this behaviour in many local runs. RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode or in standalone cluster mode Reporter: Milan Straka Attachments: worker.log Consider a file F which, when loaded with sc.textFile and cached, takes up slightly more than half of the free memory available for the RDD cache. When the following is executed in PySpark: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time no RDDs remain cached in memory. Also, from that point on, no other RDD ever gets cached (the worker always reports something like WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
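For reference, here is a Scala rendering of the reported sequence (a sketch: the file path is a placeholder and MEMORY_ONLY caching via cache() is assumed). The reporter notes that the equivalent Scala program keeps its cached partitions, which is what makes this look like a bug specific to the PySpark code path.
{code}
// Load a file twice so the two cached copies together slightly exceed the RDD cache,
// then repeatedly unpersist and re-cache each copy, as in the report.
val a = sc.textFile("/path/to/F")
a.cache().count()
val b = sc.textFile("/path/to/F")
b.cache().count()

for (_ <- 1 to 10) {
  a.unpersist().cache().count()   // unpersist() returns the RDD, so chaining works
  b.unpersist().cache().count()
}
{code}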
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156129#comment-14156129 ] Patrick Wendell commented on SPARK-3573: I think people are hung up on the term SQL - SchemaRDD is designed to simply represent richer types on top of the core RDD API. In fact we thought originally of naming the package schema instead of sql for exactly this reason. SchemaRDD is in the sql/core package right now, but we could pull the public interface of a SchemaRDD into another package in the future (and maybe we'd drop exposing anything about the logical plan here). I'd like to see a common representation of typed data be used across both SQL and MLlib and, longer term, other libraries as well. I don't see an insurmountable semantic gap between an R-style data frame and a relational table. In fact, if you look across other projects today, almost all of them are trying to unify these types of data representations. So I'd support seeing whether we can enhance or extend SchemaRDD to better support numeric data sets. And if we find there is just too large a gap here, then we could look at implementing a second dataset abstraction. If nothing else this is a test of whether SchemaRDD is sufficiently extensible to be useful in contexts beyond SQL (which is its original design). Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. Sample code: Suppose we have training events stored on HDFS and user/ad features in Hive, and we want to assemble features for training and then apply a decision tree. The proposed pipeline with dataset looks like the following (needs more refinement):
{code}
sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
val training = sqlContext.sql("""
  SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
         user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
         ad.targetGender AS targetGender
  FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()

val indexer = new Indexer()
val interactor = new Interactor()
val fvAssembler = new FeatureVectorAssembler()
val treeClassifier = new DecisionTreeClassifier()

val paramMap = new ParamMap()
  .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
  .put(indexer.sortByFrequency, true)
  .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
  .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
  .put(fvAssembler.dense, true)
  .put(treeClassifier.maxDepth, 4) // By default, classifier recognizes features and label columns.

val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
val model = pipeline.fit(training, paramMap)

sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
val test = sqlContext.sql("""
  SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
         user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
         ad.targetGender AS targetGender
  FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
val prediction = model.transform(test).select('eventId, 'prediction)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error
[ https://issues.apache.org/jira/browse/SPARK-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3371. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2511 [https://github.com/apache/spark/pull/2511] Spark SQL: Renaming a function expression with group by gives error --- Key: SPARK-3371 URL: https://issues.apache.org/jira/browse/SPARK-3371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Pei-Lun Lee Fix For: 1.2.0 {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val rdd = sc.parallelize(List({foo:bar})) sqlContext.jsonRDD(rdd).registerAsTable(t1) sqlContext.registerFunction(len, (s: String) = s.length) sqlContext.sql(select len(foo) as a, count(1) from t1 group by len(foo)).collect() {code} running above code in spark-shell gives the following error {noformat} 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: foo#0 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {noformat} remove as a in the query causes no error -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
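For readability, here is the reproduction with the string quoting and the => arrow that the tracker formatting stripped restored (a reconstruction of the snippet above, assuming an ordinary JSON string literal):
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", (s: String) => s.length)
// Fails with the TreeNodeException above when the aliased UDF is repeated in GROUP BY;
// dropping the "as a" alias avoids the error.
sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
{code}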
[jira] [Created] (SPARK-3762) clear all SparkEnv references after stop
Davies Liu created SPARK-3762: - Summary: clear all SparkEnv references after stop Key: SPARK-3762 URL: https://issues.apache.org/jira/browse/SPARK-3762 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Priority: Critical SparkEnv is cached in a ThreadLocal object, so after stopping one SparkContext and creating a new one, the old SparkEnv is still used by some threads, which triggers many problems. We should clear all references to a SparkEnv after it is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
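The hazard is the usual ThreadLocal-cache pattern; a minimal, self-contained illustration (not Spark's actual internals) of why a stop() path needs to drop the cached reference:
{code}
// A ThreadLocal cache keeps handing out the old instance on the threads that set it,
// even after a replacement has been created, unless it is cleared explicitly.
object EnvCache {
  private val cached = new ThreadLocal[AnyRef]()
  def set(env: AnyRef): Unit = cached.set(env)
  def get(): AnyRef = cached.get()
  def clearOnStop(): Unit = cached.remove() // the kind of cleanup this ticket asks for
}
{code}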
[jira] [Commented] (SPARK-3759) SparkSubmitDriverBootstrapper should return exit code of driver process
[ https://issues.apache.org/jira/browse/SPARK-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156157#comment-14156157 ] Eric Eijkelenboom commented on SPARK-3759: -- Yes, no problem! SparkSubmitDriverBootstrapper should return exit code of driver process --- Key: SPARK-3759 URL: https://issues.apache.org/jira/browse/SPARK-3759 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Environment: Linux, Windows, Scala/Java Reporter: Eric Eijkelenboom Priority: Minor Original Estimate: 24h Remaining Estimate: 24h SparkSubmitDriverBootstrapper.scala currently always returns exit code 0. Instead, it should return the exit code of the driver process. Suggested code change in SparkSubmitDriverBootstrapper, line 157: {code} val returnCode = process.waitFor() sys.exit(returnCode) {code} Workaround for this issue: Instead of specifying 'driver.extra*' properties in spark-defaults.conf, pass these properties to spark-submit directly. This will launch the driver program without the use of SparkSubmitDriverBootstrapper. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
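In context, the proposed change amounts to propagating the child JVM's status instead of always exiting 0. A hedged sketch (the command is illustrative, not the bootstrapper's real argument handling):
{code}
// Launch the driver JVM, wait for it, and exit with the same status code.
val driverCommand: Seq[String] = Seq("java", "-cp", "<classpath>", "MyDriver")
val process = new ProcessBuilder(driverCommand: _*).start()
val returnCode = process.waitFor()
sys.exit(returnCode)
{code}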
[jira] [Commented] (SPARK-3762) clear all SparkEnv references after stop
[ https://issues.apache.org/jira/browse/SPARK-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156159#comment-14156159 ] Apache Spark commented on SPARK-3762: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2624 clear all SparkEnv references after stop Key: SPARK-3762 URL: https://issues.apache.org/jira/browse/SPARK-3762 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu Priority: Critical SparkEnv is cached in a ThreadLocal object, so after stopping one SparkContext and creating a new one, the old SparkEnv is still used by some threads, which triggers many problems. We should clear all references to a SparkEnv after it is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Straka updated SPARK-3731: Attachment: spark-3731.py RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode or in standalone cluster mode Reporter: Milan Straka Attachments: spark-3731.py, worker.log Consider a file F which, when loaded with sc.textFile and cached, takes up slightly more than half of the free memory available for the RDD cache. When the following is executed in PySpark: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time no RDDs remain cached in memory. Also, from that point on, no other RDD ever gets cached (the worker always reports something like WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2461) Add a toString method to GeneralizedLinearModel
[ https://issues.apache.org/jira/browse/SPARK-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156175#comment-14156175 ] Apache Spark commented on SPARK-2461: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2625 Add a toString method to GeneralizedLinearModel --- Key: SPARK-2461 URL: https://issues.apache.org/jira/browse/SPARK-2461 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
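The request is simply for a readable textual summary of a trained model. A hedged sketch of what such an override might look like (illustrative only; the actual change in pull request 2625 may format things differently):
{code}
import org.apache.spark.mllib.linalg.Vector

// GeneralizedLinearModel exposes weights and an intercept, so a minimal toString
// could report the model class, the intercept and the number of features.
abstract class LinearModelSketch(val weights: Vector, val intercept: Double) {
  override def toString: String =
    s"${this.getClass.getSimpleName}: intercept = $intercept, numFeatures = ${weights.size}"
}
{code}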
[jira] [Commented] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156174#comment-14156174 ] Milan Straka commented on SPARK-3731: - I have attached a reproducible program, an input file and a log from the local run. You have to uncompress the input file before executing. I used an unmodified spark-1.1.0-bin-hadoop2.4.tgz. Before the for loop, 3 out of 4 partitions are cached. After the first iteration of the for loop only 2 are cached, after the second iteration only 1, and after the third iteration 0 partitions are cached. I believe this behaviour does not depend on the StorageLevel used; I am using MEMORY_ONLY so that the cached partitions are big even for an easily compressible file, but I have also encountered the issue when using MEMORY_ONLY_SER. I also believe that this behaviour is triggered only when some RDD partition does _not_ fit into memory. Before that happens, caching and uncaching work as expected. An equivalent Scala program seems to work fine. RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, both in local mode or in standalone cluster mode Reporter: Milan Straka Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, worker.log Consider a file F which, when loaded with sc.textFile and cached, takes up slightly more than half of the free memory available for the RDD cache. When the following is executed in PySpark: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time no RDDs remain cached in memory. Also, from that point on, no other RDD ever gets cached (the worker always reports something like WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes., even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1767) Prefer HDFS-cached replicas when scheduling data-local tasks
[ https://issues.apache.org/jira/browse/SPARK-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1767. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Colin Patrick McCabe Fixed by:https://github.com/apache/spark/pull/1486 Prefer HDFS-cached replicas when scheduling data-local tasks Key: SPARK-1767 URL: https://issues.apache.org/jira/browse/SPARK-1767 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Colin Patrick McCabe Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3007) Add Dynamic Partition support to Spark Sql hive
[ https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156225#comment-14156225 ] Apache Spark commented on SPARK-3007: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2626 Add Dynamic Partition support to Spark Sql hive --- Key: SPARK-3007 URL: https://issues.apache.org/jira/browse/SPARK-3007 Project: Spark Issue Type: Improvement Components: SQL Reporter: baishuo Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156236#comment-14156236 ] Ziv Huang commented on SPARK-3687: -- The following is the jstack dump of one CoarseGrainedExecutorBackend when the job hangs (the spark version is 1.1.0): Attach Listener daemon prio=10 tid=0x7fded0001000 nid=0x7836 waiting on condition [0x] java.lang.Thread.State: RUNNABLE Hashed wheel timer #1 daemon prio=10 tid=0x7fde9c001000 nid=0x7811 waiting on condition [0x7fdf26a84000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:503) at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:401) at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) at java.lang.Thread.run(Thread.java:745) New I/O server boss #6 daemon prio=10 tid=0x7fdeb4084000 nid=0x7810 runnable [0x7fdf26b85000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked 0x0007db53acc0 (a sun.nio.ch.Util$2) - locked 0x0007db53acb0 (a java.util.Collections$UnmodifiableSet) - locked 0x0007db53ab98 (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102) at org.jboss.netty.channel.socket.nio.NioServerBoss.select(NioServerBoss.java:163) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206) at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42) at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) New I/O worker #5 daemon prio=10 tid=0x7fdeb4037000 nid=0x780f runnable [0x7fdf26c86000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked 0x0007db529f98 (a sun.nio.ch.Util$2) - locked 0x0007db529f88 (a java.util.Collections$UnmodifiableSet) - locked 0x0007db529e70 (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:206) at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90) at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) New I/O worker #4 daemon prio=10 tid=0x7fdeb4032800 nid=0x780e runnable [0x7fdf26d87000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked 0x0007db528610 (a sun.nio.ch.Util$2) - locked 0x0007db528600 (a java.util.Collections$UnmodifiableSet) - locked 0x0007db5284e8 (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:64) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:409) at
[jira] [Created] (SPARK-3763) The example of building with sbt should be sbt assembly instead of sbt compile
Kousuke Saruta created SPARK-3763: - Summary: The example of building with sbt should be sbt assembly instead of sbt compile Key: SPARK-3763 URL: https://issues.apache.org/jira/browse/SPARK-3763 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Trivial In building-spark.md, there are several examples of building an assembled package with Maven, but the example for building with sbt only covers compiling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3759) SparkSubmitDriverBootstrapper should return exit code of driver process
[ https://issues.apache.org/jira/browse/SPARK-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156239#comment-14156239 ] Apache Spark commented on SPARK-3759: - User 'ericeijkelenboom' has created a pull request for this issue: https://github.com/apache/spark/pull/2628 SparkSubmitDriverBootstrapper should return exit code of driver process --- Key: SPARK-3759 URL: https://issues.apache.org/jira/browse/SPARK-3759 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Environment: Linux, Windows, Scala/Java Reporter: Eric Eijkelenboom Priority: Minor Original Estimate: 24h Remaining Estimate: 24h SparkSubmitDriverBootstrapper.scala currently always returns exit code 0. Instead, it should return the exit code of the driver process. Suggested code change in SparkSubmitDriverBootstrapper, line 157: {code} val returnCode = process.waitFor() sys.exit(returnCode) {code} Workaround for this issue: Instead of specifying 'driver.extra*' properties in spark-defaults.conf, pass these properties to spark-submit directly. This will launch the driver program without the use of SparkSubmitDriverBootstrapper. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3763) The example of building with sbt should be sbt assembly instead of sbt compile
[ https://issues.apache.org/jira/browse/SPARK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156238#comment-14156238 ] Apache Spark commented on SPARK-3763: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2627 The example of building with sbt should be sbt assembly instead of sbt compile -- Key: SPARK-3763 URL: https://issues.apache.org/jira/browse/SPARK-3763 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Trivial In building-spark.md, there are several examples of building an assembled package with Maven, but the example for building with sbt only covers compiling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
Takuya Ueshin created SPARK-3764: Summary: Invalid dependencies of artifacts in Maven Central Repository. Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my Spark applications locally using Spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the Hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the Spark artifacts in Maven Central were built against hadoop-2 with Maven, but the Hadoop version that {{pom.xml}} depends on remains 1.0.4, so a Hadoop version mismatch happens. FYI: sbt seems to publish an 'effective pom'-like pom file, so the dependencies are resolved correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2809) update chill to version 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2809: --- Summary: update chill to version 0.5.0 (was: update chill to version 0.4) update chill to version 0.5.0 - Key: SPARK-2809 URL: https://issues.apache.org/jira/browse/SPARK-2809 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First twitter chill_2.11 0.4 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156311#comment-14156311 ] Sean Owen commented on SPARK-3764: -- This is correct and as intended. Without any additional flags, yes, the version of Hadoop referenced by Spark would be 1.0.4. You should not rely on this though. If your app uses Spark but not Hadoop, it's not relevant as you are not packaging Spark or Hadoop dependencies in your app. If you use Spark and Hadoop APIs, you need to explicitly depend on the version of Hadoop you use on your cluster (but still not bundle with your app). Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2809) update chill to version 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156362#comment-14156362 ] Sean Owen commented on SPARK-2809: -- PS chill 0.5.0 is the first to support Scala 2.11, so now this is actionable. http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22chill_2.11%22 update chill to version 0.5.0 - Key: SPARK-2809 URL: https://issues.apache.org/jira/browse/SPARK-2809 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Guoqiang Li First twitter chill_2.11 0.4 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156365#comment-14156365 ] Igor Tkachenko edited comment on SPARK-3761 at 10/2/14 10:55 AM: - I've tried sbt 12.4, but unfortunately with no luck. I'd like to emphasize that I am using the spark-assembly lib with version _2.10;1.0.0-cdh5.1.0 from repository http://repository.cloudera.com/artifactory/repo/ as we are using the Cloudera image CDH 5.1.0 and can't use any other version due to serialization issues. was (Author: legart): I've tried sbt 12.4, but unfortunately with no luck. I'd like to emphasize that I am using with version _2.10;1.0.0-cdh5.1.0 from repository http://repository.cloudera.com/artifactory/repo/ as we are using cloudera image CDH 5.1.0 and can't use any other version due to serialization issues. Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 - Key: SPARK-3761 URL: https://issues.apache.org/jira/browse/SPARK-3761 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Igor Tkachenko Priority: Blocker I have this Scala code:
{code}
val master = "spark://server address:7077"
val sc = new SparkContext(new SparkConf()
  .setMaster(master)
  .setAppName("SparkQueryDemo 01")
  .set("spark.executor.memory", "512m"))

val count2 = sc
  .textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
  .filter(line => line.contains("Word"))
  .count()
{code}
I get this error:
{noformat}
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host server address: java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
{noformat}
My dependencies:
{code}
object Version {
  val spark = "1.0.0-cdh5.1.0"
}
object Library {
  val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
}
{code}
My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156365#comment-14156365 ] Igor Tkachenko commented on SPARK-3761: --- I've tried sbt 12.4, but unfortunately with no luck. I'd like to emphasize that I am using with version _2.10;1.0.0-cdh5.1.0 from repository http://repository.cloudera.com/artifactory/repo/ as we are using cloudera image CDH 5.1.0 and can't use any other version due to serialization issues. Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 - Key: SPARK-3761 URL: https://issues.apache.org/jira/browse/SPARK-3761 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Igor Tkachenko Priority: Blocker I have this Scala code:
{code}
val master = "spark://server address:7077"
val sc = new SparkContext(new SparkConf()
  .setMaster(master)
  .setAppName("SparkQueryDemo 01")
  .set("spark.executor.memory", "512m"))

val count2 = sc
  .textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
  .filter(line => line.contains("Word"))
  .count()
{code}
I get this error:
{noformat}
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host server address: java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
{noformat}
My dependencies:
{code}
object Version {
  val spark = "1.0.0-cdh5.1.0"
}
object Library {
  val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
}
{code}
My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156373#comment-14156373 ] Igor Tkachenko commented on SPARK-3761: --- Created the same bug in Cloudera Jira: https://issues.cloudera.org/browse/DISTRO-647 Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 - Key: SPARK-3761 URL: https://issues.apache.org/jira/browse/SPARK-3761 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Igor Tkachenko Priority: Blocker I have this Scala code:
{code}
val master = "spark://server address:7077"
val sc = new SparkContext(new SparkConf()
  .setMaster(master)
  .setAppName("SparkQueryDemo 01")
  .set("spark.executor.memory", "512m"))

val count2 = sc
  .textFile("hdfs://server address:8020/tmp/data/risk/account.txt")
  .filter(line => line.contains("Word"))
  .count()
{code}
I get this error:
{noformat}
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host server address: java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
{noformat}
My dependencies:
{code}
object Version {
  val spark = "1.0.0-cdh5.1.0"
}
object Library {
  val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % Version.spark
}
{code}
My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
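A ClassNotFoundException on a driver-defined anonymous function class (SimpleApp$$anonfun$1) usually means the application jar never reached the executors. This is a hedged suggestion, not a confirmed diagnosis for this report: build the job as an assembly and register the jar on the SparkConf (the jar path below is a placeholder for the sbt-assembly output).
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://server address:7077")
  .setAppName("SparkQueryDemo 01")
  // Ship the application's classes (including anonymous function classes) to the executors.
  .setJars(Seq("target/scala-2.10/sparkquerydemo-assembly-0.1.jar"))
val sc = new SparkContext(conf)
{code}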
[jira] [Commented] (SPARK-2809) update chill to version 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156440#comment-14156440 ] Guoqiang Li commented on SPARK-2809: The related work. https://github.com/apache/spark/pull/2615 update chill to version 0.5.0 - Key: SPARK-2809 URL: https://issues.apache.org/jira/browse/SPARK-2809 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Guoqiang Li First twitter chill_2.11 0.4 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
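For context, the coordinates involved, in sbt notation (Spark's actual build manages this through its Maven poms, so this line is only illustrative):
{code}
libraryDependencies += "com.twitter" %% "chill" % "0.5.0"  // first chill release published for Scala 2.11
{code}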
[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156464#comment-14156464 ] Alexis Seigneurin commented on SPARK-1834: -- Same issue here with Spark 1.1.0: reduce() is not implemented on JavaPairRDD.
{code}
Tuple2<String, Long> r = sc.textFile(filename)
    .mapToPair(s -> new Tuple2<String, Long>(s[3], 1L))
    .reduceByKey((x, y) -> x + y)
    .reduce((t1, t2) -> t1._2 > t2._2 ? t1 : t2);
{code}
Produces:
{code}
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
	at fr.ippon.dojo.spark.AnalyseParisTrees.main(AnalyseParisTrees.java:33)
{code}
However, reduce() is implemented on JavaRDD. I have had to add an intermediate map() operation:
{code}
Tuple2<String, Long> r = sc.textFile(filename)
    .mapToPair(s -> new Tuple2<String, Long>(s[3], 1L))
    .reduceByKey((x, y) -> x + y)
    .map(t -> t)
    .reduce((t1, t2) -> t1._2 > t2._2 ? t1 : t2);
{code}
NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace:
{noformat}
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39)
	at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
	at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)...
{noformat}
I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that exhibits the problem:
{code}
ArrayList<Integer> array = new ArrayList<Integer>();
for (int i = 0; i < 10; ++i) {
  array.add(i);
}
JavaRDD<Integer> rdd = javaSparkContext.parallelize(array);
JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() {
  @Override
  public Tuple2<String, Integer> call(Integer t) throws Exception {
    return new Tuple2("" + t, t);
  }
}).cache();

testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
  @Override
  public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception {
    return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2);
  }
});
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156477#comment-14156477 ] Sean Owen commented on SPARK-1834: -- Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am investigating. It compiles fine but fails at runtime. I sense Scala shenanigans. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace: Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)... I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that exhibits the problem: ArrayListInteger array = new ArrayList(); for (int i = 0; i 10; ++i) { array.add(i); } JavaRDDInteger rdd = javaSparkContext.parallelize(array); JavaPairRDDString, Integer testRDD = rdd.map(new PairFunctionInteger, String, Integer() { @Override public Tuple2String, Integer call(Integer t) throws Exception { return new Tuple2( + t, t); } }).cache(); testRDD.reduce(new Function2Tuple2String, Integer, Tuple2String, Integer, Tuple2String, Integer() { @Override public Tuple2String, Integer call(Tuple2String, Integer arg0, Tuple2String, Integer arg1) throws Exception { return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2); } }); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1834) NoSuchMethodError when invoking JavaPairRDD.reduce() in Java
[ https://issues.apache.org/jira/browse/SPARK-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156477#comment-14156477 ] Sean Owen edited comment on SPARK-1834 at 10/2/14 12:46 PM: Weird, I can reproduce this. It compiles fine but fails at runtime. Another example, that doesn't even use lambdas:
{code}
@Test
public void pairReduce() {
  JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 1, 2, 3, 5, 8, 13));
  JavaPairRDD<Integer, Integer> pairRDD = rdd.mapToPair(
    new PairFunction<Integer, Integer, Integer>() {
      @Override
      public Tuple2<Integer, Integer> call(Integer i) {
        return new Tuple2<Integer, Integer>(i, i + 1);
      }
    });
  // See SPARK-1834
  Tuple2<Integer, Integer> reduced = pairRDD.reduce(
    new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
      @Override
      public Tuple2<Integer, Integer> call(Tuple2<Integer, Integer> t1, Tuple2<Integer, Integer> t2) {
        return new Tuple2<Integer, Integer>(t1._1() + t2._1(), t1._2() + t2._2());
      }
    });
  Assert.assertEquals(33, reduced._1().intValue());
  Assert.assertEquals(40, reduced._2().intValue());
}
{code}
but...
{code}
java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
{code}
I decompiled the class and it really looks like the method is there with the expected signature:
{code}
public scala.Tuple2<K, V> reduce(org.apache.spark.api.java.function.Function2<scala.Tuple2<K, V>, scala.Tuple2<K, V>, scala.Tuple2<K, V>>);
{code}
Color me pretty confused. was (Author: srowen): Weird, I can reproduce this. I have a new test case for {{JavaAPISuite}} and am investigating. It compiles fine but fails at runtime. I sense Scala shenanigans. NoSuchMethodError when invoking JavaPairRDD.reduce() in Java Key: SPARK-1834 URL: https://issues.apache.org/jira/browse/SPARK-1834 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Environment: Redhat Linux, Java 7, Hadoop 2.2, Scala 2.10.4 Reporter: John Snodgrass I get a java.lang.NoSuchMethod error when invoking JavaPairRDD.reduce(). Here is the partial stack trace:
{noformat}
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:39)
	at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
	at JavaPairRDDReduceTest.main(JavaPairRDDReduceTest.java:49)...
{noformat}
I'm using Spark 0.9.1. I checked to ensure that I'm compiling with the same version of Spark as I am running on the cluster. The reduce() method works fine with JavaRDD, just not with JavaPairRDD. Here is a code snippet that exhibits the problem:
{code}
ArrayList<Integer> array = new ArrayList<Integer>();
for (int i = 0; i < 10; ++i) {
  array.add(i);
}
JavaRDD<Integer> rdd = javaSparkContext.parallelize(array);
JavaPairRDD<String, Integer> testRDD = rdd.map(new PairFunction<Integer, String, Integer>() {
  @Override
  public Tuple2<String, Integer> call(Integer t) throws Exception {
    return new Tuple2("" + t, t);
  }
}).cache();

testRDD.reduce(new Function2<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
  @Override
  public Tuple2<String, Integer> call(Tuple2<String, Integer> arg0, Tuple2<String, Integer> arg1) throws Exception {
    return new Tuple2(arg0._1 + arg1._1, arg0._2 * 10 + arg0._2);
  }
});
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3765) add testing with sbt to doc
wangfei created SPARK-3765: -- Summary: add testing with sbt to doc Key: SPARK-3765 URL: https://issues.apache.org/jira/browse/SPARK-3765 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3765) add testing with sbt to doc
[ https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156569#comment-14156569 ] Apache Spark commented on SPARK-3765: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2629 add testing with sbt to doc --- Key: SPARK-3765 URL: https://issues.apache.org/jira/browse/SPARK-3765 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156576#comment-14156576 ] Takuya Ueshin commented on SPARK-3764: -- Ah, so, these artifacts are only for hadoop-2 and if I want to use Hadoop APIs, I need to explicitly add dependencies to hadoop, right? Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156578#comment-14156578 ] Takuya Ueshin commented on SPARK-3764: -- Now I found the instruction [here|http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark] but this would not work for hadoop-1. I think we need some notice to lead hadoop-1 users to [Building Spark with Maven|http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version], and also need it at [Download Spark |http://spark.apache.org/downloads.html]. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156581#comment-14156581 ] Sean Owen commented on SPARK-3764: -- The artifacts themselves don't contain any Hadoop code. The default disposition of the pom would link to Hadoop 1, but apps are not meant to depend on this (this is generally good Maven practice). Yes, you always need to add Hadoop dependencies if you use Hadoop APIs. That's not specific to Spark. In fact, you will want to mark Spark and Hadoop as provided dependencies when making an app for use with spark-submit. You can use the Spark artifacts to build a Spark app that works with Hadoop 2 or Hadoop 1. The instructions you see are really about creating a build of Spark itself to deploy on a cluster, rather than an app for Spark. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
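To make the comment above concrete, a minimal sbt sketch of the layout described: the application compiles against Spark and against the Hadoop version actually running on the cluster, and marks both as provided so that neither is bundled into the application jar. The Hadoop version shown (2.4.0) is an assumption; substitute whatever your cluster runs.
{code}
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
)
{code}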
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156588#comment-14156588 ] Takuya Ueshin commented on SPARK-3764: -- But there is some code in Spark itself that uses binary-incompatible Hadoop APIs, which is what causes my original problem, so hadoop-1 users need to rebuild Spark itself. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the hadoop version that {{pom.xml}} depends on remains 1.0.4, so a hadoop version mismatch happens. FYI: sbt seems to publish an 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156599#comment-14156599 ] Takuya Ueshin commented on SPARK-3764: -- Ah, I see that {{context.getTaskAttemptID}} at [ParquetTableOperations.scala:334|https://github.com/apache/spark/blob/6e27cb630de69fa5acb510b4e2f6b980742b1957/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L334] is breaking binary-compatibility of Spark itself. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156623#comment-14156623 ] Guoqiang Li commented on SPARK-1405: Hi everyone [The PR 2388|https://github.com/apache/spark/pull/2388] is OK to review. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2811) update algebird to 0.8
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2811: --- Summary: update algebird to 0.8 (was: update algebird to 0.7) update algebird to 0.8 -- Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First algebird_2.11 0.7.0 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2811) update algebird to 0.8
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2811: --- Description: First algebird_2.11 0.8.1 has to be released (was: First algebird_2.11 0.7.0 has to be released) update algebird to 0.8 -- Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2811: --- Summary: update algebird to 0.8.1 (was: update algebird to 0.8) update algebird to 0.8.1 Key: SPARK-2811 URL: https://issues.apache.org/jira/browse/SPARK-2811 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3623) Graph should support the checkpoint operation
[ https://issues.apache.org/jira/browse/SPARK-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156664#comment-14156664 ] Apache Spark commented on SPARK-3623: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/2631 Graph should support the checkpoint operation - Key: SPARK-3623 URL: https://issues.apache.org/jira/browse/SPARK-3623 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2, 1.1.0 Reporter: Guoqiang Li Priority: Critical Consider the following code: {code} for (i <- 0 until totalIter) { val previousCorpus = corpus logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel) globalTopicCounter = collectGlobalCounter(corpus, numTopics) assert(bsum(globalTopicCounter) == sumTerms) previousCorpus.unpersistVertices() corpusTopicDist.unpersistVertices() corpusSampleTopics.unpersistVertices() } {code} If there is no checkpoint operation, the following problems appear. 1. The dependency lineage of the corpus RDD becomes too deep 2. The shuffle files become too large. 3. Any server crash forces the algorithm to recalculate -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156663#comment-14156663 ] Apache Spark commented on SPARK-3625: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/2631 In some cases, the RDD.checkpoint does not work --- Key: SPARK-3625 URL: https://issues.apache.org/jira/browse/SPARK-3625 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Guoqiang Li Assignee: Guoqiang Li The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} This limit is too strict , This makes it difficult to implement SPARK-3623 . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3766) Snappy is also the default compression codec for broadcast variables
wangfei created SPARK-3766: -- Summary: Snappy is also the default compression codec for broadcast variables Key: SPARK-3766 URL: https://issues.apache.org/jira/browse/SPARK-3766 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3766) Snappy is also the default compression codec for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3766: --- Component/s: Documentation Snappy is also the default compression codec for broadcast variables Key: SPARK-3766 URL: https://issues.apache.org/jira/browse/SPARK-3766 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3765) add testing with sbt to doc
[ https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3765: --- Component/s: Documentation add testing with sbt to doc --- Key: SPARK-3765 URL: https://issues.apache.org/jira/browse/SPARK-3765 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3766) Snappy is also the default compression codec for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156680#comment-14156680 ] Apache Spark commented on SPARK-3766: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2632 Snappy is also the default compression codec for broadcast variables Key: SPARK-3766 URL: https://issues.apache.org/jira/browse/SPARK-3766 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156699#comment-14156699 ] Evan Sparks commented on SPARK-1405: Hi Guoqiang - is it correct that your runtimes are reported in minutes as opposed to seconds? In your tests, have you cached the input data? 45 minutes for 150 iterations over this small dataset seems slow to me. It would be great to get an idea of where the bottleneck is coming from. Is it the Gibbs step or something else? Is it possible to share the dataset you used for these experiments? Thanks! parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156723#comment-14156723 ] Sean Owen commented on SPARK-3764: -- I'm not sure what you mean. Spark compiles against most versions of Hadoop 1 and 2. You can see the profiles in the build that support this. These are, however, not relevant to someone who is just building a Spark app. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the hadoop version that {{pom.xml}} depends on remains 1.0.4, so a hadoop version mismatch happens. FYI: sbt seems to publish an 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset
[ https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156776#comment-14156776 ] cocoatomo commented on SPARK-3706: -- Thank you for the comment and modification, [~joshrosen]. Taking a quick look, this regression was introduced at the commit [f38fab97c7970168f1bd81d4dc202e36322c95e3|https://github.com/apache/spark/commit/f38fab97c7970168f1bd81d4dc202e36322c95e3#diff-5dbcb82caf8131d60c73e82cf8d12d8aR107] on the master branch. Pushing ipython aside into a mere default value forces us to set PYSPARK_PYTHON to ipython explicitly, since PYSPARK_PYTHON defaults to python at the top of the ./bin/pyspark script. This issue is a regression between 1.1.0 and 1.2.0, and therefore affects only 1.2.0. Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset Key: SPARK-3706 URL: https://issues.apache.org/jira/browse/SPARK-3706 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark h3. Problem The section Using the shell in the Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run the pyspark REPL through IPython. But the following command does not run IPython but the default Python executable. {quote} $ IPYTHON=1 ./bin/pyspark Python 2.7.8 (default, Jul 2 2014, 10:14:46) ... {quote} the spark/bin/pyspark script on the commit b235e013638685758885842dc3268e9800af3678 decides which executable and options it uses in the following way. # if PYSPARK_PYTHON unset #* → defaulting to python # if IPYTHON_OPTS set #* → set IPYTHON 1 # some python scripts passed to ./bin/pyspark → run it with ./bin/spark-submit #* out of this issue's scope # if IPYTHON set as 1 #* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS #* otherwise execute $PYSPARK_PYTHON Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is 1. In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on deciding which command to use. ||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command|| |(unset → defaults to python)|(unset)|(unset)|python|(same)| |(unset → defaults to python)|(unset)|1|python|ipython| |(unset → defaults to python)|an_option|(unset → set to 1)|python an_option|ipython an_option| |(unset → defaults to python)|an_option|1|python an_option|ipython an_option| |ipython|(unset)|(unset)|ipython|(same)| |ipython|(unset)|1|ipython|(same)| |ipython|an_option|(unset → set to 1)|ipython an_option|(same)| |ipython|an_option|1|ipython an_option|(same)| h3. Suggestion The pyspark script should first determine whether the user wants to run IPython or another executable. # if IPYTHON_OPTS set #* set IPYTHON 1 # if IPYTHON has a value 1 #* PYSPARK_PYTHON defaults to ipython if not set # PYSPARK_PYTHON defaults to python if not set See the pull request for the detailed modification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156840#comment-14156840 ] Norman He commented on SPARK-2447: -- Hi Ted, I am very glad to see the hbase RDD work. I am probably going to use it in its current form. I like the idea of having the worker node manage the HBaseConnection. Somehow I have not seen any code related to HConnectionStaticCache? Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate Jira; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156865#comment-14156865 ] Ted Malaska commented on SPARK-2447: Hey Norman, yes, the github project has been used by a couple of clients now. It should be pretty hardened. Let me know if you find any issues. I will hopefully run into TD at Hadoop World and I will work out how to get this into Spark. Thanks for the comment. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate Jira; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3767) Support wildcard in Spark properties
Andrew Or created SPARK-3767: Summary: Support wildcard in Spark properties Key: SPARK-3767 URL: https://issues.apache.org/jira/browse/SPARK-3767 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or If the user sets spark.executor.extraJavaOptions, he/she may want to express the value in terms of the executor ID, for instance. In general it would be a feature that many will find useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156904#comment-14156904 ] Nicholas Chammas commented on SPARK-2870: - [~marmbrus] - A related feature that I think would be very important and useful is the ability to infer a complete schema as described here, but do so by key. i.e. something like {{inferSchemaByKey()}} Say I have a large, single RDD of data that includes many different event types. I want to key the RDD by event type and make a single pass over it to get the schema for each event type. This would probably yield something like a {{keyedSchemaRDD}} which I would want to register as multiple tables (one table per key/schema) in one go. Do you think this would be a useful feature? If so, should I track it in a separate JIRA issue? Thorough schema inference directly on RDDs of Python dictionaries - Key: SPARK-2870 URL: https://issues.apache.org/jira/browse/SPARK-2870 Project: Spark Issue Type: Improvement Components: PySpark, SQL Reporter: Nicholas Chammas h4. Background I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set. This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example: {code} {a: 5} {a: cow} {code} To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well. h4. Feature Request What we need is for {{SQlContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this: {code} SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) {code} As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable. h4. Example Use Case * You have some JSON text data that you want to analyze using Spark SQL. * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation. * You deserialize the JSON objects to Python {{dict}} s and filter out the bad ones. You now have an RDD of dictionaries. * From this RDD, you want a SchemaRDD that captures the schema for the whole data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset
[ https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3706. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2554 [https://github.com/apache/spark/pull/2554] Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset Key: SPARK-3706 URL: https://issues.apache.org/jira/browse/SPARK-3706 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark Fix For: 1.2.0 h3. Problem The section Using the shell in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython. But a folloing command does not run IPython but a default Python executable. {quote} $ IPYTHON=1 ./bin/pyspark Python 2.7.8 (default, Jul 2 2014, 10:14:46) ... {quote} the spark/bin/pyspark script on the commit b235e013638685758885842dc3268e9800af3678 decides which executable and options it use folloing way. # if PYSPARK_PYTHON unset #* → defaulting to python # if IPYTHON_OPTS set #* → set IPYTHON 1 # some python scripts passed to ./bin/pyspak → run it with ./bin/spark-submit #* out of this issues scope # if IPYTHON set as 1 #* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS #* otherwise execute $PYSPARK_PYTHON Therefore, when PYSPARK_PYTHON is unset, python is executed though IPYTHON is 1. In other word, when PYSPARK_PYTHON is unset, IPYTHON_OPS and IPYTHON has no effect on decide which command to use. ||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command|| |(unset → defaults to python)|(unset)|(unset)|python|(same)| |(unset → defaults to python)|(unset)|1|python|ipython| |(unset → defaults to python)|an_option|(unset → set to 1)|python an_option|ipython an_option| |(unset → defaults to python)|an_option|1|python an_option|ipython an_option| |ipython|(unset)|(unset)|ipython|(same)| |ipython|(unset)|1|ipython|(same)| |ipython|an_option|(unset → set to 1)|ipython an_option|(same)| |ipython|an_option|1|ipython an_option|(same)| h3. Suggestion The pyspark script should determine firstly whether a user wants to run IPython or other executables. # if IPYTHON_OPTS set #* set IPYTHON 1 # if IPYTHON has a value 1 #* PYSPARK_PYTHON defaults to ipython if not set # PYSPARK_PYTHON defaults to python if not set See the pull request for more detailed modification. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3105) Calling cache() after RDDs are pipelined has no effect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156936#comment-14156936 ] Nicholas Chammas commented on SPARK-3105: - I think it's definitely important for the larger project that the 3 APIs (Scala, Java, and Python) have semantics that are as consistent as possible. And for what it's worth, the Scala/Java semantics in this case seem nicer. People learn about the DAG as a distinguishing feature of Spark, so it might seem strange in PySpark that caching an RDD earlier in a lineage confers no benefit on descendent RDDs. Whether the descendent RDDs were defined before or after the caching seems like something people shouldn't have to think about. It sounds like this is a non-trivial change to make, and I don't appreciate the other implications it might have, but it seems like a good thing to me. Calling cache() after RDDs are pipelined has no effect in PySpark - Key: SPARK-3105 URL: https://issues.apache.org/jira/browse/SPARK-3105 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.1.0 Reporter: Josh Rosen Assignee: Josh Rosen PySpark's PipelinedRDD decides whether to pipeline transformations by checking whether those transformations are pipelinable _at the time that the PipelinedRDD objects are created_ rather than at the time that we invoke actions. This might lead to problems if we call {{cache()}} on an RDD after it's already been used in a pipeline: {code} rdd1 = sc.parallelize(range(100)).map(lambda x: x) rdd2 = rdd1.map(lambda x: 2 * x) rdd1.cache() rdd2.collect() {code} When I run this code, I'd expect {cache()}} to break the pipeline and cache intermediate results, but instead the two transformations are pipelined together in Python, effectively ignoring the {{cache()}}. Note that {{cache()}} works properly if we call it before performing any other transformations on the RDD: {code} rdd1 = sc.parallelize(range(100)).map(lambda x: x).cache() rdd2 = rdd1.map(lambda x: 2 * x) rdd2.collect() {code} This works as expected and caches {{rdd1}}. To fix this, I think we dynamically decide whether to pipeline when we actually perform actions, rather than statically deciding when we create the RDDs. We should also add tests for this. (Thanks to [~tdas] for pointing out this issue.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1270) An optimized gradient descent implementation
[ https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156964#comment-14156964 ] Peng Cheng commented on SPARK-1270: --- Yo, any follow up story on this one? I'm curious to know the local update part, as DistBelief has non-local model server shards. An optimized gradient descent implementation Key: SPARK-1270 URL: https://issues.apache.org/jira/browse/SPARK-1270 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xusen Yin Labels: GradientDescent, MLLib, Fix For: 1.0.0 Current implementation of GradientDescent is inefficient in some aspects, especially in high-latency network. I propose a new implementation of GradientDescent, which follows a parallelism model called GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric Xing's SSP. With a few modifications of runMiniBatchSGD, the GradientDescentWithLocalUpdate can outperform the original sequential version by about 4x without sacrificing accuracy, and can be easily adopted by most classification and regression algorithms in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3768) Modify default YARN memory_overhead-- from an additive constant to a multiplier
[ https://issues.apache.org/jira/browse/SPARK-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3768. -- Resolution: Fixed Modify default YARN memory_overhead-- from an additive constant to a multiplier --- Key: SPARK-3768 URL: https://issues.apache.org/jira/browse/SPARK-3768 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Nishkam Ravi Fix For: 1.2.0 Related to #894 and https://issues.apache.org/jira/browse/SPARK-2398 Experiments show that memory_overhead grows with container size. The multiplier has been experimentally obtained and can potentially be improved over time. https://github.com/apache/spark/pull/2485 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3768) Modify default YARN memory_overhead-- from an additive constant to a multiplier
Thomas Graves created SPARK-3768: Summary: Modify default YARN memory_overhead-- from an additive constant to a multiplier Key: SPARK-3768 URL: https://issues.apache.org/jira/browse/SPARK-3768 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Nishkam Ravi Fix For: 1.2.0 Related to #894 and https://issues.apache.org/jira/browse/SPARK-2398 Experiments show that memory_overhead grows with container size. The multiplier has been experimentally obtained and can potentially be improved over time. https://github.com/apache/spark/pull/2485 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
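As a rough illustration of the additive-constant vs. multiplier distinction this issue describes, the multiplier approach can be sketched as below; the factor and floor here are placeholders for illustration only, and the actual defaults are defined in the linked pull request.
{code}
// Hypothetical sketch only: the overhead grows with the requested executor memory
// instead of being a fixed constant, with a floor so small executors still get a minimum.
val overheadFactor = 0.07   // assumed value for illustration
val overheadMinMB  = 384    // assumed value for illustration
def yarnMemoryOverheadMB(executorMemoryMB: Int): Int =
  math.max((overheadFactor * executorMemoryMB).toInt, overheadMinMB)
// e.g. yarnMemoryOverheadMB(2048) == 384, yarnMemoryOverheadMB(20480) == 1433
{code}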
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157047#comment-14157047 ] David Martinez Rego commented on SPARK-1473: Sorry for having my name incomplete when I first posted. I am David Martinez, currently at UCL in London. We had this project abandoned for some time, but we will restart a pull request shortly. You can see the current version of the code at https://github.com/LIDIAgroup/SparkFeatureSelection. In response to a past post, the framework that Dr Gavin Brown presents is a single unified framework because it does not make any assumptions when stating the basic probabilistic model for the problem of FS. The problem of probability estimation that he mentions is not a philosophical question, and it is not only a shortcoming of feature selection. The problem is that the number of parameters you need to estimate grows exponentially with the number of variables if you do not make any independence assumption (all possible events). So, to obtain good estimates of these parameters (probabilities of events), you need an exponential number of samples (to observe all possible events you need an exponential number of observations). That is why you need to make independence assumptions and follow a greedy strategy to be able to draw some conclusions. Feature selection for high dimensional datasets --- Key: SPARK-1473 URL: https://issues.apache.org/jira/browse/SPARK-1473 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ignacio Zendejas Assignee: Alexander Ulanov Priority: Minor Labels: features For classification tasks involving large feature spaces in the order of tens of thousands or higher (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter features that are irrelevant, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall). A feature evaluation interface which is flexible needs to be designed and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be amongst the most reliable. Special consideration should be taken in the design to account for wrapper methods (see research papers below), which are more practical for lower dimensional data. Relevant research: * Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. * Forman, George. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3 (2003): 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
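The exponential blow-up described in the comment above is easy to see with a small calculation (an illustration only, not project code): a full joint distribution over d binary features, with no independence assumptions, has 2^d - 1 free parameters to estimate.
{code}
// Number of free parameters of a full joint distribution over d binary variables.
val counts = Seq(10, 20, 30).map(d => d -> ((1L << d) - 1))
// List((10,1023), (20,1048575), (30,1073741823))
{code}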
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157115#comment-14157115 ] Apache Spark commented on SPARK-3261: - User 'derrickburns' has created a pull request for this issue: https://github.com/apache/spark/pull/2634 KMeans clusterer can return duplicate cluster centers - Key: SPARK-3261 URL: https://issues.apache.org/jira/browse/SPARK-3261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns This is a bad design choice. I think that it is preferable to produce no duplicate cluster centers. So instead of forcing the number of clusters to be K, return at most K clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3218) K-Means clusterer can fail on degenerate data
[ https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157113#comment-14157113 ] Apache Spark commented on SPARK-3218: - User 'derrickburns' has created a pull request for this issue: https://github.com/apache/spark/pull/2634 K-Means clusterer can fail on degenerate data - Key: SPARK-3218 URL: https://issues.apache.org/jira/browse/SPARK-3218 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to cluster centers. However, if there are fewer than k DISTINCT points in the data set, this approach will fail. Further, the recent checkin to work around this problem results in selection of the same point repeatedly as a cluster center. The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation. I have a version of the code that addresses this problem, AND generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157114#comment-14157114 ] Apache Spark commented on SPARK-3219: - User 'derrickburns' has created a pull request for this issue: https://github.com/apache/spark/pull/2634 K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
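To make the pluggable distance function idea in SPARK-3219 concrete, a sketch of what such an interface could look like follows (hypothetical names, not the MLlib API and not the code referenced in the pull request): squared Euclidean distance and generalized KL divergence are both Bregman divergences, so a clusterer written against such a trait would cover both.
{code}
// Hypothetical sketch of a pluggable divergence for k-means-style clustering.
trait PointDivergence extends Serializable {
  def divergence(x: Array[Double], y: Array[Double]): Double
}

object SquaredEuclidean extends PointDivergence {
  def divergence(x: Array[Double], y: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < x.length) { val d = x(i) - y(i); s += d * d; i += 1 }
    s
  }
}

// Generalized KL (I-divergence), a Bregman divergence defined for non-negative data.
object GeneralizedKL extends PointDivergence {
  def divergence(x: Array[Double], y: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < x.length) {
      s += (if (x(i) > 0.0) x(i) * math.log(x(i) / y(i)) - x(i) + y(i) else y(i))
      i += 1
    }
    s
  }
}
{code}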
[jira] [Commented] (SPARK-3424) KMeans Plus Plus is too slow
[ https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157116#comment-14157116 ] Apache Spark commented on SPARK-3424: - User 'derrickburns' has created a pull request for this issue: https://github.com/apache/spark/pull/2634 KMeans Plus Plus is too slow Key: SPARK-3424 URL: https://issues.apache.org/jira/browse/SPARK-3424 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns The KMeansPlusPlus algorithm is implemented in time O( m k^2 ), where m is the number of rounds of the KMeansParallel algorithm and k is the number of clusters. This can be dramatically improved by maintaining the distance to the closest cluster center from round to round and then incrementally updating that value for each point. This incremental update is O(1) time, which reduces the running time for KMeans Plus Plus to O( m k ). For large k, this is significant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
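The O( m k^2 ) vs O( m k ) point in SPARK-3424 comes down to maintaining, for every point, its distance to the nearest center chosen so far, and updating that value with a single comparison when a new center is added, instead of rescanning all chosen centers. A minimal local (single-machine) sketch of that bookkeeping, with illustrative names and no relation to the actual MLlib code:
{code}
// Seeding sketch: closest(i) always holds the distance from points(i) to its
// nearest chosen center, so each new center costs O(n) instead of O(n * k).
def seedCenters(points: Array[Array[Double]], k: Int,
                dist: (Array[Double], Array[Double]) => Double,
                rand: scala.util.Random): Array[Array[Double]] = {
  val centers = scala.collection.mutable.ArrayBuffer(points(rand.nextInt(points.length)))
  val closest = points.map(p => dist(p, centers.head))
  while (centers.length < k) {
    // Sample the next center with probability proportional to closest(i).
    var target = rand.nextDouble() * closest.sum
    var idx = 0
    while (idx < closest.length - 1 && target > closest(idx)) { target -= closest(idx); idx += 1 }
    centers += points(idx)
    // Incremental update: one comparison per point against the new center only.
    var i = 0
    while (i < points.length) {
      closest(i) = math.min(closest(i), dist(points(i), points(idx)))
      i += 1
    }
  }
  centers.toArray
}
{code}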
[jira] [Created] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
Tom Weber created SPARK-3769: Summary: SparkFiles.get gives me the wrong fully qualified path Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.1.0, 1.0.2 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc =new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3766) Snappy is also the default compression codec for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3766. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: wangfei Snappy is also the default compression codec for broadcast variables Key: SPARK-3766 URL: https://issues.apache.org/jira/browse/SPARK-3766 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157156#comment-14157156 ] Sean Owen commented on SPARK-3769: -- My understanding is that you execute: {code} sc.addFile("/opt/tom/SparkFiles.sas"); ... SparkFiles.get("SparkFiles.sas"); {code} I would not expect the key used by remote workers to depend on the location on the driver that the file came from. The path may not be absolute in all cases anyway. I can see the argument that it feels like both should be the same key, but really the key being set is the file name, not the path. You don't have to parse it by hand, though. Usually you might do something like this anyway: {code} File myFile = new File(args[1]); sc.addFile(myFile.getAbsolutePath()); String fileName = myFile.getName(); ... SparkFiles.get(fileName); {code} AFAIK this is as intended. SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc =new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3495) Block replication fails continuously when the replication target node is dead
[ https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3495: --- Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Block replication fails continuously when the replication target node is dead - Key: SPARK-3495 URL: https://issues.apache.org/jira/browse/SPARK-3495 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of the B, A continues to try replicating to B and failing miserably. The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication
[ https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3496. Resolution: Fixed Fix Version/s: 1.2.0 Block replication can by mistake choose driver BlockManager as a peer for replication - Key: SPARK-3496 URL: https://issues.apache.org/jira/browse/SPARK-3496 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 When selecting peer block managers for replicating a block, the driver block manager can also get chosen accidentally. This is because BlockManagerMasterActor did not filter out the driver block manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3495) Block replication fails continuously when the replication target node is dead
[ https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3495. Resolution: Fixed Fix Version/s: 1.2.0 Block replication fails continuously when the replication target node is dead - Key: SPARK-3495 URL: https://issues.apache.org/jira/browse/SPARK-3495 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of the B, A continues to try replicating to B and failing miserably. The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication
[ https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3496: --- Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Block replication can by mistake choose driver BlockManager as a peer for replication - Key: SPARK-3496 URL: https://issues.apache.org/jira/browse/SPARK-3496 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.2.0 When selecting peer block managers for replicating a block, the driver block manager can also get chosen accidentally. This is because BlockManagerMasterActor did not filter out the driver block manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on
[ https://issues.apache.org/jira/browse/SPARK-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3632. Resolution: Fixed Fix Version/s: 1.2.0 ConnectionManager can run out of receive threads with authentication on --- Key: SPARK-3632 URL: https://issues.apache.org/jira/browse/SPARK-3632 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Fix For: 1.2.0 If you turn authentication on and you are using a lot of executors, there is a chance that all of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
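Editorial note: the snippet below only illustrates the failure mode and one hedged mitigation (bounded waits so a shared pool cannot be exhausted by blocking handshakes); it is not the actual ConnectionManager change, and all names are hypothetical.
{code}
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object AuthWaitSketch {
  // Wait for authentication with a timeout so a pooled handler thread never parks forever.
  def sendAfterAuth(authDone: CountDownLatch, send: () => Unit, timeoutSec: Long = 30): Boolean = {
    if (authDone.await(timeoutSec, TimeUnit.SECONDS)) { send(); true } else false
  }

  def main(args: Array[String]): Unit = {
    val authDone = new CountDownLatch(1)
    val pool = Executors.newFixedThreadPool(2)   // stands in for handleMessageExecutor
    pool.submit(new Runnable {
      def run(): Unit = { sendAfterAuth(authDone, () => println("message sent")) }
    })
    authDone.countDown()                         // authentication completes; the queued send proceeds
    pool.shutdown()
  }
}
{code}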
[jira] [Created] (SPARK-3770) The userFeatures RDD from MatrixFactorizationModel isn't accessible from the python bindings
Michelangelo D'Agostino created SPARK-3770: -- Summary: The userFeatures RDD from MatrixFactorizationModel isn't accessible from the python bindings Key: SPARK-3770 URL: https://issues.apache.org/jira/browse/SPARK-3770 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Michelangelo D'Agostino We need access to the underlying latent user features from python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the python bindings. I've fixed this with a PR that I'll submit shortly that adds a method to the underlying scala class to turn the RDD[(Int, Array[Double])] to an RDD[String]. This is then accessed from the python recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
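Editorial note: a hedged sketch of the kind of Scala helper the description mentions; {{featuresToStrings}} is a hypothetical name and the string format is illustrative only, not necessarily what the PR uses.
{code}
import org.apache.spark.rdd.RDD

object UserFeaturesSketch {
  // Serialize each (id, features) pair into a line the Python side can split on ':' and ','.
  def featuresToStrings(userFeatures: RDD[(Int, Array[Double])]): RDD[String] =
    userFeatures.map { case (id, values) => s"$id:${values.mkString(",")}" }
}
{code}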
[jira] [Resolved] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-1284. --- Resolution: Fixed Fix Version/s: 1.1.0 I think this is a logging issue; it should be fixed by https://github.com/apache/spark/pull/1625, so I am closing it. If anyone meets this again, we can reopen it. pyspark hangs after IOError on Executor --- Key: SPARK-1284 URL: https://issues.apache.org/jira/browse/SPARK-1284 Project: Spark Issue Type: Bug Components: PySpark Reporter: Jim Blomo Assignee: Davies Liu Fix For: 1.1.0 When running a reduceByKey over a cached RDD, Python fails with an exception, but the failure is not detected by the task runner. Spark and the pyspark shell hang waiting for the task to finish. The error is: {code} PySpark worker failed with exception: Traceback (most recent call last): File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in dump_stream self._write_with_length(obj, stream) File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in _write_with_length stream.write(serialized) IOError: [Errno 104] Connection reset by peer 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 4257 bytes in 47 ms Traceback (most recent call last): File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in launch_worker worker(listen_sock) File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} I can reproduce the error by running take(10) on the cached RDD before running reduceByKey (which looks at the whole input file). Affects Version 1.0.0-SNAPSHOT (4d88030486) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3770) The userFeatures RDD from MatrixFactorizationModel isn't accessible from the python bindings
[ https://issues.apache.org/jira/browse/SPARK-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157242#comment-14157242 ] Apache Spark commented on SPARK-3770: - User 'mdagost' has created a pull request for this issue: https://github.com/apache/spark/pull/2636 The userFeatures RDD from MatrixFactorizationModel isn't accessible from the python bindings Key: SPARK-3770 URL: https://issues.apache.org/jira/browse/SPARK-3770 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Michelangelo D'Agostino We need access to the underlying latent user features from python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the python bindings. I've fixed this with a PR that I'll submit shortly that adds a method to the underlying scala class to turn the RDD[(Int, Array[Double])] to an RDD[String]. This is then accessed from the python recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157263#comment-14157263 ] Tom Weber commented on SPARK-3769: -- Thanks for the quick turnaround! I can see that it wouldn't necessarily make sense that a fully qualified path (relative to the driver programs filesystem) would be what the .get method would take on the worker node systems. But, at the same time, the .get seems like it just takes what you give it and blindly concatenates it to the .getRootDirectory result w/out even validating it or failing if that file doesn't exist. I appreciate the File object methods for pulling the path name apart; I'll use that and that will work just fine. First time playing around with all of this, so sometimes what you expect it to do is just a matter of thinking about it a particular way :) You can close this ticket out, as I'm sure I'll be able to work fine by using the full path on the driver side and only the file name on the worker side. Seems like it might be convenient though if these matched set of routines did this themselves since the driver side needs a qualified path to find the file, and the worker side, by definition, strips that off and only put's the file in the designated work directory (which make sense of course). No big deal though. Thanks again, Tom SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc =new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
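Editorial note: a hedged Scala sketch of the workaround discussed above: add the file by its full driver-side path, then look it up on executors by bare file name. The local[2] master and the example path are assumptions for illustration.
{code}
import java.io.File
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object SparkFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparkfiles-sketch").setMaster("local[2]"))
    val path = "/opt/tom/SparkFiles.sas"              // full path, resolved on the driver
    sc.addFile(path)
    val name = new File(path).getName                 // "SparkFiles.sas"
    // On executors, look the file up by bare name; SparkFiles.get prepends the per-app work directory.
    val located = sc.parallelize(1 to 2).map(_ => SparkFiles.get(name)).collect()
    located.foreach(println)
    sc.stop()
  }
}
{code}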
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157351#comment-14157351 ] Marcelo Vanzin commented on SPARK-3633: --- Hey [~pwendell] [~matei], is anyone activelly looking at this issue? Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
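Editorial note: not a fix for the regression itself, but a commonly used mitigation sketch for "Too many open files" with the 1.1-era hash-based shuffle; treat the settings as assumptions to verify against the Spark 1.1 configuration docs.
{code}
import org.apache.spark.SparkConf

// Assumed era-specific settings, shown only to illustrate reducing simultaneously open shuffle files.
val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")  // hash shuffle: far fewer simultaneously open files
  .set("spark.shuffle.manager", "SORT")           // or switch to the sort-based shuffle entirely
// Raising the per-process file-descriptor limit on each worker (e.g. ulimit -n 65536) is the usual companion step.
{code}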
[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157389#comment-14157389 ] Josh Rosen commented on SPARK-3769: --- I think that {{SparkFiles.get()}} can be called from driver code, too, so that's one option if you'd like to achieve consistency between driver and executor code. SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc =new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path
[ https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3769. --- Resolution: Not a Problem SparkFiles.get gives me the wrong fully qualified path -- Key: SPARK-3769 URL: https://issues.apache.org/jira/browse/SPARK-3769 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Environment: linux host, and linux grid. Reporter: Tom Weber Priority: Minor My spark pgm running on my host, (submitting work to my grid). JavaSparkContext sc =new JavaSparkContext(conf); final String path = args[1]; sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */ The log shows: 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986 those are paths on my host machine. The location that this file gets on grid nodes is: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas While the call to get the path in my code that runs in my mapPartitions function on the grid nodes is: String pgm = SparkFiles.get(path); And this returns the following string: /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas So, am I expected to take the qualified path that was given to me and parse it to get only the file name at the end, and then concatenate that to what I get from the SparkFiles.getRootDirectory() call in order to get this to work? Or pass only the parsed file name to the SparkFiles.get method? Seems as though I should be able to pass the same file specification to both sc.addFile() and SparkFiles.get() and get the correct location of the file. Thanks, Tom -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-1671) Cached tables should follow write-through policy
[ https://issues.apache.org/jira/browse/SPARK-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-1671: --- Assignee: Michael Armbrust Cached tables should follow write-through policy Key: SPARK-1671 URL: https://issues.apache.org/jira/browse/SPARK-1671 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Assignee: Michael Armbrust Labels: cache, column Writing (insert / load) to a cached table causes cache inconsistency, and users have to unpersist and re-cache the whole table. The write-through policy may be implemented with {{RDD.union}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
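Editorial note: a hedged conceptual sketch of a write-through cache built on {{RDD.union}}, as the description suggests; {{WriteThroughCache}} is a hypothetical illustration, not the eventual implementation.
{code}
import org.apache.spark.rdd.RDD

// Hypothetical helper: keeps a cached RDD consistent by unioning inserted rows onto it.
class WriteThroughCache[T](initial: RDD[T]) {
  private var cached: RDD[T] = initial.cache()

  def data: RDD[T] = cached

  def insert(newRows: RDD[T]): Unit = {
    cached = cached.union(newRows).cache()   // write through: no unpersist-and-recache of the whole table
  }
}
{code}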
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157418#comment-14157418 ] Takuya Ueshin commented on SPARK-3764: -- {{AppendingParquetOutputFormat}} is using {{TaskAttemptContext}}, which is a class in [hadoop-1|https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html] but is an interface in [hadoop-2|http://hadoop.apache.org/docs/r2.5.1/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html], so the {{context.getTaskAttemptID}} is source-compatible but not binary-compatible. If Spark itself is built against hadoop-1, the artifact is for only hadoop-1, and vice versa. Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3771) AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility.
Takuya Ueshin created SPARK-3771: Summary: AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility. Key: SPARK-3771 URL: https://issues.apache.org/jira/browse/SPARK-3771 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Takuya Ueshin Original problem is [SPARK-3764|https://issues.apache.org/jira/browse/SPARK-3764]. {{AppendingParquetOutputFormat}} uses a binary-incompatible method {{context.getTaskAttemptID}}. This makes Spark itself binary-incompatible, i.e. if Spark is built against hadoop-1, the artifact works only with hadoop-1, and vice versa. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
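Editorial note: a minimal sketch of the reflection approach the summary names; the real change lives in AppendingParquetOutputFormat, so this standalone helper is only illustrative.
{code}
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

object TaskAttemptIdSketch {
  // Look the method up at runtime, so the compiled call site does not bake in whether
  // TaskAttemptContext is a class (hadoop-1) or an interface (hadoop-2).
  def getTaskAttemptId(context: TaskAttemptContext): TaskAttemptID = {
    val method = context.getClass.getMethod("getTaskAttemptID")
    method.invoke(context).asInstanceOf[TaskAttemptID]
  }
}
{code}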
[jira] [Commented] (SPARK-3771) AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility.
[ https://issues.apache.org/jira/browse/SPARK-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157441#comment-14157441 ] Apache Spark commented on SPARK-3771: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2638 AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility. Key: SPARK-3771 URL: https://issues.apache.org/jira/browse/SPARK-3771 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Takuya Ueshin Original problem is [SPARK-3764|https://issues.apache.org/jira/browse/SPARK-3764]. {{AppendingParquetOutputFormat}} uses a binary-incompatible method {{context.getTaskAttemptID}}. This causes binary-incompatible of Spark itself, i.e. if Spark itself is built against hadoop-1, the artifact is for only hadoop-1, and vice versa. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157447#comment-14157447 ] Takuya Ueshin commented on SPARK-3764: -- I filed a new issue SPARK-3771 and close this. Thanks, [~srowen]! Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.
[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin closed SPARK-3764. Resolution: Fixed Invalid dependencies of artifacts in Maven Central Repository. -- Key: SPARK-3764 URL: https://issues.apache.org/jira/browse/SPARK-3764 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Takuya Ueshin While testing my spark applications locally using spark artifacts downloaded from Maven Central, the following exception was thrown: {quote} ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {quote} This is because the hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2. I guess the spark artifacts in Maven Central were built against hadoop-2 with Maven, but the depending version of hadoop in {{pom.xml}} remains 1.0.4, so the hadoop version mismatch is happend. FYI: sbt seems to publish 'effective pom'-like pom file, so the dependencies are correctly resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
cocoatomo created SPARK-3772: Summary: RDD operation on IPython REPL failed with an illegal port number Key: SPARK-3772 URL: https://issues.apache.org/jira/browse/SPARK-3772 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo To reproduce this issue, we should execute following commands. {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: port out of range:1027423549 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.init(InetSocketAddress.java:188) at java.net.Socket.init(Socket.java:244) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3572) Support register UserType in SQL
[ https://issues.apache.org/jira/browse/SPARK-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3572: - Assignee: Joseph K. Bradley Support register UserType in SQL Key: SPARK-3572 URL: https://issues.apache.org/jira/browse/SPARK-3572 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Assignee: Joseph K. Bradley If a user knows how to map a class to a struct type in Spark SQL, he should be able to register this mapping through sqlContext and hence SQL can figure out the schema automatically. {code} trait RowSerializer[T] { def dataType: StructType def serialize(obj: T): Row def deserialize(row: Row): T } sqlContext.registerUserType[T](clazz: classOf[T], serializer: classOf[RowSerializer[T]]) {code} In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result. {code} sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer]) val points: RDD[LabeledPoint] = ... val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) = v } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
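Editorial note: a hedged sketch of what a user-supplied serializer for the *proposed* API might look like; {{RowSerializer}} here follows the proposal above, and Spark SQL's Row/StructType are replaced by simplified stand-ins for illustration.
{code}
// Simplified stand-ins: the real proposal uses Spark SQL's Row and StructType.
case class DenseVector(values: Array[Double])

trait RowSerializer[T] {
  def dataType: String                  // stand-in for StructType
  def serialize(obj: T): Seq[Any]       // stand-in for Row
  def deserialize(row: Seq[Any]): T
}

class VectorRowSerializer extends RowSerializer[DenseVector] {
  def dataType: String = "struct<values:array<double>>"
  def serialize(v: DenseVector): Seq[Any] = Seq(v.values.toSeq)
  def deserialize(row: Seq[Any]): DenseVector =
    DenseVector(row.head.asInstanceOf[Seq[Double]].toArray)
}
{code}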
[jira] [Updated] (SPARK-2066) Better error message for non-aggregated attributes with aggregates
[ https://issues.apache.org/jira/browse/SPARK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2066: Priority: Critical (was: Major) Better error message for non-aggregated attributes with aggregates -- Key: SPARK-2066 URL: https://issues.apache.org/jira/browse/SPARK-2066 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Cheng Lian Priority: Critical [~marmbrus] Run the following query {code} scala c.hql(select key, count(*) from src).collect() {code} Got the following exception at runtime {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: AttributeReference, tree: key#61 at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:157) at org.apache.spark.sql.catalyst.expressions.Projection.apply(Projection.scala:35) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:154) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558) at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:558) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This should either fail in analysis time, or pass at runtime. Definitely shouldn't fail at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
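Editorial note: for contrast, the well-formed version of the query simply groups by the non-aggregated column ({{c}} is the HiveContext from the snippet above); the JIRA asks that the malformed form be rejected at analysis time with a clear message instead of failing at runtime.
{code}
// `c` is the HiveContext from the snippet in the description.
val grouped = c.hql("SELECT key, count(*) FROM src GROUP BY key").collect()
{code}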
[jira] [Closed] (SPARK-3755) Do not bind port 1 - 1024 to server in spark
[ https://issues.apache.org/jira/browse/SPARK-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3755. Resolution: Fixed Fix Version/s: 1.1.1 Do not bind port 1 - 1024 to server in spark Key: SPARK-3755 URL: https://issues.apache.org/jira/browse/SPARK-3755 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Fix For: 1.1.1, 1.2.0 Non-root user use port 1- 1024 to start jetty server will get the exception java.net.SocketException: Permission denied -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
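Editorial note: a hedged sketch of the kind of guard this implies, with hypothetical naming; the actual patch may differ.
{code}
object PortCheckSketch {
  // Port 0 means "let the OS pick"; ports 1-1024 need root on Linux, so reject them up front
  // instead of failing later with java.net.SocketException: Permission denied.
  def validateUserPort(port: Int): Int = {
    require(port == 0 || port > 1024, s"Port $port is privileged; use 0 or a port above 1024")
    port
  }
}
{code}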
[jira] [Created] (SPARK-3773) Sphinx build warnings
cocoatomo created SPARK-3773: Summary: Sphinx build warnings Key: SPARK-3773 URL: https://issues.apache.org/jira/browse/SPARK-3773 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0, Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, docutils==0.12, numpy==1.9.0 Reporter: cocoatomo Priority: Minor When building Sphinx documents for PySpark, we have 12 warnings. Their causes are almost docstrings in broken ReST format. To reproduce this issue, we should run following commands. {quote} $ cd ./python/docs $ make clean html ... /Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation. ... checking consistency... /Users/user/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree ... copying static files... WARNING: html_static_path entry u'/Users/user/MyRepos/Scala/spark/python/docs/_static' does not exist ... build succeeded, 12 warnings. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3759) SparkSubmitDriverBootstrapper should return exit code of driver process
[ https://issues.apache.org/jira/browse/SPARK-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3759. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Eric Eijkelenboom Target Version/s: 1.1.1, 1.2.0 SparkSubmitDriverBootstrapper should return exit code of driver process --- Key: SPARK-3759 URL: https://issues.apache.org/jira/browse/SPARK-3759 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Environment: Linux, Windows, Scala/Java Reporter: Eric Eijkelenboom Assignee: Eric Eijkelenboom Priority: Minor Fix For: 1.1.1, 1.2.0 Original Estimate: 24h Remaining Estimate: 24h SparkSubmitDriverBootstrapper.scala currently always returns exit code 0. Instead, it should return the exit code of the driver process. Suggested code change in SparkSubmitDriverBootstrapper, line 157: {code} val returnCode = process.waitFor() sys.exit(returnCode) {code} Workaround for this issue: Instead of specifying 'driver.extra*' properties in spark-defaults.conf, pass these properties to spark-submit directly. This will launch the driver program without the use of SparkSubmitDriverBootstrapper. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
[ https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157548#comment-14157548 ] Josh Rosen commented on SPARK-3772: --- Can you post the SHA of the commit that you were using when you saw this? RDD operation on IPython REPL failed with an illegal port number Key: SPARK-3772 URL: https://issues.apache.org/jira/browse/SPARK-3772 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark To reproduce this issue, we should execute following commands. {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: port out of range:1027423549 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.init(InetSocketAddress.java:188) at java.net.Socket.init(Socket.java:244) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) 
{quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
[ https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cocoatomo updated SPARK-3772: - Description: To reproduce this issue, we should execute following commands on the commit: 6e27cb630de69fa5acb510b4e2f6b980742b1957. {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: port out of range:1027423549 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.init(InetSocketAddress.java:188) at java.net.Socket.init(Socket.java:244) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) {quote} was: To reproduce this issue, we should execute following commands. {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running
[jira] [Commented] (SPARK-3773) Sphinx build warnings
[ https://issues.apache.org/jira/browse/SPARK-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157590#comment-14157590 ] cocoatomo commented on SPARK-3773: -- Using Sphinx to generate API docs for PySpark Sphinx build warnings - Key: SPARK-3773 URL: https://issues.apache.org/jira/browse/SPARK-3773 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0, Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, docutils==0.12, numpy==1.9.0 Reporter: cocoatomo Priority: Minor Labels: docs, docstrings, pyspark When building Sphinx documents for PySpark, we have 12 warnings. Their causes are almost docstrings in broken ReST format. To reproduce this issue, we should run following commands on the commit: 6e27cb630de69fa5acb510b4e2f6b980742b1957. {quote} $ cd ./python/docs $ make clean html ... /Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/user/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD /Users/user/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation. ... checking consistency... /Users/user/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree ... copying static files... WARNING: html_static_path entry u'/Users/user/MyRepos/Scala/spark/python/docs/_static' does not exist ... build succeeded, 12 warnings. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3771) AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility.
[ https://issues.apache.org/jira/browse/SPARK-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-3771: - Summary: AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility. (was: AppendingParquetOutputFormat should use reflection to prevent breaking binary-compatibility.) AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility. - Key: SPARK-3771 URL: https://issues.apache.org/jira/browse/SPARK-3771 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Takuya Ueshin Original problem is [SPARK-3764|https://issues.apache.org/jira/browse/SPARK-3764]. {{AppendingParquetOutputFormat}} uses a binary-incompatible method {{context.getTaskAttemptID}}. This causes binary-incompatible of Spark itself, i.e. if Spark itself is built against hadoop-1, the artifact is for only hadoop-1, and vice versa. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3774) typo comment in bin/utils.sh
[ https://issues.apache.org/jira/browse/SPARK-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157593#comment-14157593 ] Apache Spark commented on SPARK-3774: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/2639 typo comment in bin/utils.sh Key: SPARK-3774 URL: https://issues.apache.org/jira/browse/SPARK-3774 Project: Spark Issue Type: Improvement Components: PySpark, Spark Shell Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Priority: Trivial typo comment in bin/utils.sh {code} # Gather all all spark-submit options into SUBMISSION_OPTS {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
[ https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157602#comment-14157602 ] Josh Rosen commented on SPARK-3772: --- Ah, I see the problem: PythonWorkerFactory also passes the -u flag when creating Python workers and daemons, which doesn't work in IPython. I noticed one of the uses of -u when reviewing your PR, but missed these uses: https://github.com/apache/spark/blob/42d5077fd3f2c37d1cd23f4c81aa89286a74cb40/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L111 https://github.com/apache/spark/blob/42d5077fd3f2c37d1cd23f4c81aa89286a74cb40/core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala#L152 I guess we need to apply the same PYTHONUNBUFFERED fix here, too. Sorry for overlooking this. It's strange that PythonWorkerFactory didn't report this error in a more graceful way, though. I think that IPython printed an error message before our code had a chance to run, and in startDaemon() we expected to read an integer from stdout but instead received text. There are probably less brittle mechanisms for communicating the daemon process's port to its parent (SPARK-2313 is an issue that partially addresses this). I can fix this and open a PR. If you'd like to do it yourself, just let me know and I'd be glad to review it. RDD operation on IPython REPL failed with an illegal port number Key: SPARK-3772 URL: https://issues.apache.org/jira/browse/SPARK-3772 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0 Reporter: cocoatomo Labels: pyspark To reproduce this issue, we should execute following commands on the commit: 6e27cb630de69fa5acb510b4e2f6b980742b1957. {quote} $ PYSPARK_PYTHON=ipython ./bin/pyspark ... In [1]: file = sc.textFile('README.md') In [2]: file.first() ... 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at PythonRDD.scala:334 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at PythonRDD.scala:334) with 1 output partitions (allowLocal=true) 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at PythonRDD.scala:334) 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List() 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List() 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44), which has no missing parents 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with curMem=57388, maxMem=278019440 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 265.1 MB) 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[2] at RDD at PythonRDD.scala:44) 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1207 bytes) 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: port out of range:1027423549 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.init(InetSocketAddress.java:188) at java.net.Socket.init(Socket.java:244) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
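Editorial note: a hedged sketch of the change described in the comment above: build the daemon/worker command without the "-u" flag and force unbuffered output through the environment instead. The builder details and names are assumptions; only the PYTHONUNBUFFERED idea comes from the comment.
{code}
import scala.collection.JavaConverters._

object PythonDaemonSketch {
  // Build the daemon command without "-u" and force unbuffered output via the environment.
  def buildPythonProcess(pythonExec: String, module: String, extraEnv: Map[String, String]): ProcessBuilder = {
    val pb = new ProcessBuilder(List(pythonExec, "-m", module).asJava)   // e.g. module = "pyspark.daemon"
    val env = pb.environment()
    extraEnv.foreach { case (k, v) => env.put(k, v) }
    env.put("PYTHONUNBUFFERED", "YES")   // same effect as "-u", but ipython accepts it
    pb
  }
}
{code}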
[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number
[ https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157604#comment-14157604 ] Josh Rosen commented on SPARK-3772:
---
The reason that we never hit this before is that setting IPYTHON=1 caused Spark to use IPython only on the master; the workers and daemons were still launched through regular `python`. The old behavior might actually be preferable from a performance standpoint, since `ipython` can take longer to start up (this is less of an issue nowadays thanks to the worker re-use patch).
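To make that split concrete, here is a minimal sketch in Scala; the names are assumed for illustration and are not taken from the Spark scripts or source.
{code}
// Sketch: two independent interpreter choices.
// Old scheme: IPYTHON=1 only switched the driver REPL over to IPython.
val driverShell = if (sys.env.getOrElse("IPYTHON", "") == "1") "ipython" else "python"
// Workers and daemons are launched on the executor side using PYSPARK_PYTHON,
// which previously stayed at plain "python" even when IPYTHON=1 was set.
val workerPython = sys.env.getOrElse("PYSPARK_PYTHON", "python")
{code}
With PYSPARK_PYTHON=ipython, as in the report above, the worker-side lookup now also resolves to IPython, which is how the worker and daemon launch path started tripping over the -u flag.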
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org