[jira] [Created] (SPARK-4337) Add ability to cancel pending requests to YARN
Sandy Ryza created SPARK-4337: - Summary: Add ability to cancel pending requests to YARN Key: SPARK-4337 URL: https://issues.apache.org/jira/browse/SPARK-4337 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza This will be useful for things like SPARK-4136 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need
[ https://issues.apache.org/jira/browse/SPARK-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206098#comment-14206098 ] Apache Spark commented on SPARK-4214: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3204 > With dynamic allocation, avoid outstanding requests for more executors than > pending tasks need > -- > > Key: SPARK-4214 > URL: https://issues.apache.org/jira/browse/SPARK-4214 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > Dynamic allocation tries to allocate more executors while we have pending > tasks remaining. Our current policy can end up with more outstanding > executor requests than needed to fulfill all the pending tasks. Capping the > executor requests to the number of cores needed to fulfill all pending tasks > would make dynamic allocation behavior less sensitive to settings for > maxExecutors.
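The capping policy described in SPARK-4214 can be sketched in a few lines. This is an illustrative model only (the function name and parameters are hypothetical, not Spark's actual API): outstanding requests are bounded by the number of executors needed to run every pending task at once.

```python
import math

def cap_executor_requests(pending_tasks, cpus_per_task,
                          cores_per_executor, max_executors):
    # Cores needed to run every pending task simultaneously.
    cores_needed = pending_tasks * cpus_per_task
    # Executors needed to provide that many cores, still bounded by the
    # user-configured maximum.
    executors_needed = math.ceil(cores_needed / cores_per_executor)
    return min(executors_needed, max_executors)
```

With 10 pending single-core tasks and 4-core executors, only 3 executors are requested no matter how high maxExecutors is set, which is what makes the behavior less sensitive to that setting.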
[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
[ https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206081#comment-14206081 ] Jeff Hammerbacher commented on SPARK-3647: -- Should this issue have a Fix version set? > Shaded Guava patch causes access issues with package private classes > > > Key: SPARK-3647 > URL: https://issues.apache.org/jira/browse/SPARK-3647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > > The patch that introduced shading to Guava (SPARK-2848) tried to maintain > backwards compatibility in the Java API by not relocating the "Optional" > class. That causes problems when that class references package private > members in the Absent and Present classes, which are now in a different > package: > {noformat} > Exception in thread "main" java.lang.IllegalAccessError: tried to access > class org.spark-project.guava.common.base.Present from class > com.google.common.base.Optional > at com.google.common.base.Optional.of(Optional.java:86) > at > org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) > at > org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) > {noformat}
[jira] [Commented] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark
[ https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206036#comment-14206036 ] Xiangrui Meng commented on SPARK-3990: -- In Spark 1.1.1, I put a note on ALS asking users to use the default serializer if they want to run ALS. > kryo.KryoException caused by ALS.trainImplicit in pyspark > - > > Key: SPARK-3990 > URL: https://issues.apache.org/jira/browse/SPARK-3990 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.1.0 > Environment: 5-slave cluster (m3.large) in AWS launched by spark-ec2 > Linux > Python 2.6.8 >Reporter: Gen TANG >Assignee: Xiangrui Meng >Priority: Critical > Labels: test > Fix For: 1.1.1, 1.2.0 > > > When we tried ALS.trainImplicit() in the pyspark environment, it only works > for iterations = 1. Stranger still, the same code works very well in Scala. > (I ran several tests; so far, ALS.trainImplicit works in Scala.) > For example, the following code: > {code:title=test.py|borderStyle=solid} > from pyspark.mllib.recommendation import * > r1 = (1, 1, 1.0) > r2 = (1, 2, 2.0) > r3 = (2, 1, 2.0) > ratings = sc.parallelize([r1, r2, r3]) > model = ALS.trainImplicit(ratings, 1) > '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)''' > {code} > It causes a failed stage at count at ALS.scala:314 with the following info: > {code:title=error information provided by ganglia} > Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 90.0 (TID 484, > ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: > java.lang.ArrayStoreException: scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > 
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.Thr
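The workaround Xiangrui refers to is switching ALS jobs back to the default JavaSerializer instead of Kryo. A minimal sketch (the helper function name is hypothetical; the conf key and serializer class name are real Spark identifiers):

```python
def als_safe_conf(conf):
    # Return a copy of the conf with Kryo replaced by the default Java
    # serializer, side-stepping the KryoException raised while reading
    # OutLinkBlock.shouldSend.
    safe = dict(conf)
    safe["spark.serializer"] = "org.apache.spark.serializer.JavaSerializer"
    return safe
```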
[jira] [Commented] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206026#comment-14206026 ] Apache Spark commented on SPARK-4314: - User 'maji2014' has created a pull request for this issue: https://github.com/apache/spark/pull/3203 > Exception when textFileStream attempts to read deleted _COPYING_ file > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014 > > [Reproduce] > 1. Run the HdfsWordCount example, e.g. "ssc.textFileStream(args(0))" > 2. Upload a file to HDFS (the exception below shows why this triggers the bug) > 3. The following exception occurs. > [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection
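One way to avoid the race in SPARK-4314 is to skip files that still carry HDFS's in-flight copy suffix: `hdfs dfs -put` writes to `<name>._COPYING_` and renames the file on completion, so a file listed with that suffix may vanish before the batch reads it. A hedged sketch of such a filter (the predicate name is hypothetical, not Spark's actual code):

```python
def is_ready_file(path):
    # Reject paths that still carry the HDFS in-flight copy suffix; they are
    # renamed (and the temporary name deleted) once the copy finishes.
    return not path.endswith("._COPYING_")
```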
[jira] [Commented] (SPARK-4336) auto detect type from json string
[ https://issues.apache.org/jira/browse/SPARK-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206025#comment-14206025 ] Apache Spark commented on SPARK-4336: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3202 > auto detect type from json string > - > > Key: SPARK-4336 > URL: https://issues.apache.org/jira/browse/SPARK-4336 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Adrian Wang >Priority: Minor >
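The feature requested in SPARK-4336 amounts to trying progressively wider types for each string value. A minimal sketch (the function name is hypothetical; Spark SQL's real inference also covers decimals, timestamps, and nested structures):

```python
def detect_type(s):
    # Try the narrowest numeric type first, then widen, then boolean,
    # falling back to string.
    try:
        int(s)
        return "integer"
    except ValueError:
        pass
    try:
        float(s)
        return "double"
    except ValueError:
        pass
    if s.lower() in ("true", "false"):
        return "boolean"
    return "string"
```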
[jira] [Commented] (SPARK-4335) Mima check misreporting for GraphX pull request
[ https://issues.apache.org/jira/browse/SPARK-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206019#comment-14206019 ] Apache Spark commented on SPARK-4335: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/3201 > Mima check misreporting for GraphX pull request > --- > > Key: SPARK-4335 > URL: https://issues.apache.org/jira/browse/SPARK-4335 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.2.0 >Reporter: Reynold Xin >Assignee: Prashant Sharma >Priority: Blocker > > Mima is reporting binary check failure for RDD/partitions in the following > pull request even though the pull request doesn't change those at all. > https://github.com/apache/spark/pull/2530 > {code} > [error] * abstract method getPartitions()Array[org.apache.spark.Partition] > in class org.apache.spark.rdd.RDD does not have a correspondent in new version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") > [error] * abstract method > compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator > in class org.apache.spark.rdd.RDD does not have a correspondent in new > version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") > [error] * abstract method getPartitions()Array[org.apache.spark.Partition] > in class org.apache.spark.rdd.RDD does not have a correspondent in new version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") > [error] * abstract method > compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator > in class org.apache.spark.rdd.RDD does not have a correspondent in new > version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") > [info] Done updating. 
> {code}
[jira] [Closed] (SPARK-4324) Support numpy/scipy in all Python API of MLlib
[ https://issues.apache.org/jira/browse/SPARK-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-4324. Resolution: Fixed Fix Version/s: 1.2.0 > Support numpy/scipy in all Python API of MLlib > -- > > Key: SPARK-4324 > URL: https://issues.apache.org/jira/browse/SPARK-4324 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.0 > > > Before 1.2, we accepted numpy/scipy arrays as Vector in the MLlib Python > API. We should continue to support these semantics in 1.2.
[jira] [Created] (SPARK-4336) auto detect type from json string
Adrian Wang created SPARK-4336: -- Summary: auto detect type from json string Key: SPARK-4336 URL: https://issues.apache.org/jira/browse/SPARK-4336 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Priority: Minor
[jira] [Commented] (SPARK-2652) Tuning default configurations for PySpark
[ https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206013#comment-14206013 ] Xiangrui Meng commented on SPARK-2652: -- We kept Kryo as the default serializer in 1.1.1. > Tuning default configurations for PySpark > -- > > Key: SPARK-2652 > URL: https://issues.apache.org/jira/browse/SPARK-2652 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Labels: Configuration, Python > Fix For: 1.2.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > Some configuration defaults do not make sense for PySpark; change them to > reasonable ones, such as spark.serializer and > spark.kryo.referenceTracking
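The idea behind SPARK-2652 is to layer PySpark-specific overrides on top of the base defaults, with explicit user settings winning over both. A sketch with illustrative keys and values (these are real Spark conf names, but the override choices and function are assumptions, not the exact defaults Spark ships):

```python
BASE_DEFAULTS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.referenceTracking": "true",
}

# Hypothetical PySpark-specific overrides: PySpark records reach the JVM as
# opaque pickled bytes, so Kryo reference tracking buys little there.
PYSPARK_OVERRIDES = {
    "spark.kryo.referenceTracking": "false",
}

def effective_conf(user_conf=None):
    # Precedence: user settings > PySpark overrides > base defaults.
    conf = dict(BASE_DEFAULTS)
    conf.update(PYSPARK_OVERRIDES)
    conf.update(user_conf or {})
    return conf
```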
[jira] [Updated] (SPARK-2652) Tuning default configurations for PySpark
[ https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2652: - Target Version/s: 1.2.0 (was: 1.1.0) > Tuning default configurations for PySpark > -- > > Key: SPARK-2652 > URL: https://issues.apache.org/jira/browse/SPARK-2652 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Labels: Configuration, Python > Fix For: 1.2.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > Some configuration defaults do not make sense for PySpark; change them to > reasonable ones, such as spark.serializer and > spark.kryo.referenceTracking
[jira] [Updated] (SPARK-2652) Tuning default configurations for PySpark
[ https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2652: - Fix Version/s: (was: 1.1.1) > Tuning default configurations for PySpark > -- > > Key: SPARK-2652 > URL: https://issues.apache.org/jira/browse/SPARK-2652 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Labels: Configuration, Python > Fix For: 1.2.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > Some configuration defaults do not make sense for PySpark; change them to > reasonable ones, such as spark.serializer and > spark.kryo.referenceTracking
[jira] [Updated] (SPARK-4330) Link to proper URL for YARN overview
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4330: - Assignee: Kousuke Saruta > Link to proper URL for YARN overview > > > Key: SPARK-4330 > URL: https://issues.apache.org/jira/browse/SPARK-4330 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > running-on-yarn.md contains a link to the YARN overview, but the URL points > to the YARN alpha docs. It should point to the stable docs.
[jira] [Resolved] (SPARK-4330) Link to proper URL for YARN overview
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4330. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Target Version/s: (was: 1.3.0) > Link to proper URL for YARN overview > > > Key: SPARK-4330 > URL: https://issues.apache.org/jira/browse/SPARK-4330 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > running-on-yarn.md contains a link to the YARN overview, but the URL points > to the YARN alpha docs. It should point to the stable docs.
[jira] [Updated] (SPARK-4253) Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-4253: --- Summary: Ignore spark.driver.host in yarn-cluster and standalone-cluster mode (was: Ignore spark.driver.host in yarn-cluster mode) > Ignore spark.driver.host in yarn-cluster and standalone-cluster mode > > > Key: SPARK-4253 > URL: https://issues.apache.org/jira/browse/SPARK-4253 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > Attachments: Cannot assign requested address.txt > > > We don't actually know where the driver will be before it is launched in > yarn-cluster mode. If we set the spark.driver.host property, Spark will > create an Actor on the hostname or IP as set, which leads to an error. > So we should ignore this config item in yarn-cluster mode. > As [~joshrosen] pointed out, we should also ignore it in standalone-cluster > mode.
[jira] [Updated] (SPARK-4253) Ignore spark.driver.host in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-4253: --- Description: We don't actually know where the driver will be before it is launched in yarn-cluster mode. If we set the spark.driver.host property, Spark will create an Actor on the hostname or IP as set, which leads to an error. So we should ignore this config item in yarn-cluster mode. As [~joshrosen] pointed out, we should also ignore it in standalone-cluster mode. was: We don't actually know where the driver will be before it is launched in yarn-cluster mode. If we set the spark.driver.host property, Spark will create an Actor on the hostname or IP as set, which leads to an error. So we should ignore this config item in yarn-cluster mode. > Ignore spark.driver.host in yarn-cluster mode > - > > Key: SPARK-4253 > URL: https://issues.apache.org/jira/browse/SPARK-4253 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > Attachments: Cannot assign requested address.txt > > > We don't actually know where the driver will be before it is launched in > yarn-cluster mode. If we set the spark.driver.host property, Spark will > create an Actor on the hostname or IP as set, which leads to an error. > So we should ignore this config item in yarn-cluster mode. > As [~joshrosen] pointed out, we should also ignore it in standalone-cluster > mode.
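The behavior proposed in SPARK-4253 can be sketched as a conf sanitizer that drops the host setting whenever the driver's placement is decided by the cluster manager (the function and mode names are illustrative, not Spark's internals; spark.driver.host is the real conf key):

```python
CLUSTER_DEPLOY_MODES = {"yarn-cluster", "standalone-cluster"}

def sanitize_conf(conf, deploy_mode):
    # The driver's host is unknown before launch in cluster deploy modes, so
    # a user-supplied spark.driver.host would bind an Actor to the wrong
    # address ("Cannot assign requested address"). Drop it there; keep it in
    # client modes, where the driver runs on the submitting machine.
    if deploy_mode in CLUSTER_DEPLOY_MODES:
        return {k: v for k, v in conf.items() if k != "spark.driver.host"}
    return dict(conf)
```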
[jira] [Commented] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205975#comment-14205975 ] Apache Spark commented on SPARK-3974: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/3200 > Block matrix abstractions and partitioners > -- > > Key: SPARK-3974 > URL: https://issues.apache.org/jira/browse/SPARK-3974 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh >Assignee: Burak Yavuz > > We need abstractions for block matrices with fixed block sizes, with each > block being dense. Partitioners along both rows and columns are required.
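A grid partitioner for fixed-size dense blocks only needs to map a block's (row, column) coordinate to a partition id. A hypothetical sketch of the idea (not the eventual MLlib implementation):

```python
class GridPartitioner:
    def __init__(self, num_block_rows, num_block_cols, num_partitions):
        self.num_block_cols = num_block_cols
        self.num_partitions = num_partitions

    def partition(self, block_row, block_col):
        # Row-major block index folded onto the available partitions, so
        # consecutive blocks along a row land on distinct partitions.
        return (block_row * self.num_block_cols + block_col) % self.num_partitions
```

Separate row-only and column-only partitioners would follow the same shape, keying on just `block_row` or just `block_col`.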
[jira] [Updated] (SPARK-4335) Mima check misreporting for GraphX pull request
[ https://issues.apache.org/jira/browse/SPARK-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4335: --- Assignee: Prashant Sharma > Mima check misreporting for GraphX pull request > --- > > Key: SPARK-4335 > URL: https://issues.apache.org/jira/browse/SPARK-4335 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.2.0 >Reporter: Reynold Xin >Assignee: Prashant Sharma >Priority: Blocker > > Mima is reporting binary check failure for RDD/partitions in the following > pull request even though the pull request doesn't change those at all. > https://github.com/apache/spark/pull/2530 > {code} > [error] * abstract method getPartitions()Array[org.apache.spark.Partition] > in class org.apache.spark.rdd.RDD does not have a correspondent in new version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") > [error] * abstract method > compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator > in class org.apache.spark.rdd.RDD does not have a correspondent in new > version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") > [error] * abstract method getPartitions()Array[org.apache.spark.Partition] > in class org.apache.spark.rdd.RDD does not have a correspondent in new version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") > [error] * abstract method > compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator > in class org.apache.spark.rdd.RDD does not have a correspondent in new > version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") > [info] Done updating. 
> {code}
[jira] [Updated] (SPARK-3936) Incorrect result in GraphX BytecodeUtils with closures + class/object methods
[ https://issues.apache.org/jira/browse/SPARK-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3936: --- Priority: Blocker (was: Major) > Incorrect result in GraphX BytecodeUtils with closures + class/object methods > - > > Key: SPARK-3936 > URL: https://issues.apache.org/jira/browse/SPARK-3936 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Pedro Rodriguez >Assignee: Ankur Dave >Priority: Blocker > > There seems to be a bug with the GraphX byte code inspection, specifically in > BytecodeUtils. > These are the unit tests I wrote to expose the problem: > https://github.com/EntilZha/spark/blob/a3c38a8329545c034fae2458df134fa3829d08fb/graphx/src/test/scala/org/apache/spark/graphx/util/BytecodeUtilsSuite.scala#L93-L121 > The first two tests pass, the second two tests fail. This exposes a problem > with inspection of methods in closures, in this case within maps. > Specifically, it seems like there is a problem with inspection of non-inline > methods in a closure.
[jira] [Resolved] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3649. Resolution: Fixed Fix Version/s: 1.2.0 > ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > Fix For: 1.2.0 > > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. Therefore it may > be better to simply remove the custom serializers. 
[jira] [Updated] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD
[ https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3666: --- Priority: Blocker (was: Major) > Extract interfaces for EdgeRDD and VertexRDD > > > Key: SPARK-3666 > URL: https://issues.apache.org/jira/browse/SPARK-3666 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Blocker >
[jira] [Updated] (SPARK-4335) Mima check misreporting for GraphX pull request
[ https://issues.apache.org/jira/browse/SPARK-4335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4335: --- Priority: Blocker (was: Major) > Mima check misreporting for GraphX pull request > --- > > Key: SPARK-4335 > URL: https://issues.apache.org/jira/browse/SPARK-4335 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.2.0 >Reporter: Reynold Xin >Priority: Blocker > > Mima is reporting binary check failure for RDD/partitions in the following > pull request even though the pull request doesn't change those at all. > https://github.com/apache/spark/pull/2530 > {code} > [error] * abstract method getPartitions()Array[org.apache.spark.Partition] > in class org.apache.spark.rdd.RDD does not have a correspondent in new version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") > [error] * abstract method > compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator > in class org.apache.spark.rdd.RDD does not have a correspondent in new > version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") > [error] * abstract method getPartitions()Array[org.apache.spark.Partition] > in class org.apache.spark.rdd.RDD does not have a correspondent in new version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") > [error] * abstract method > compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator > in class org.apache.spark.rdd.RDD does not have a correspondent in new > version > [error]filter with: > ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") > [info] Done updating. 
> {code}
[jira] [Created] (SPARK-4335) Mima check misreporting for GraphX pull request
Reynold Xin created SPARK-4335: -- Summary: Mima check misreporting for GraphX pull request Key: SPARK-4335 URL: https://issues.apache.org/jira/browse/SPARK-4335 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Reynold Xin Mima is reporting binary check failure for RDD/partitions in the following pull request even though the pull request doesn't change those at all. https://github.com/apache/spark/pull/2530 {code} [error] * abstract method getPartitions()Array[org.apache.spark.Partition] in class org.apache.spark.rdd.RDD does not have a correspondent in new version [error]filter with: ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") [error] * abstract method compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator in class org.apache.spark.rdd.RDD does not have a correspondent in new version [error]filter with: ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") [error] * abstract method getPartitions()Array[org.apache.spark.Partition] in class org.apache.spark.rdd.RDD does not have a correspondent in new version [error]filter with: ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.getPartitions") [error] * abstract method compute(org.apache.spark.Partition,org.apache.spark.TaskContext)scala.collection.Iterator in class org.apache.spark.rdd.RDD does not have a correspondent in new version [error]filter with: ProblemFilters.exclude[AbstractMethodProblem]("org.apache.spark.rdd.RDD.compute") [info] Done updating. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4334) Utils.startServiceOnPort should check whether the tryPort is less than 1024
[ https://issues.apache.org/jira/browse/SPARK-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YanTang Zhai resolved SPARK-4334. - Resolution: Duplicate > Utils.startServiceOnPort should check whether the tryPort is less than 1024 > --- > > Key: SPARK-4334 > URL: https://issues.apache.org/jira/browse/SPARK-4334 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: YanTang Zhai >Priority: Minor > > Utils.startServiceOnPort should check whether the tryPort is less than 1024. > If the next attempted port is less than 1024, a SocketException with "Permission > denied" will be thrown.
[jira] [Created] (SPARK-4334) Utils.startServiceOnPort should check whether the tryPort is less than 1024
YanTang Zhai created SPARK-4334: --- Summary: Utils.startServiceOnPort should check whether the tryPort is less than 1024 Key: SPARK-4334 URL: https://issues.apache.org/jira/browse/SPARK-4334 Project: Spark Issue Type: Bug Components: Spark Core Reporter: YanTang Zhai Priority: Minor Utils.startServiceOnPort should check whether the tryPort is less than 1024. If the next attempted port is less than 1024, a SocketException with "Permission denied" will be thrown.
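A minimal sketch of the proposed guard. The helper name `nextTryPort` and its shape are assumptions for illustration, not Spark's actual `Utils.startServiceOnPort`; the point is only that each retry's candidate port should be rejected before the bind attempt when it falls below 1024.

```scala
// Illustrative guard: unprivileged processes cannot bind ports below 1024,
// so failing fast with a clear message beats a raw "Permission denied".
// Port 0 is left alone because it conventionally means "any free port".
def nextTryPort(startPort: Int, attempt: Int, maxPort: Int = 65535): Int = {
  val candidate = startPort + attempt
  require(candidate >= 1024 || startPort == 0,
    s"Port $candidate is below 1024; binding would fail with 'Permission denied'")
  require(candidate <= maxPort, s"Port $candidate exceeds $maxPort")
  candidate
}
```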
[jira] [Updated] (SPARK-4306) LogisticRegressionWithLBFGS support for PySpark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varadharajan updated SPARK-4306: Description: Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm. (was: Currently we support LogisticRegressionWithSGD in the PySpark MLlib interface. This task is to add support for the LogisticRegressionWithLBFGS algorithm and include examples.) > LogisticRegressionWithLBFGS support for PySpark MLlib > -- > > Key: SPARK-4306 > URL: https://issues.apache.org/jira/browse/SPARK-4306 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Varadharajan > Labels: newbie > Original Estimate: 48h > Remaining Estimate: 48h > > Currently we support LogisticRegressionWithSGD in the PySpark MLlib > interface. This task is to add support for the LogisticRegressionWithLBFGS > algorithm.
[jira] [Updated] (SPARK-4333) Correctly log number of iterations in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4333: -- Summary: Correctly log number of iterations in RuleExecutor (was: change num of iteration printed in trace log from ${iteration} to ${iteration - 1}) > Correctly log number of iterations in RuleExecutor > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be $(iteration -1) not > (iteration) . > Log looks like "Fixed point reached for batch $(batch.name) after 3 > iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4333) Correctly log number of iterations in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4333: -- Description: RuleExecutor breaks, num of iteration should be $(iteration -1) not $(iteration) . Log looks like "Fixed point reached for batch $(batch.name) after 3 iterations.", but it did 2 iterations really! was: RuleExecutor breaks, num of iteration should be $(iteration -1) not (iteration) . Log looks like "Fixed point reached for batch $(batch.name) after 3 iterations.", but it did 2 iterations really! > Correctly log number of iterations in RuleExecutor > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be $(iteration -1) not > $(iteration) . > Log looks like "Fixed point reached for batch $(batch.name) after 3 > iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
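The off-by-one is easiest to see in a toy model of the fixed-point loop described above. This is illustrative only, not Spark's actual RuleExecutor code: the counter advances after each pass that changes the result, so when a pass detects no change the counter is one past the number of effective iterations, and the log must print `iteration - 1`.

```scala
// Toy fixed-point driver. `iteration` is the 1-based index of the pass we
// are about to run; it is bumped only after a pass that changed something.
def runToFixedPoint[A](start: A, step: A => A, maxIterations: Int = 100): (A, Int) = {
  var current = start
  var iteration = 1
  var done = false
  while (!done && iteration <= maxIterations) {
    val next = step(current)
    if (next == current) {
      done = true
      // The pass that merely confirms the fixed point should not count:
      // iteration is one past the passes that changed anything, hence - 1.
      println(s"Fixed point reached after ${iteration - 1} iterations.")
    } else {
      current = next
      iteration += 1
    }
  }
  (current, iteration - 1)
}
```

Stepping 5 toward 7 by increments takes two effective passes plus one confirming pass; the buggy message would say "after 3 iterations", the corrected one says "after 2".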
[jira] [Commented] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205853#comment-14205853 ] Apache Spark commented on SPARK-4333: - User 'DoingDone9' has created a pull request for this issue: https://github.com/apache/spark/pull/3180 > change num of iteration printed in trace log from ${iteration} to ${iteration > - 1} > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be $(iteration -1) not > (iteration) . > Log looks like "Fixed point reached for batch $(batch.name) after 3 > iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4333: -- Description: RuleExecutor breaks, num of iteration should be $(iteration -1) not (iteration) . Log looks like "Fixed point reached for batch $(batch.name) after 3 iterations.", but it did 2 iterations really! was: RuleExecutor breaks, num of iteration should be ${iteration -1} not {iteration} . Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really! > change num of iteration printed in trace log from ${iteration} to ${iteration > - 1} > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be $(iteration -1) not > (iteration) . > Log looks like "Fixed point reached for batch $(batch.name) after 3 > iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4333: -- Description: RuleExecutor breaks, num of iteration should be ${iteration -1} not {iteration} . Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really! was:RuleExecutor breaks, num of iteration should be {iteration -1} not {iteration} .Log looks like "Fixed point reached for batch {batch.name} after 3 iterations.", but it did 2 iterations really! > change num of iteration printed in trace log from ${iteration} to ${iteration > - 1} > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be ${iteration -1} not > {iteration} . > Log looks like "Fixed point reached for batch ${batch.name} after 3 > iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3495) Block replication fails continuously when the replication target node is dead
[ https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3495: - Fix Version/s: 1.1.1 > Block replication fails continuously when the replication target node is dead > - > > Key: SPARK-3495 > URL: https://issues.apache.org/jira/browse/SPARK-3495 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core, Streaming >Affects Versions: 1.0.2, 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 1.1.1, 1.2.0 > > > If a block manager (say, A) wants to replicate a block and the node chosen > for replication (say, B) is dead, then the attempt to send the block to B > fails. However, this continues to fail indefinitely. Even if the driver > learns about the demise of the B, A continues to try replicating to B and > failing miserably. > The reason behind this bug is that A initially fetches a list of peers from > the driver (when B was active), but never updates it after B is dead. This > affects Spark Streaming as its receiver uses block replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
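The remedy implied by the description, refreshing the peer list when replication to a cached peer fails, can be sketched as follows. The `Replicator` class and `fetchPeers` callback are hypothetical stand-ins for the BlockManager replication path, not Spark's real types.

```scala
// Sketch of the failure mode and fix: the replicating node caches its peer
// list, so a dead target would be retried forever; re-fetching the list from
// the driver on failure lets it pick a live peer instead.
class Replicator(fetchPeers: () => Seq[String], send: String => Boolean) {
  private var cachedPeers: Seq[String] = fetchPeers()

  def replicate(): Option[String] = {
    cachedPeers.find(send) match {
      case some @ Some(_) => some
      case None =>
        // All cached peers failed: refresh instead of retrying stale entries.
        cachedPeers = fetchPeers()
        cachedPeers.find(send)
    }
  }
}
```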
[jira] [Updated] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication
[ https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3496: - Fix Version/s: 1.1.1 > Block replication can by mistake choose driver BlockManager as a peer for > replication > - > > Key: SPARK-3496 > URL: https://issues.apache.org/jira/browse/SPARK-3496 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core, Streaming >Affects Versions: 1.0.2, 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 1.1.1, 1.2.0 > > > When selecting peer block managers for replicating a block, the driver block > manager can also get chosen accidentally. This is because > BlockManagerMasterActor did not filter out the driver block manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
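The fix described above amounts to filtering the driver's block manager out of the candidate list on the master side. A hedged sketch with a simplified stand-in for `BlockManagerId` (the real class and the exact driver-id convention differ):

```scala
// Simplified stand-in for Spark's BlockManagerId; the "driver" executor id
// here is an assumption for the example.
case class BlockManagerId(executorId: String, host: String, port: Int) {
  def isDriver: Boolean = executorId == "driver"
}

// Candidate peers for replication: everyone except the driver and ourselves.
def replicationPeers(all: Seq[BlockManagerId], self: BlockManagerId): Seq[BlockManagerId] =
  all.filterNot(id => id.isDriver || id == self)
```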
[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4333: -- Description: RuleExecutor breaks, num of iteration should be {iteration -1} not {iteration} .Log looks like "Fixed point reached for batch {batch.name} after 3 iterations.", but it did 2 iterations really! (was: RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really!) > change num of iteration printed in trace log from ${iteration} to ${iteration > - 1} > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be {iteration -1} not > {iteration} .Log looks like "Fixed point reached for batch {batch.name} after > 3 iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}
DoingDone9 created SPARK-4333: - Summary: change num of iteration printed in trace log from ${iteration} to ${iteration - 1} Key: SPARK-4333 URL: https://issues.apache.org/jira/browse/SPARK-4333 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor
[jira] [Updated] (SPARK-4333) change num of iteration printed in trace log from ${iteration} to ${iteration - 1}
[ https://issues.apache.org/jira/browse/SPARK-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4333: -- Description: RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really! > change num of iteration printed in trace log from ${iteration} to ${iteration > - 1} > -- > > Key: SPARK-4333 > URL: https://issues.apache.org/jira/browse/SPARK-4333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} > after 3 iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205825#comment-14205825 ] maji2014 edited comment on SPARK-4314 at 11/11/14 2:32 AM: --- Yes, not all of these intermediate states are caught. I want to add the following code to the defaultFilter method in FileInputDStream. Any suggestions? was (Author: maji2014): Yes, Not all of this intermediate state are caught. i wanna add following code into CustomPathFilter method under FileInputDStream class to filter this state from FileInputDStream interface. if(path.getName.endsWith("\_COPYING\_")){ logDebug ") return false } Any suggestions? > Exception when textFileStream attempts to read deleted _COPYING_ file > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014 > > [Reproduce] > 1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))" > 2. Upload file to hdfs(reason as followings) > 3. Exception as followings.
> [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > sc
[jira] [Closed] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it
[ https://issues.apache.org/jira/browse/SPARK-4332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 closed SPARK-4332. - Resolution: Invalid > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} > after 3 iterations.", but it did 2 iterations really! > > > Key: SPARK-4332 > URL: https://issues.apache.org/jira/browse/SPARK-4332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} > after 3 iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it
[ https://issues.apache.org/jira/browse/SPARK-4332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4332: -- Description: RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really! (was: RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.nam) > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} > after 3 iterations.", but it did 2 iterations really! > > > Key: SPARK-4332 > URL: https://issues.apache.org/jira/browse/SPARK-4332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} > after 3 iterations.", but it did 2 iterations really! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it
[ https://issues.apache.org/jira/browse/SPARK-4332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-4332: -- Description: RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.nam > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} > after 3 iterations.", but it did 2 iterations really! > > > Key: SPARK-4332 > URL: https://issues.apache.org/jira/browse/SPARK-4332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: DoingDone9 >Priority: Minor > > RuleExecutor breaks, num of iteration should be ${iteration -1} not > ${iteration} .Log looks like "Fixed point reached for batch ${batch.nam -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4332) RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it
DoingDone9 created SPARK-4332: - Summary: RuleExecutor breaks, num of iteration should be ${iteration -1} not ${iteration} .Log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really! Key: SPARK-4332 URL: https://issues.apache.org/jira/browse/SPARK-4332 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: DoingDone9 Priority: Minor
[jira] [Commented] (SPARK-2775) HiveContext does not support dots in column names.
[ https://issues.apache.org/jira/browse/SPARK-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205829#comment-14205829 ] Josh Rosen commented on SPARK-2775: --- Chatted with [~marmbrus] about this and he pointed me to a workaround: use {{applySchema}} to replace the dots with some other separator (underscore, in this case):
{code}
import sqlContext.applySchema

/** Replaces . in column names with _ (underscore) */
protected def cleanSchema(dataType: DataType): DataType = dataType match {
  case StructType(fields) =>
    // Note: replaceAll takes a regex, so the dot must be escaped.
    StructType(fields.map(f =>
      f.copy(name = f.name.replaceAll("\\.", "_"), dataType = cleanSchema(f.dataType))))
  case ArrayType(elemType, nullable) => ArrayType(cleanSchema(elemType), nullable)
  case NullType => StringType
  case other => other
}

/** Replaces . signs in column names with _ */
protected def cleanSchema(schemaRDD: SchemaRDD): SchemaRDD = {
  applySchema(schemaRDD, cleanSchema(schemaRDD.schema).asInstanceOf[StructType])
}
{code}
I've translated this to Python:
{code}
from pyspark.sql import *
from copy import copy

def cleanSchema(schemaRDD):
    def renameField(field):
        field = copy(field)
        field.name = field.name.replace(".", "_")
        field.dataType = doCleanSchema(field.dataType)
        return field

    def doCleanSchema(dataType):
        dataType = copy(dataType)
        if isinstance(dataType, StructType):
            dataType.fields = [renameField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = doCleanSchema(dataType.elementType)
        return dataType

    return sqlContext.applySchema(schemaRDD.map(lambda x: x), doCleanSchema(schemaRDD.schema()))

print cleanSchema(resultsRDD)
{code}
> HiveContext does not support dots in column names. > --- > > Key: SPARK-2775 > URL: https://issues.apache.org/jira/browse/SPARK-2775 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > When you try the following snippet in hive/console.
> {code} > val data = sc.parallelize(Seq("""{"key.number1": "value1", "key.number2": > "value2"}""")) > jsonRDD(data).registerAsTable("jt") > hql("select `key.number1` from jt") > {code} > You will find the name of key.number1 cannot be resolved. > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved > attributes: 'key.number1, tree: > Project ['key.number1] > LowerCaseSchema > Subquery jt >SparkLogicalPlan (ExistingRdd [key.number1#8,key.number2#9], MappedRDD[17] > at map at JsonRDD.scala:37) > {code}
[jira] [Comment Edited] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205825#comment-14205825 ] maji2014 edited comment on SPARK-4314 at 11/11/14 2:16 AM: --- Yes, not all of these intermediate states are caught. I want to add the following code to the CustomPathFilter method in the FileInputDStream class to filter this state out at the FileInputDStream interface: if (path.getName.endsWith("\_COPYING\_")) { logDebug(...); return false } Any suggestions? was (Author: maji2014): Yes, Not all of this intermediate state are caught. i wanna add following code into CustomPathFilter method under FileInputDStream class to filter this state from FileInputDStream interface. if(path.getName.endsWith("_COPYING_")){ logDebug ") return false } Any suggestions? > Exception when textFileStream attempts to read deleted _COPYING_ file > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014 > > [Reproduce] > 1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))" > 2. Upload file to hdfs(reason as followings) > 3. Exception as followings.
> [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) > at > org.ap
[jira] [Commented] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205825#comment-14205825 ] maji2014 commented on SPARK-4314: - Yes, not all of these intermediate states are caught. I want to add the following code to the CustomPathFilter method in the FileInputDStream class to filter this state out:

if (path.getName.endsWith("_COPYING_")) {
  logDebug(...)
  return false
}

Any suggestions? > Exception when textFileStream attempts to read deleted _COPYING_ file > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014
[jira] [Comment Edited] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205774#comment-14205774 ] maji2014 edited comment on SPARK-4314 at 11/11/14 2:07 AM: --- The actual behavior is that the intermediate _COPYING_ file is picked up by the FileInputDStream interface when I upload a normal file, not only when I upload an intermediate file directly. Take the following as an example; three files aa, bb, cc are uploaded normally:

hdfs://master:9000/user/spark/aa
14/11/09 23:16:44 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/aa:0+8
hdfs://master:9000/user/spark/bb
14/11/09 23:16:58 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/bb:0+7
hdfs://master:9000/user/spark/cc._COPYING_
14/11/09 23:17:06 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc._COPYING_:0+0
hdfs://master:9000/user/spark/cc
14/11/09 23:17:08 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc:0+6

File cc is then counted twice, because cc._COPYING_ is also counted. The issue exists in a special scenario. As a simple test, upload a file named directly like dd._COPYING_ (sometimes this intermediate state is read by the FileInputDStream interface); the file is still counted, which is not the expected behavior. > Exception when textFileStream attempts to read deleted _COPYING_ file > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014
[jira] [Commented] (SPARK-4331) SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
[ https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205816#comment-14205816 ] Kousuke Saruta commented on SPARK-4331: --- [~pwendell] You mean the Scala 2.11 patch introduces changes that include more of this non-standard directory structure, so it's not good to fix this issue for now, right? > SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 > > > Key: SPARK-4331 > URL: https://issues.apache.org/jira/browse/SPARK-4331 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle > plugin, so scalastyle doesn't work for sources under those directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4331) SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
[ https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-4331: -- Summary: SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 (was: Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1) > SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 > > > Key: SPARK-4331 > URL: https://issues.apache.org/jira/browse/SPARK-4331 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle > plugin, so scalastyle doesn't work for sources under those directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
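For context, sbt-based style checks typically derive their source list from the standard src/main/scala layout, which is why the version-specific hive directories are skipped. A hedged sketch of one conceivable workaround (the paths below are assumptions from the issue description, not the actual fix) is to register the extra directories explicitly in the hive subproject's sbt settings; whether the scalastyle plugin then picks them up depends on how it resolves its source set (it has its own scalastyleSources setting), so this is only an illustration:

```scala
// Hypothetical sbt fragment for the hive subproject (sbt 0.13-style syntax).
// The v0.12.0/v0.13.1 paths are assumed from the issue description.
unmanagedSourceDirectories in Compile ++= Seq(
  baseDirectory.value / "v0.12.0" / "src" / "main" / "scala",
  baseDirectory.value / "v0.13.1" / "src" / "main" / "scala"
)
```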
[jira] [Commented] (SPARK-4316) Utils.isBindCollision misjudges at Non-English environment
[ https://issues.apache.org/jira/browse/SPARK-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205810#comment-14205810 ] YanTang Zhai commented on SPARK-4316: - Thanks @Sean Owen > Utils.isBindCollision misjudges at Non-English environment > -- > > Key: SPARK-4316 > URL: https://issues.apache.org/jira/browse/SPARK-4316 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: YanTang Zhai >Priority: Minor > > export LANG=zh_CN.utf8 > export LC_ALL=zh_CN.utf8 > Then Utils.isBindCollision misjudges since BindException's message is > "地址已在使用" not "Address already in use". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
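The underlying problem in SPARK-4316 is that matching on an exception's message text is locale-dependent. A minimal sketch of a locale-independent check, written as an illustration rather than the actual Spark patch: test the exception's type (and its cause chain) instead of its message.

```scala
import java.net.BindException

// Detect a bind collision by exception type rather than by the
// locale-dependent message text ("Address already in use" vs. "地址已在使用").
// Walks the cause chain because the BindException may be wrapped.
def isBindCollision(exception: Throwable): Boolean = exception match {
  case _: BindException => true
  case e if e.getCause != null => isBindCollision(e.getCause)
  case _ => false
}
```

This check behaves identically under LANG=zh_CN.utf8, since no message string is inspected.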
[jira] [Updated] (SPARK-4314) Exception when textFileStream attempts to read deleted _COPYING_ file
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4314: --- Summary: Exception when textFileStream attempts to read deleted _COPYING_ file (was: Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface) > Exception when textFileStream attempts to read deleted _COPYING_ file > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014
[jira] [Comment Edited] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205799#comment-14205799 ] Patrick Wendell edited comment on SPARK-4314 at 11/11/14 1:56 AM: -- Can you add a filter that excludes COPYING files? My guess is that at one time it lists the file and at another time it tries to open it. IIRC Spark Streaming supports filters to remove ephemeral files like this. was (Author: pwendell): Can you add a filter that excludes COPYING files? > Exception throws when the upload intermediate file(_COPYING_ file) is read > through hdfs interface > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014
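Patrick Wendell's filter suggestion can be sketched as follows. This is a hedged illustration: textFileStream itself exposes no filter parameter, but the lower-level fileStream does accept a Path => Boolean filter, and the predicate below (the helper name is made up here) could be passed to it.

```scala
// "_COPYING_" is the suffix the HDFS client gives the temporary file while
// `hdfs dfs -put` is still writing; the file is renamed away on completion.
def isIntermediateCopy(fileName: String): Boolean =
  fileName.endsWith("_COPYING_")

// Hypothetical wiring into Spark Streaming (fileStream, unlike
// textFileStream, takes a path filter):
//   ssc.fileStream[LongWritable, Text, TextInputFormat](
//     directory,
//     (path: Path) => !isIntermediateCopy(path.getName),
//     newFilesOnly = true)
```

With such a filter in place, cc._COPYING_ from the earlier log would be skipped and cc would be counted once.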
[jira] [Updated] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4314: --- Component/s: Streaming > Exception throws when the upload intermediate file(_COPYING_ file) is read > through hdfs interface > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: maji2014
[jira] [Commented] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205799#comment-14205799 ] Patrick Wendell commented on SPARK-4314: Can you add a filter that excludes COPYING files? > Exception throws when the upload intermediate file(_COPYING_ file) is read > through hdfs interface > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug >Reporter: maji2014 > > [Reproduce] > 1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))" > 2. Upload file to hdfs(reason as followings) > 3. Exception as followings. > [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversabl
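The filter Patrick suggests can be illustrated outside Spark. A minimal Python sketch (the `is_finalized` name and the `._COPYING_` suffix check are assumptions based on the paths in this report; Spark's actual file source filters Hadoop `Path` objects, not strings):

```python
def is_finalized(path: str) -> bool:
    # `hdfs dfs -put` writes to a temporary name ending in "._COPYING_"
    # and renames it once the upload completes. Treating such names as
    # not-yet-visible keeps the streaming source from reading them
    # (and from counting the same file twice after the rename).
    name = path.rsplit("/", 1)[-1]
    return not name.endswith("._COPYING_")

paths = [
    "hdfs://master:9000/user/spark/cc._COPYING_",
    "hdfs://master:9000/user/spark/cc",
]
visible = [p for p in paths if is_finalized(p)]
# visible keeps only the finalized "cc" path
```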
[jira] [Resolved] (SPARK-4274) NPE in printing the details of query plan
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4274. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3139 [https://github.com/apache/spark/pull/3139] > NPE in printing the details of query plan > - > > Key: SPARK-4274 > URL: https://issues.apache.org/jira/browse/SPARK-4274 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > Fix For: 1.2.0 > > > NPE in printing the details of query plan, if the query is not valid. This > will be greatly helpful in the Hive comparison test, which could provide more > information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4331) Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
[ https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205795#comment-14205795 ] Patrick Wendell commented on SPARK-4331: This is going to be exacerbated by the Scala 2.11 patch, which will add more of these nonstandard directory structures. > Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 > > > Key: SPARK-4331 > URL: https://issues.apache.org/jira/browse/SPARK-4331 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle > plugin, so scalastyle doesn't work for sources under those directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4331) Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
Kousuke Saruta created SPARK-4331: - Summary: Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 Key: SPARK-4331 URL: https://issues.apache.org/jira/browse/SPARK-4331 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.3.0 Reporter: Kousuke Saruta v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle plugin, so scalastyle doesn't work for sources under those directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4274) NPE in printing the details of query plan
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4274: - Summary: NPE in printing the details of query plan (was: NEP in printing the details of query plan) > NPE in printing the details of query plan > - > > Key: SPARK-4274 > URL: https://issues.apache.org/jira/browse/SPARK-4274 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > NEP in printing the details of query plan, if the query is not valid. this > will great helpful in Hive comparison test, which could provide more > information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4274) NPE in printing the details of query plan
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4274: - Description: NPE in printing the details of query plan, if the query is not valid. This will be great helpful in Hive comparison test, which could provide more information. (was: NEP in printing the details of query plan, if the query is not valid. this will great helpful in Hive comparison test, which could provide more information.) > NPE in printing the details of query plan > - > > Key: SPARK-4274 > URL: https://issues.apache.org/jira/browse/SPARK-4274 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > NPE in printing the details of query plan, if the query is not valid. This > will be great helpful in Hive comparison test, which could provide more > information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4274) NEP in printing the details of query plan
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4274: - Description: NEP in printing the details of query plan, if the query is not valid. this will great helpful in Hive comparison test, which could provide more information. (was: Hive comparison test frame doesn't print informative message, if the unit test failed in logical plan analyzing.) > NEP in printing the details of query plan > - > > Key: SPARK-4274 > URL: https://issues.apache.org/jira/browse/SPARK-4274 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > NEP in printing the details of query plan, if the query is not valid. this > will great helpful in Hive comparison test, which could provide more > information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4274) NEP in printing the details of query plan
[ https://issues.apache.org/jira/browse/SPARK-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4274: - Summary: NEP in printing the details of query plan (was: Hive comparison test framework doesn't print effective information while logical plan analyzing failed ) > NEP in printing the details of query plan > - > > Key: SPARK-4274 > URL: https://issues.apache.org/jira/browse/SPARK-4274 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Hive comparison test frame doesn't print informative message, if the unit > test failed in logical plan analyzing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205774#comment-14205774 ] maji2014 edited comment on SPARK-4314 at 11/11/14 1:48 AM: --- The actual behavior is that the intermediate _COPYING_ file is found by the FileInputDStream interface when I upload a normal file rather than an intermediate file. Take the following as an example, normally uploading three files aa, bb, cc ** hdfs://master:9000/user/spark/aa 14/11/09 23:16:44 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/aa:0+8 hdfs://master:9000/user/spark/bb 14/11/09 23:16:58 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/bb:0+7 hdfs://master:9000/user/spark/cc._COPYING_ 14/11/09 23:17:06 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc._COPYING_:0+0 hdfs://master:9000/user/spark/cc 14/11/09 23:17:08 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc:0+6 ** File cc is then counted twice because cc._COPYING_ is also counted. The issue only occurs in this specific scenario. was (Author: maji2014): The actual behavior is the intermediate file _COPYING_ is found by FileInputDStream interface when i upload a normal file rather than uploading a intermediate file. take following as examples: ** hdfs://master:9000/user/spark/aa 14/11/09 23:16:44 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/aa:0+8 hdfs://master:9000/user/spark/bb 14/11/09 23:16:58 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/bb:0+7 hdfs://master:9000/user/spark/cc._COPYING_ 14/11/09 23:17:06 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc._COPYING_:0+0 hdfs://master:9000/user/spark/cc 14/11/09 23:17:08 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc:0+6 ** and then file cc is counted two times. The issue exists in a special scenario. 
> Exception throws when the upload intermediate file(_COPYING_ file) is read > through hdfs interface > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug >Reporter: maji2014 > > [Reproduce] > 1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))" > 2. Upload file to hdfs(reason as followings) > 3. Exception as followings. > [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DS
[jira] [Updated] (SPARK-4314) Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] maji2014 updated SPARK-4314: Summary: Exception throws when the upload intermediate file(_COPYING_ file) is read through hdfs interface (was: Exception throws when finding new files like intermediate result(_COPYING_ file) through hdfs interface) > Exception throws when the upload intermediate file(_COPYING_ file) is read > through hdfs interface > - > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug >Reporter: maji2014 > > [Reproduce] > 1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))" > 2. Upload file to hdfs(reason as followings) > 3. Exception as followings. > [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collec
[jira] [Updated] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
[ https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4119: Target Version/s: 1.3.0 (was: 1.2.0) > Don't rely on HIVE_DEV_HOME to find .q files > > > Key: SPARK-4119 > URL: https://issues.apache.org/jira/browse/SPARK-4119 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > After merging in Hive 0.13.1 support, a bunch of .q files and golden answer > files got updated. Unfortunately, some .q files were also updated in Hive. For example, > an ORDER BY clause was added to groupby1_limit.q for a bug fix. > With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with > false test failures, because .q files are looked up from HIVE_DEV_HOME and > outdated .q files are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
[ https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4119: Assignee: Cheng Lian > Don't rely on HIVE_DEV_HOME to find .q files > > > Key: SPARK-4119 > URL: https://issues.apache.org/jira/browse/SPARK-4119 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > After merging in Hive 0.13.1 support, a bunch of .q files and golden answer > files got updated. Unfortunately, some .q files were also updated in Hive. For example, > an ORDER BY clause was added to groupby1_limit.q for a bug fix. > With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with > false test failures, because .q files are looked up from HIVE_DEV_HOME and > outdated .q files are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
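The failure mode above comes down to lookup order. A hedged Python sketch of the kind of resolution the issue argues against (the function name and directory layout are hypothetical, chosen only to illustrate the pitfall):

```python
import os

def locate_q_file(name: str, bundled_dir: str) -> str:
    # If HIVE_DEV_HOME points at a Hive 0.12.0 checkout, its copy of a
    # .q file (e.g. groupby1_limit.q) wins over the copy bundled with
    # Spark, even though the bundled copy was updated for the Hive
    # 0.13.1 merge -- hence the false test failures.
    dev_home = os.environ.get("HIVE_DEV_HOME")
    if dev_home:
        return os.path.join(dev_home, "ql", "src", "test", "queries", name)
    return os.path.join(bundled_dir, name)
```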
[jira] [Commented] (SPARK-4314) Exception throws when finding new files like intermediate result(_COPYING_ file) through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205774#comment-14205774 ] maji2014 commented on SPARK-4314: - The actual behavior is that the intermediate _COPYING_ file is found by the FileInputDStream interface when I upload a normal file rather than an intermediate file. Take the following as an example: ** hdfs://master:9000/user/spark/aa 14/11/09 23:16:44 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/aa:0+8 hdfs://master:9000/user/spark/bb 14/11/09 23:16:58 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/bb:0+7 hdfs://master:9000/user/spark/cc._COPYING_ 14/11/09 23:17:06 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc._COPYING_:0+0 hdfs://master:9000/user/spark/cc 14/11/09 23:17:08 INFO NewHadoopRDD: Input split: hdfs://master:9000/user/spark/cc:0+6 ** File cc is then counted twice. The issue only occurs in this specific scenario. > Exception throws when finding new files like intermediate result(_COPYING_ > file) through hdfs interface > --- > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug >Reporter: maji2014 > > [Reproduce] > 1. Run HdfsWordCount interface, such as "ssc.textFileStream(args(0))" > 2. Upload file to hdfs(reason as followings) > 3. Exception as followings. 
> [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.g
[jira] [Updated] (SPARK-3954) Optimization to FileInputDStream
[ https://issues.apache.org/jira/browse/SPARK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3954: - Fix Version/s: 1.1.1 > Optimization to FileInputDStream > > > Key: SPARK-3954 > URL: https://issues.apache.org/jira/browse/SPARK-3954 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0 >Reporter: 宿荣全 > Fix For: 1.1.1, 1.2.0 > > > When converting files to RDDs, the Spark source makes 3 passes over the files > sequence: > 1. files.map(...) > 2. files.zip(fileRDDs) > 3. a foreach over the zipped pairs > This change reduces the 3 passes to 1. > Spark source code: > private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { > val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, > V, F](file)) > files.zip(fileRDDs).foreach { case (file, rdd) => { > if (rdd.partitions.size == 0) { > logError("File " + file + " has no data in it. Spark Streaming can > only ingest " + > "files that have been \"moved\" to the directory assigned to the > file stream. " + > "Refer to the streaming programming guide for more details.") > } > }} > new UnionRDD(context.sparkContext, fileRDDs) > } > // > --- > Modified code: > private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { > val fileRDDs = for (file <- files; rdd = > context.sparkContext.newAPIHadoopFile[K, V, F](file)) > yield { > if (rdd.partitions.size == 0) { > logError("File " + file + " has no data in it. Spark Streaming can > only ingest " + > "files that have been \"moved\" to the directory assigned to the > file stream. " + > "Refer to the streaming programming guide for more details.") > } > rdd > } > new UnionRDD(context.sparkContext, fileRDDs) > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3954) Optimization to FileInputDStream
[ https://issues.apache.org/jira/browse/SPARK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3954. -- Resolution: Fixed Fix Version/s: 1.2.0 > Optimization to FileInputDStream > > > Key: SPARK-3954 > URL: https://issues.apache.org/jira/browse/SPARK-3954 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0, 1.1.0 >Reporter: 宿荣全 > Fix For: 1.2.0 > > > When converting files to RDDs, the Spark source makes 3 passes over the files > sequence: > 1. files.map(...) > 2. files.zip(fileRDDs) > 3. a foreach over the zipped pairs > This change reduces the 3 passes to 1. > Spark source code: > private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { > val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, > V, F](file)) > files.zip(fileRDDs).foreach { case (file, rdd) => { > if (rdd.partitions.size == 0) { > logError("File " + file + " has no data in it. Spark Streaming can > only ingest " + > "files that have been \"moved\" to the directory assigned to the > file stream. " + > "Refer to the streaming programming guide for more details.") > } > }} > new UnionRDD(context.sparkContext, fileRDDs) > } > // > --- > Modified code: > private def filesToRDD(files: Seq[String]): RDD[(K, V)] = { > val fileRDDs = for (file <- files; rdd = > context.sparkContext.newAPIHadoopFile[K, V, F](file)) > yield { > if (rdd.partitions.size == 0) { > logError("File " + file + " has no data in it. Spark Streaming can > only ingest " + > "files that have been \"moved\" to the directory assigned to the > file stream. " + > "Refer to the streaming programming guide for more details.") > } > rdd > } > new UnionRDD(context.sparkContext, fileRDDs) > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
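The refactor in the description above, collapsing the map/zip/foreach traversals into one, is language-independent. A Python sketch of the same shape (the `FakeRDD` stand-in and `load` callback are hypothetical; in Spark the loaded objects come from `newAPIHadoopFile`):

```python
class FakeRDD:
    # Hypothetical stand-in for a loaded Hadoop RDD.
    def __init__(self, num_partitions):
        self.partitions = list(range(num_partitions))

def files_to_rdds(files, load):
    # One traversal instead of three (map, zip, foreach), mirroring the
    # proposed for/yield: load each file, warn when it has no data, and
    # collect the result for a subsequent union.
    rdds = []
    for f in files:
        rdd = load(f)
        if len(rdd.partitions) == 0:
            print(f"File {f} has no data in it.")  # logError in Spark
        rdds.append(rdd)
    return rdds  # Spark then wraps these in a UnionRDD

rdds = files_to_rdds(["aa", "bb"], lambda f: FakeRDD(2 if f == "aa" else 0))
```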
[jira] [Resolved] (SPARK-4149) ISO 8601 support for json date time strings
[ https://issues.apache.org/jira/browse/SPARK-4149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4149. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3012 [https://github.com/apache/spark/pull/3012] > ISO 8601 support for json date time strings > --- > > Key: SPARK-4149 > URL: https://issues.apache.org/jira/browse/SPARK-4149 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.2.0 > > > parse json string like "2014-10-29T20:05:00-08:00" or "2014-10-29T20:05:00Z". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
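The two timestamp shapes in the description differ only in how the zone is written: an explicit offset versus the "Z" (UTC) designator. The merged fix lives in Spark SQL's Scala JSON parser; purely as an illustration of parsing both forms, a Python sketch:

```python
from datetime import datetime, timedelta

def parse_iso8601(s: str) -> datetime:
    # Normalize a trailing "Z" to an explicit offset; older versions of
    # datetime.fromisoformat() accept "+00:00" but not "Z".
    if s.endswith("Z"):
        s = s[:-1] + "+00:00"
    return datetime.fromisoformat(s)

pst = parse_iso8601("2014-10-29T20:05:00-08:00")  # zone as an offset
utc = parse_iso8601("2014-10-29T20:05:00Z")       # zone as "Z"
# Same wall-clock digits, but the first instant is 8 hours later in UTC.
```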
[jira] [Resolved] (SPARK-4250) Create constant null value for Hive Inspectors
[ https://issues.apache.org/jira/browse/SPARK-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4250. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3114 [https://github.com/apache/spark/pull/3114] > Create constant null value for Hive Inspectors > -- > > Key: SPARK-4250 > URL: https://issues.apache.org/jira/browse/SPARK-4250 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > Fix For: 1.2.0 > > > Constant null value is not accepted while creating the Hive Inspectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2205: Target Version/s: 1.3.0 (was: 1.2.0) > Unnecessary exchange operators in a join on multiple tables with the same > join key. > --- > > Key: SPARK-2205 > URL: https://issues.apache.org/jira/browse/SPARK-2205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > > {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (y.key=z.key)") > SchemaRDD[1] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] > HashJoin [key#6], [key#8], BuildRight > Exchange (HashPartitioning [key#6], 200) >HashJoin [key#4], [key#6], BuildRight > Exchange (HashPartitioning [key#4], 200) > HiveTableScan [key#4,value#5], (MetastoreRelation default, src, > Some(x)), None > Exchange (HashPartitioning [key#6], 200) > HiveTableScan [key#6,value#7], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#8], 200) >HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), > None > {code} > However, this is fine... 
> {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (x.key=z.key)") > res5: org.apache.spark.sql.SchemaRDD = > SchemaRDD[5] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] > HashJoin [key#26], [key#30], BuildRight > HashJoin [key#26], [key#28], BuildRight >Exchange (HashPartitioning [key#26], 200) > HiveTableScan [key#26,value#27], (MetastoreRelation default, src, > Some(x)), None >Exchange (HashPartitioning [key#28], 200) > HiveTableScan [key#28,value#29], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#30], 200) >HiveTableScan [key#30,value#31], (MetastoreRelation default, src, > Some(z)), None > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205731#comment-14205731 ] Apache Spark commented on SPARK-3461: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3198 > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Critical > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
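The lookahead-iterator idea described above can be shown in miniature. Assuming the shuffle has already sorted each partition by key (what repartitionAndSortWithinPartitions guarantees), groups can be streamed out without buffering any group up front; in Python, `itertools.groupby` performs exactly that one-record lookahead:

```python
from itertools import groupby
from operator import itemgetter

def group_sorted_partition(records):
    # `records` is a key-sorted iterator of (key, value) pairs, as a
    # post-shuffle partition would be. groupby detects each group
    # boundary by looking one record ahead, so values are yielded
    # lazily instead of being materialized per key.
    for key, pairs in groupby(records, key=itemgetter(0)):
        yield key, (v for _, v in pairs)

partition = iter([("a", 1), ("a", 2), ("b", 3)])
grouped = [(k, list(vs)) for k, vs in group_sorted_partition(partition)]
# grouped pairs each key with its values: a -> [1, 2], b -> [3]
```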
[jira] [Resolved] (SPARK-4308) SQL operation state is not properly set when exception is thrown
[ https://issues.apache.org/jira/browse/SPARK-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4308. - Resolution: Fixed Fix Version/s: 1.1.1 1.2.0 Issue resolved by pull request 3175 [https://github.com/apache/spark/pull/3175] > SQL operation state is not properly set when exception is thrown > > > Key: SPARK-4308 > URL: https://issues.apache.org/jira/browse/SPARK-4308 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian >Priority: Minor > Fix For: 1.2.0, 1.1.1 > > > In {{HiveThriftServer2}}, when an exception is thrown during a SQL execution, > the SQL operation state should be set to {{ERROR}}, but now it remains > {{RUNNING}}. This affects the result of the GetOperationStatus Thrift API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4202) DSL support for Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4202. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Cheng Lian > DSL support for Scala UDF > - > > Key: SPARK-4202 > URL: https://issues.apache.org/jira/browse/SPARK-4202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 1.2.0 > > > Using Scala UDF with current DSL API is quite verbose, e.g.: > {code} > case class KeyValue(key: Int, value: String) > val schemaRDD = sc.parallelize(1 to 10).map(i => KeyValue(i, > i.toString)).toSchemaRDD > def foo = (a: Int, b: String) => a.toString + b > schemaRDD.select( // SELECT > Star(None), // *, > ScalaUdf( // > foo, // foo( > StringType, // > 'key.attr :: 'value.attr :: Nil) // key, value > ).collect() // ) FROM ... > {code} > It would be good to add a DSL syntax to simplify UDF invocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4330) Link to proper URL for YARN overview
[ https://issues.apache.org/jira/browse/SPARK-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205680#comment-14205680 ] Apache Spark commented on SPARK-4330: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3196 > Link to proper URL for YARN overview > > > Key: SPARK-4330 > URL: https://issues.apache.org/jira/browse/SPARK-4330 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Priority: Minor > > In running-on-yarn.md, the link to the YARN overview points to the YARN alpha documentation. > It should point to the stable documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4330) Link to proper URL for YARN overview
Kousuke Saruta created SPARK-4330: - Summary: Link to proper URL for YARN overview Key: SPARK-4330 URL: https://issues.apache.org/jira/browse/SPARK-4330 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.3.0 Reporter: Kousuke Saruta Priority: Minor In running-on-yarn.md, the link to the YARN overview points to the YARN alpha documentation. It should point to the stable documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205637#comment-14205637 ] Joseph K. Bradley commented on SPARK-3717: -- [~codedeft] Are you asking me or [~bbnsumanth]? If me, I sketched out my thoughts above, but please let me know if parts are unclear. [~bbnsumanth] I'll look forward to seeing how things progress. > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. 
> * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
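The two cost estimates above are easy to check numerically. A minimal sketch (function and variable names are ours, not from the JIRA; byte counts are the rough estimates stated above):

```python
# Rough communication-cost estimates from the JIRA discussion, in bytes.
# Partitioning by feature:  levels * (M * 8 + N)   -- M 8-byte values reduced,
#                                                     N bits broadcast, per level
# Partitioning by instance: 2^D * M * B * C * 8    -- summed over the whole tree

def cost_partition_features(levels, num_features, num_instances):
    # Each of the D iterations reduces M ~8-byte values and broadcasts N bits.
    return levels * (num_features * 8 + num_instances)

def cost_partition_instances(depth, num_features, num_bins, num_classes):
    # ~2^D nodes in the tree, each aggregating M x B x C 8-byte values.
    return 2 ** depth * num_features * num_bins * num_classes * 8

# The JIRA's example: 2M instances, 3500 features, 100 bins, 5 classes,
# 6 levels (depth = 5). Note the feature estimate counts 6 levels while
# the instance estimate uses 2^5 = 32, matching the numbers in the JIRA.
by_feature = cost_partition_features(6, 3500, 2_000_000)    # 12_168_000, ~1.2e7
by_instance = cost_partition_instances(5, 3500, 100, 5)     # 448_000_000, ~4.5e8
```

Both values reproduce the estimates quoted in the comment, confirming that for this (many-instance) configuration, partitioning by feature wins by over an order of magnitude.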
[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205627#comment-14205627 ] Apache Spark commented on SPARK-4325: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/3195 > Improve spark-ec2 cluster launch times > -- > > Key: SPARK-4325 > URL: https://issues.apache.org/jira/browse/SPARK-4325 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > There are several optimizations we know we can make to [{{setup.sh}} | > https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches > faster. > There are also some improvements to the AMIs that will help a lot. > Potential improvements: > * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This > will reduce or eliminate SSH wait time and Ganglia init time. > * Replace instances of {{download; rsync to rest of cluster}} with parallel > downloads on all nodes of the cluster. > * Replace instances of > {code} > for node in $NODES; do > command > sleep 0.3 > done > wait{code} > with simpler calls to {{pssh}}. > * Remove the [linear backoff | > https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] > when we wait for SSH availability now that we are already waiting for EC2 > status checks to clear before testing SSH. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
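The "replace loop-and-sleep with {{pssh}}" suggestion above amounts to dispatching the command to all nodes concurrently rather than serially. A minimal Python sketch of the idea (the `run_on_node` callable is a hypothetical stand-in for the actual SSH invocation in `spark-ec2`):

```python
# Sketch of replacing the serial "for node: command; sleep 0.3" loop with
# parallel dispatch, in the spirit of pssh. run_on_node is a hypothetical
# callback; the real script would run ssh against each host here.
from concurrent.futures import ThreadPoolExecutor

def run_on_all(nodes, command, run_on_node, max_workers=32):
    # Fan the command out to every node at once and wait for all results.
    # pool.map preserves the input order of `nodes` in its results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda n: run_on_node(n, command), nodes))

# Example with a trivial stand-in for ssh:
results = run_on_all(["node1", "node2"], "uptime",
                     lambda node, cmd: (node, cmd))
```

With ~0.3 s of sleep per node eliminated and the commands overlapping, launch time for this step becomes roughly the slowest single node rather than the sum over all nodes.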
[jira] [Commented] (SPARK-4329) Add indexing feature for HistoryPage
[ https://issues.apache.org/jira/browse/SPARK-4329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205625#comment-14205625 ] Apache Spark commented on SPARK-4329: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3194 > Add indexing feature for HistoryPage > > > Key: SPARK-4329 > URL: https://issues.apache.org/jira/browse/SPARK-4329 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > The current HistoryPage has links only to the previous and next pages. > I suggest adding an index so that history pages can be accessed easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205626#comment-14205626 ] Apache Spark commented on SPARK-3398: - User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/3195 > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
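The blocking helper proposed above could look roughly like the following. This is a sketch only, assuming a hypothetical `get_state` callable (the real `spark_ec2.py` would query the EC2 API for each instance's state):

```python
# Sketch of the proposed wait-for-cluster-state helper for spark_ec2.py.
# get_state(node) is a hypothetical callable returning that node's current
# state string, e.g. "running" or "terminated".
import time

def wait_for_cluster_state(nodes, desired_state, get_state,
                           poll_interval_s=10, timeout_s=600):
    """Block until every node reports desired_state; raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(get_state(n) == desired_state for n in nodes):
            return
        time.sleep(poll_interval_s)
    raise TimeoutError("cluster did not reach state %r" % desired_state)
```

Each use case in the list then becomes one call, e.g. `wait_for_cluster_state(nodes, "running", get_state)` before installing, or `"terminated"` before deleting security groups, which is what would make the retry logic and the `--wait` parameter unnecessary.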
[jira] [Created] (SPARK-4329) Add indexing feature for HistoryPage
Kousuke Saruta created SPARK-4329: - Summary: Add indexing feature for HistoryPage Key: SPARK-4329 URL: https://issues.apache.org/jira/browse/SPARK-4329 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta The current HistoryPage has links only to the previous and next pages. I suggest adding an index so that history pages can be accessed easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4327) Python API for RDD.randomSplit()
[ https://issues.apache.org/jira/browse/SPARK-4327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205603#comment-14205603 ] Apache Spark commented on SPARK-4327: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3193 > Python API for RDD.randomSplit() > > > Key: SPARK-4327 > URL: https://issues.apache.org/jira/browse/SPARK-4327 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Davies Liu >Priority: Critical > > randomSplit() is used in MLlib to split the dataset into training and testing > set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
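The semantics being exposed are: given weights, assign each element to one split with probability proportional to its weight. A pure-Python sketch of those semantics (this is illustrative only, not the PySpark implementation, which splits an RDD distributively):

```python
# Pure-Python sketch of randomSplit() semantics: normalize the weights into
# cumulative boundaries on [0, 1], then route each element by one random draw.
import random
from bisect import bisect_left
from itertools import accumulate

def random_split(data, weights, seed=None):
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.7, 1.0] for a 70/30 split.
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against float rounding leaving the top < 1.0
    splits = [[] for _ in weights]
    for x in data:
        splits[bisect_left(bounds, rng.random())].append(x)
    return splits

train, test = random_split(range(100), [0.7, 0.3], seed=17)
```

Every element lands in exactly one split, which is what makes this suitable for carving a dataset into training and testing sets in MLlib.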
[jira] [Resolved] (SPARK-4319) Enable an ignored test "null count".
[ https://issues.apache.org/jira/browse/SPARK-4319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4319. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3185 [https://github.com/apache/spark/pull/3185] > Enable an ignored test "null count". > > > Key: SPARK-4319 > URL: https://issues.apache.org/jira/browse/SPARK-4319 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Takuya Ueshin >Priority: Minor > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3461: --- Assignee: Sandy Ryza (was: Davies Liu) > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Critical > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205520#comment-14205520 ] Patrick Wendell commented on SPARK-3461: I think [~sandyr] wanted to take a crack at this so I'm assigning to him. > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Critical > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
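The "lookahead iterator that detects the group boundaries" described above can be sketched in pure Python over a partition's key-sorted stream; `itertools.groupby` performs exactly this kind of boundary detection (this is an illustration of the idea, not Spark code):

```python
# Sketch of detecting group boundaries in a key-sorted (key, value) stream,
# as the proposed external groupByKey would do per partition after
# repartitionAndSortWithinPartitions. Pure Python, illustrative only.
from itertools import groupby
from operator import itemgetter

def group_sorted_stream(pairs):
    """Lazily yield (key, iterator-of-values) from key-sorted (k, v) pairs.

    Each group's values are streamed rather than materialized, which is
    the point of making the group-by "external": one huge group never has
    to fit in memory at once.
    """
    for key, run in groupby(pairs, key=itemgetter(0)):
        yield key, (v for _, v in run)

sorted_pairs = [("a", 1), ("a", 2), ("b", 3)]
grouped = [(k, list(vs)) for k, vs in group_sorted_stream(sorted_pairs)]
```

A group boundary is simply the position where the key changes, which is why sorting within partitions first (SPARK-2978) makes the whole scheme possible.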
[jira] [Updated] (SPARK-3495) Block replication fails continuously when the replication target node is dead
[ https://issues.apache.org/jira/browse/SPARK-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3495: - Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0) > Block replication fails continuously when the replication target node is dead > - > > Key: SPARK-3495 > URL: https://issues.apache.org/jira/browse/SPARK-3495 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core, Streaming >Affects Versions: 1.0.2, 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 1.2.0 > > > If a block manager (say, A) wants to replicate a block and the node chosen > for replication (say, B) is dead, then the attempt to send the block to B > fails. However, this continues to fail indefinitely. Even if the driver > learns about the demise of the B, A continues to try replicating to B and > failing miserably. > The reason behind this bug is that A initially fetches a list of peers from > the driver (when B was active), but never updates it after B is dead. This > affects Spark Streaming as its receiver uses block replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3496) Block replication can by mistake choose driver BlockManager as a peer for replication
[ https://issues.apache.org/jira/browse/SPARK-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3496: - Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0) > Block replication can by mistake choose driver BlockManager as a peer for > replication > - > > Key: SPARK-3496 > URL: https://issues.apache.org/jira/browse/SPARK-3496 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core, Streaming >Affects Versions: 1.0.2, 1.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > Fix For: 1.2.0 > > > When selecting peer block managers for replicating a block, the driver block > manager can also get chosen accidentally. This is because > BlockManagerMasterActor did not filter out the driver block manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.
[ https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205486#comment-14205486 ] Patrick Wendell commented on SPARK-2703: FYI I had to revert this patch because it looked like it broke our maven build > Make Tachyon related unit tests execute without deploying a Tachyon system > locally. > --- > > Key: SPARK-2703 > URL: https://issues.apache.org/jira/browse/SPARK-2703 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Haoyuan Li >Assignee: Rong Gu > Fix For: 1.2.0 > > > Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4328) Python serialization updates make Python ML API more brittle to types
[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4328. -- Resolution: Duplicate This is covered in the PR for SPARK-4324. > Python serialization updates make Python ML API more brittle to types > - > > Key: SPARK-4328 > URL: https://issues.apache.org/jira/browse/SPARK-4328 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > In Spark 1.1, you could create a LabeledPoint with labels specified as > integers, and then use it with LinearRegression. This was broken by the > Python API updates since then. E.g., this code runs in the 1.1 branch but > not in the current master: > {code} > from pyspark.mllib.regression import * > import numpy > features = numpy.ndarray((3)) > data = sc.parallelize([LabeledPoint(1, features)]) > LinearRegressionWithSGD.train(data) > {code} > Recommendation: Allow users to use integers from Python. > The error message you get is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o55.trainLinearRegressionModelWithSGD. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 > (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot > be cast to java.lang.Double > at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119) > at > org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) > at > org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804) > at > org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
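The recommendation above ("allow users to use integers from Python") comes down to coercing the label to a float on the Python side before it is pickled, so the JVM unpickler always sees a Double. A minimal plain-Python sketch of the coercion (a stand-in class, not the actual PySpark `LabeledPoint`):

```python
# Sketch of the recommended fix: coerce int labels to float at construction
# time, so serialization never hands the JVM an Integer where a Double is
# expected. Plain-Python stand-in, not the real pyspark.mllib class.
class LabeledPoint:
    def __init__(self, label, features):
        # Coercing here avoids the Integer-cannot-be-cast-to-Double error
        # seen in the stack trace above.
        self.label = float(label)
        self.features = features

p = LabeledPoint(1, [0.0, 0.0, 0.0])  # int label accepted, stored as 1.0
```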
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Priority: Critical (was: Major) > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms in MLlib, > which use optimization algorithms such as gradient descent, LDA uses > sampling-based inference algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Target Version/s: 1.3.0 (was: 1.2.0) > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms in MLlib, > which use optimization algorithms such as gradient descent, LDA uses > sampling-based inference algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2199: - Assignee: Valeriy Avanesov > Distributed probabilistic latent semantic analysis in MLlib > --- > > Key: SPARK-2199 > URL: https://issues.apache.org/jira/browse/SPARK-2199 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Denis Turdakov >Assignee: Valeriy Avanesov > Labels: features > > Probabilistic latent semantic analysis (PLSA) is a topic model which extracts > topics from a text corpus. PLSA was historically a predecessor of LDA. However, > recent research shows that modifications of PLSA sometimes perform better > than LDA[1]. Furthermore, the most recent paper by the same authors shows that > there is a clear way to extend PLSA to LDA and beyond[2]. > We should implement distributed versions of PLSA. In addition it should be > possible to easily add user-defined regularizers or combinations of them. We > will implement regularizers that allow users to > * extract sparse topics > * extract human-interpretable topics > * perform semi-supervised training > * filter out non-topic-specific terms. > [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In > Proceedings of ECIR'13. > [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive > Regularization for Stochastic Matrix Factorization. > http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1473: - Target Version/s: 1.3.0 (was: 1.2.0) > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
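Information Gain, named above as the priority ranking method, is the reduction in label entropy after conditioning on a feature: IG(Y; X) = H(Y) - H(Y | X). A self-contained sketch over discrete feature/label pairs (illustrative only, not MLlib code):

```python
# Information gain as a feature-ranking score: IG(Y; X) = H(Y) - H(Y | X).
# Discrete features and labels; entropies in bits.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    n = len(labels)
    # Group labels by the value this feature takes on each instance.
    by_value = {}
    for x, y in zip(feature_values, labels):
        by_value.setdefault(x, []).append(y)
    # H(Y | X): entropy within each feature-value group, weighted by size.
    h_y_given_x = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - h_y_given_x

# A perfectly predictive feature recovers the full label entropy (1 bit here):
ig = information_gain([0, 0, 1, 1], ["neg", "neg", "pos", "pos"])
```

Ranking features by this score and keeping the top k is the filter-style reduction described above; wrapper methods instead re-train the model on candidate subsets, which is why they only suit lower-dimensional data.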
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Target Version/s: 1.3.0 (was: 1.2.0) > Support multi-model training in MLlib > - > > Key: SPARK-1486 > URL: https://issues.apache.org/jira/browse/SPARK-1486 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Burak Yavuz >Priority: Critical > > It is rare in practice to train just one model with a given set of > parameters. Usually, this is done by training multiple models with different > sets of parameters and then select the best based on their performance on the > validation set. MLlib should provide native support for multi-model > training/scoring. It requires decoupling of concepts like problem, > formulation, algorithm, parameter set, and model, which are missing in MLlib > now. MLI implements similar concepts, which we can borrow. There are > different approaches for multi-model training: > 0) Keep one copy of the data, and train models one after another (or maybe in > parallel, depending on the scheduler). > 1) Keep one copy of the data, and train multiple models at the same time > (similar to `runs` in KMeans). > 2) Make multiple copies of the data (still stored distributively), and use > more cores to distribute the work. > 3) Collect the data, make the entire dataset available on workers, and train > one or more models on each worker. > Users should be able to choose which execution mode they want to use. Note > that 3) could cover many use cases in practice when the training data is not > huge, e.g., <1GB. > This task will be divided into sub-tasks and this JIRA is created to discuss > the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2199: - Target Version/s: 1.3.0 (was: 1.2.0) > Distributed probabilistic latent semantic analysis in MLlib > --- > > Key: SPARK-2199 > URL: https://issues.apache.org/jira/browse/SPARK-2199 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Denis Turdakov > Labels: features > > Probabilistic latent semantic analysis (PLSA) is a topic model which extracts > topics from a text corpus. PLSA was historically a predecessor of LDA. > However, recent research shows that modifications of PLSA sometimes perform > better than LDA [1]. Furthermore, the most recent paper by the same authors > shows that there is a clear way to extend PLSA to LDA and beyond [2]. > We should implement distributed versions of PLSA. In addition, it should be > possible to easily add user-defined regularizers or combinations of them. We > will implement regularizers that allow us to > * extract sparse topics > * extract human-interpretable topics > * perform semi-supervised training > * sort out non-topic-specific terms. > [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In > Proceedings of ECIR'13. > [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive > Regularization for Stochastic Matrix Factorization. > http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
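For readers unfamiliar with the model the ticket proposes to distribute, here is a minimal single-machine PLSA via EM on a document-term count matrix. This illustrates only the basic model; the ticket asks for a distributed implementation with pluggable (additive) regularizers, which are omitted here.

```python
import numpy as np

def plsa(N, k, iters=50, seed=0):
    """Minimal PLSA EM. N is a (docs x words) count matrix; k is the number
    of topics. Returns p(w|z) with shape (k, words) and p(z|d) with shape
    (docs, k)."""
    rng = np.random.default_rng(seed)
    d, w = N.shape
    p_wz = rng.random((k, w)); p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_zd = rng.random((d, k)); p_zd /= p_zd.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities p(z|d,w) for every (doc, word) pair
        q = p_zd[:, :, None] * p_wz[None, :, :]            # (d, k, w)
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        nq = N[:, None, :] * q                             # (d, k, w)
        p_wz = nq.sum(axis=0); p_wz /= p_wz.sum(axis=1, keepdims=True) + 1e-12
        p_zd = nq.sum(axis=2); p_zd /= p_zd.sum(axis=1, keepdims=True) + 1e-12
    return p_wz, p_zd
```

The additive-regularization framework of [2] modifies the M-step by adding regularizer gradients before renormalizing, which is how sparse, interpretable, or semi-supervised variants would plug in.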
[jira] [Commented] (SPARK-4314) Exception throws when finding new files like intermediate result(_COPYING_ file) through hdfs interface
[ https://issues.apache.org/jira/browse/SPARK-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205477#comment-14205477 ] Sean Owen commented on SPARK-4314: -- To clarify, the suffix is _COPYING_ right? Yes, that's the standard naming for a destination file during a copy. It's renamed on completion. I think you're seeing that the file was present when the directory was listed but not when it was subsequently read. The file could also have been encountered in an intermediate, partly written state. Normally I'd say "just don't do that", but the point of streaming is to read a data source that is being written to with new files. I think this and other streaming processes that read HDFS should filter out "*._COPYING_" files, yes. > Exception throws when finding new files like intermediate result(_COPYING_ > file) through hdfs interface > --- > > Key: SPARK-4314 > URL: https://issues.apache.org/jira/browse/SPARK-4314 > Project: Spark > Issue Type: Bug >Reporter: maji2014 > > [Reproduce] > 1. Run the HdfsWordCount example, such as "ssc.textFileStream(args(0))" > 2. Upload a file to HDFS (reason as follows) > 3. Exception as follows. 
> [Exception stack] > 14/11/10 01:21:19 DEBUG Client: IPC Client (842425021) connection to > master/192.168.84.142:9000 from ocdc sending #13 > 14/11/10 01:21:19 ERROR JobScheduler: Error generating jobs for time > 1415611274000 ms > org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does > not exist: hdfs://master:9000/user/spark/200.COPYING > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:125) > at > org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD$1.apply(FileInputDStream.scala:124) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:124) > at > org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:83) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:40) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:40) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:291) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115) > at > org.apache.spark.streaming
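The filter Sean Owen suggests is straightforward: skip any listed path that still carries the `._COPYING_` suffix that `hdfs dfs -put` uses for an in-progress destination file. A minimal sketch (the function name and paths are illustrative, not Spark API):

```python
def is_complete_file(path):
    """Skip HDFS copy-in-progress destinations: `hdfs dfs -put` writes to
    `<name>._COPYING_` and renames it to `<name>` only when the copy ends."""
    return not path.endswith("._COPYING_")

# Example directory listing as a streaming batch might see it
paths = ["/data/200", "/data/201._COPYING_", "/data/_SUCCESS"]
ready = [p for p in paths if is_complete_file(p)]  # drops the in-progress file
```

In Spark this check would live in the path filter applied by FileInputDStream when it lists new files, so a half-copied file is simply picked up on a later batch under its final name.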
[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3181: - Assignee: Fan Jiang > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least-squares estimates assume the errors have a normal distribution > and can behave badly when the errors are heavy-tailed. In practice we > encounter various types of data, so we need to include Robust Regression to > employ a fitting criterion that is not as vulnerable as least squares. > In 1973, Huber introduced M-estimation for regression, which stands for > "maximum likelihood type". The method is resistant to outliers in the > response variable and has been widely used. > The new feature for MLlib will contain 3 new files > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
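The gradient a HuberRobustGradient-style class would compute can be sketched in numpy: quadratic loss near zero residual, linear in the tails, so outliers pull with bounded force. This is an illustrative sketch of the math only, not the proposed MLlib code.

```python
import numpy as np

def huber_gradient(w, X, y, delta=1.345):
    """Average Huber gradient for linear regression. Residuals with
    |r| <= delta contribute r (least-squares behavior); larger residuals
    contribute only delta*sign(r), capping an outlier's influence.
    delta=1.345 is the classic 95%-efficiency tuning constant."""
    r = X @ w - y
    g = np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    return X.T @ g / len(y)
```

Dropped into a gradient-descent or L-BFGS loop in place of the squared-error gradient, this yields the robust fit the ticket describes.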
[jira] [Updated] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2309: - Target Version/s: 1.3.0 (was: 1.2.0) > Generalize the binary logistic regression into multinomial logistic regression > -- > > Key: SPARK-2309 > URL: https://issues.apache.org/jira/browse/SPARK-2309 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > > Currently, there is no multi-class classifier in MLlib. Logistic regression > can be extended to a multinomial one straightforwardly. > The following formula will be implemented. > http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
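The generalization the ticket describes is the softmax (multinomial) negative log-likelihood gradient, which reduces to binary logistic regression when k=2. A single-machine numpy sketch of the math, not the MLlib implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlor_gradient(W, X, y, k):
    """Average gradient of the multinomial logistic loss. W is (features, k),
    X is (n, features), y holds integer class labels in [0, k)."""
    P = softmax(X @ W)        # (n, k) predicted class probabilities
    Y = np.eye(k)[y]          # one-hot encoding of the labels
    return X.T @ (P - Y) / len(y)
```

In a distributed setting each partition computes `X.T @ (P - Y)` on its rows and the driver sums the partials, which is the pattern MLlib's optimizers already use.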
[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3181: - Target Version/s: 1.3.0 (was: 1.2.0) > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least-squares estimates assume the errors have a normal distribution > and can behave badly when the errors are heavy-tailed. In practice we > encounter various types of data, so we need to include Robust Regression to > employ a fitting criterion that is not as vulnerable as least squares. > In 1973, Huber introduced M-estimation for regression, which stands for > "maximum likelihood type". The method is resistant to outliers in the > response variable and has been widely used. > The new feature for MLlib will contain 3 new files > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3218) K-Means clusterer can fail on degenerate data
[ https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3218: - Target Version/s: 1.3.0 (was: 1.2.0) > K-Means clusterer can fail on degenerate data > - > > Key: SPARK-3218 > URL: https://issues.apache.org/jira/browse/SPARK-3218 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Assignee: Derrick Burns > > The KMeans parallel implementation selects points to be cluster centers with > probability weighted by their distance to cluster centers. However, if there > are fewer than k DISTINCT points in the data set, this approach will fail. > Further, the recent checkin to work around this problem results in selection > of the same point repeatedly as a cluster center. > The fix is to allow fewer than k cluster centers to be selected. This > requires several changes to the code, as the number of cluster centers is > woven into the implementation. > I have a version of the code that addresses this problem, AND generalizes the > distance metric. However, I see that there are literally hundreds of > outstanding pull requests. If someone will commit to working with me to > sponsor the pull request, I will create it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
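The fix the ticket proposes, allowing fewer than k centers instead of selecting the same point repeatedly, can be sketched simply. This assumes plain uniform selection over distinct points; the real k-means|| distance-weighted sampling is omitted, and the function name is illustrative.

```python
import numpy as np

def init_centers(points, k, seed=0):
    """Return at most k initial centers, never more than the number of
    DISTINCT points, so degenerate data cannot force duplicate centers."""
    distinct = np.unique(points, axis=0)       # deduplicate rows
    k_eff = min(k, len(distinct))              # the ticket's proposed cap
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(distinct), size=k_eff, replace=False)
    return distinct[idx]
```

The downstream change the ticket mentions is that Lloyd's iterations and the returned model must then tolerate a center count below the requested k.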
[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types
[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205457#comment-14205457 ] Joseph K. Bradley commented on SPARK-4328: -- both related to Python API SerDe updates > Python serialization updates make Python ML API more brittle to types > - > > Key: SPARK-4328 > URL: https://issues.apache.org/jira/browse/SPARK-4328 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > In Spark 1.1, you could create a LabeledPoint with labels specified as > integers, and then use it with LinearRegression. This was broken by the > Python API updates since then. E.g., this code runs in the 1.1 branch but > not in the current master: > {code} > from pyspark.mllib.regression import * > import numpy > features = numpy.ndarray((3)) > data = sc.parallelize([LabeledPoint(1, features)]) > LinearRegressionWithSGD.train(data) > {code} > Recommendation: Allow users to use integers from Python. > The error message you get is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o55.trainLinearRegressionModelWithSGD. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 > (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot > be cast to java.lang.Double > at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119) > at > org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) > at > org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804) > at > org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
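Until the SerDe accepts integers, a user-side workaround is to coerce labels and features to float before building the LabeledPoint, so the JVM unpickler never sees a java.lang.Integer where it expects a Double. This is a hypothetical sketch of that coercion, not a fix in Spark itself:

```python
def as_double_label(label, features):
    """Coerce a label and its feature values to Python floats, which pickle
    to doubles on the JVM side (avoiding the Integer->Double ClassCastException)."""
    return float(label), [float(x) for x in features]

label, feats = as_double_label(1, [0, 2, 3])
```

The recommendation in the ticket is the better long-term fix: have the Scala-side pickler accept integral labels and widen them itself.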
[jira] [Commented] (SPARK-4328) Python serialization updates make Python ML API more brittle to types
[ https://issues.apache.org/jira/browse/SPARK-4328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205458#comment-14205458 ] Joseph K. Bradley commented on SPARK-4328: -- [~atalwalkar] Thanks for pointing this out! > Python serialization updates make Python ML API more brittle to types > - > > Key: SPARK-4328 > URL: https://issues.apache.org/jira/browse/SPARK-4328 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > In Spark 1.1, you could create a LabeledPoint with labels specified as > integers, and then use it with LinearRegression. This was broken by the > Python API updates since then. E.g., this code runs in the 1.1 branch but > not in the current master: > {code} > from pyspark.mllib.regression import * > import numpy > features = numpy.ndarray((3)) > data = sc.parallelize([LabeledPoint(1, features)]) > LinearRegressionWithSGD.train(data) > {code} > Recommendation: Allow users to use integers from Python. > The error message you get is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o55.trainLinearRegressionModelWithSGD. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in stage 3.0 > (TID 15, localhost): java.lang.ClassCastException: java.lang.Integer cannot > be cast to java.lang.Double > at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119) > at > org.apache.spark.mllib.api.python.SerDe$LabeledPointPickler.construct(PythonMLLibAPI.scala:727) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:84) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97) > at > org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:804) > at > org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:803) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1309) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:910) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org