[jira] [Commented] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246391#comment-14246391 ]

Apache Spark commented on SPARK-4848:
-------------------------------------

User 'nkronenfeld' has created a pull request for this issue:
https://github.com/apache/spark/pull/3699

> On a stand-alone cluster, several worker-specific variables are read only on the master
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-4848
>                 URL: https://issues.apache.org/jira/browse/SPARK-4848
>             Project: Spark
>          Issue Type: Bug
>          Components: Project Infra
>         Environment: stand-alone spark cluster
>            Reporter: Nathan Kronenfeld
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> On a stand-alone spark cluster, much of the determination of worker specifics, especially when a node has multiple worker instances, is done only on the master.
> The master loops over instances, and starts a worker per instance on each node.
> This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master.
> SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
Nathan Kronenfeld created SPARK-4848:
------------------------------------

             Summary: On a stand-alone cluster, several worker-specific variables are read only on the master
                 Key: SPARK-4848
                 URL: https://issues.apache.org/jira/browse/SPARK-4848
             Project: Spark
          Issue Type: Bug
          Components: Project Infra
         Environment: stand-alone spark cluster
            Reporter: Nathan Kronenfeld

On a stand-alone spark cluster, much of the determination of worker specifics, especially when a node has multiple worker instances, is done only on the master.

The master loops over instances, and starts a worker per instance on each node.

This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master.

SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment.
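A minimal Java sketch of the pattern the report describes (class and method names are illustrative, not Spark's actual launcher code): because the loop that decides how many workers to start runs on the master, only the master's environment is ever consulted, and per-node settings are silently ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MasterSideLaunch {
    /** Reads an integer setting from the given environment map, falling back to a default. */
    static int envInt(Map<String, String> env, String key, int dflt) {
        String v = env.get(key);
        return (v == null) ? dflt : Integer.parseInt(v);
    }

    /**
     * Illustrative only: the instance count is taken from the *master's* environment,
     * so per-node values of SPARK_WORKER_INSTANCES on the workers are never read.
     */
    static List<String> plannedWorkers(Map<String, String> masterEnv, List<String> nodes) {
        int instances = envInt(masterEnv, "SPARK_WORKER_INSTANCES", 1);
        List<String> launches = new ArrayList<>();
        for (String node : nodes) {
            for (int i = 1; i <= instances; i++) {
                launches.add(node + "/worker-" + i);
            }
        }
        return launches;
    }
}
```

With SPARK_WORKER_INSTANCES=2 in the master's environment, every node gets exactly two workers, whatever its own environment says.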
[jira] [Commented] (SPARK-4847) extraStrategies cannot take effect in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246380#comment-14246380 ]

Apache Spark commented on SPARK-4847:
-------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/3698

> extraStrategies cannot take effect in SQLContext
> ------------------------------------------------
>
>                 Key: SPARK-4847
>                 URL: https://issues.apache.org/jira/browse/SPARK-4847
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Saisai Shao
>
> Because `strategies` is initialized when the SparkPlanner is created, extraStrategies added afterwards never make it into `strategies`.
[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value
[ https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-4812:
--------------------------------
    Description:
The problem is that `codegenEnabled` is a `val` whose initializer reads another `val`, `sqlContext`, which can be overridden by subclasses. Here is a simple example to show this issue.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call the subclass's `sqlContext`, which has not yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar

scala>
{code}

We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.

  was:
The problem is that `codegenEnabled` is a `val` whose initializer reads another `val`, `sqlContext`, which can be overridden by subclasses.

{code}
[same example as above]
{code}

To fix it, we should override codegenEnabled in `InMemoryColumnarTableScan`.

> SparkPlan.codegenEnabled may be initialized to a wrong value
> ------------------------------------------------------------
>
>                 Key: SPARK-4812
>                 URL: https://issues.apache.org/jira/browse/SPARK-4812
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>
> The problem is that `codegenEnabled` is a `val` whose initializer reads another `val`, `sqlContext`, which can be overridden by subclasses.
> {code}
> [same example as above]
> {code}
> We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.
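The Scala snippet in this issue has a direct analogue in plain Java, since the underlying cause is JVM construction order: the superclass constructor runs before the subclass's fields are assigned, so any overridable member it consults still sees default values. A minimal sketch (illustrative names, not Spark code):

```java
public class InitOrder {
    static String observed; // records what the superclass constructor saw

    static abstract class Foo {
        protected String name() { return "Foo"; }
        final boolean enabled;

        Foo() {
            // Runs before Bar's fields are assigned, so the overriding
            // name() that reads a Bar field still returns null here.
            observed = name();
            enabled = (name() != null);
        }
    }

    static class Bar extends Foo {
        private final String barName;
        Bar() { barName = "Bar"; } // assigned only AFTER super() has finished
        @Override protected String name() { return barName; }
    }
}
```

This is exactly why making the member non-overridable (`final` in both languages) is the safe fix: a constructor can then only observe fully initialized state.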
[jira] [Created] (SPARK-4847) extraStrategies cannot take effect in SQLContext
Saisai Shao created SPARK-4847:
-------------------------------

             Summary: extraStrategies cannot take effect in SQLContext
                 Key: SPARK-4847
                 URL: https://issues.apache.org/jira/browse/SPARK-4847
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: Saisai Shao

Because `strategies` is initialized when the SparkPlanner is created, extraStrategies added afterwards never make it into `strategies`.
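A hedged Java sketch of the initialization-order bug described above (class and strategy names are illustrative, not Spark's SQL code): if the planner snapshots the extra-strategy list eagerly at construction time, strategies registered afterwards are silently dropped; re-reading the list lazily at planning time avoids this.

```java
import java.util.ArrayList;
import java.util.List;

public class PlannerInit {
    /** Buggy shape: the strategy list is frozen when the planner is constructed. */
    static class EagerPlanner {
        private final List<String> strategies;
        EagerPlanner(List<String> extra) {
            this.strategies = new ArrayList<>(extra); // snapshot: later additions are invisible
            this.strategies.add("BasicOperators");
        }
        List<String> strategies() { return strategies; }
    }

    /** Fixed shape: the extra strategies are consulted each time planning happens. */
    static class LazyPlanner {
        private final List<String> extra;
        LazyPlanner(List<String> extra) { this.extra = extra; }
        List<String> strategies() {
            List<String> all = new ArrayList<>(extra); // re-read at planning time
            all.add("BasicOperators");
            return all;
        }
    }
}
```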
[jira] [Updated] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
[ https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza updated SPARK-4843:
------------------------------
    Assignee: Kostas Sakellis

> Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
> -------------------------------------------------------------------------
>
>                 Key: SPARK-4843
>                 URL: https://issues.apache.org/jira/browse/SPARK-4843
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Kostas Sakellis
>            Assignee: Kostas Sakellis
>
> ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash the unnecessary hierarchy.
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246337#comment-14246337 ]

Apache Spark commented on SPARK-4846:
-------------------------------------

User 'jinntrance' has created a pull request for this issue:
https://github.com/apache/spark/pull/3697

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4846
>                 URL: https://issues.apache.org/jira/browse/SPARK-4846
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.0
>         Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million.
>            Reporter: Joseph Tang
>            Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>     at java.util.Arrays.copyOf(Arrays.java:2271)
>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
>     at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
>     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
>     at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
>     at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>     at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
[jira] [Closed] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Twinkle Sachdeva closed SPARK-2604.
-----------------------------------
    Resolution: Fixed

> Spark Application hangs on yarn in edge case scenario of executor memory requirement
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-2604
>                 URL: https://issues.apache.org/jira/browse/SPARK-2604
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Twinkle Sachdeva
>
> In a YARN environment, let's say:
> MaxAM = maximum allocatable memory
> ExecMem = executor's memory
> If (MaxAM >= ExecMem && (MaxAM - ExecMem) < 384m), then the maximum-resource validation w.r.t. executor memory passes (it does not account for the ~384m overhead) and the application master gets launched; but when resources are allocated and validated again, they are returned, and the application appears to hang.
> The typical way to hit this is to ask for executor memory equal to the maximum allowed memory per the YARN config.
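A hedged sketch of the overhead arithmetic behind this issue (the 384 MB figure comes from the report; method names are illustrative, not Spark's YARN code): a container request is only satisfiable if executor memory *plus* overhead fits within YARN's maximum allocation, which a naive check that ignores the overhead will miss.

```java
public class YarnMemCheck {
    static final int OVERHEAD_MB = 384; // per-container memory overhead, per the report

    /** Naive check: compares only the requested executor memory to YARN's max. */
    static boolean naiveValidation(int maxAllocMb, int execMemMb) {
        return execMemMb <= maxAllocMb;
    }

    /** What must actually hold for a container to ever be granted. */
    static boolean satisfiable(int maxAllocMb, int execMemMb) {
        return execMemMb + OVERHEAD_MB <= maxAllocMb;
    }
}
```

Requesting the YARN maximum (e.g. 8192 MB on an 8192 MB limit) passes the naive check, so the application master launches, but no executor container can ever be granted: the hang described above.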
[jira] [Created] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
Joseph Tang created SPARK-4846:
-------------------------------

             Summary: When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"
                 Key: SPARK-4846
                 URL: https://issues.apache.org/jira/browse/SPARK-4846
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.1.0
         Environment: Use Word2Vec to process a corpus (sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million.
            Reporter: Joseph Tang
            Priority: Critical

Exception in thread "Driver" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
    at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
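Back-of-the-envelope arithmetic for the failure above (a sketch; the vocabulary size is from the environment description, while the vector dimensionality of 100 is an assumed value): a flat float array of word vectors for ~10 million words is on the order of 4 GB when written out, which exceeds the JVM's maximum array length of just under Integer.MAX_VALUE elements — hence "Requested array size exceeds VM limit" inside ByteArrayOutputStream during serialization.

```java
public class Word2VecSize {
    public static void main(String[] args) {
        long vocabSize = 10_000_000L; // ~10 million words, per the report
        long vectorSize = 100;        // assumed vector dimensionality
        long bytesPerFloat = 4;

        // Approximate serialized size of the flat float[] of word vectors
        // (ignoring per-object overhead).
        long bytes = vocabSize * vectorSize * bytesPerFloat;

        System.out.println("approx bytes: " + bytes); // 4000000000
        System.out.println("fits in one byte[]: " + (bytes <= Integer.MAX_VALUE));
    }
}
```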
[jira] [Commented] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
[ https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246318#comment-14246318 ]

Apache Spark commented on SPARK-4843:
-------------------------------------

User 'ksakellis' has created a pull request for this issue:
https://github.com/apache/spark/pull/3696

> Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
> -------------------------------------------------------------------------
>
>                 Key: SPARK-4843
>                 URL: https://issues.apache.org/jira/browse/SPARK-4843
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Kostas Sakellis
>
> ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash the unnecessary hierarchy.
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246289#comment-14246289 ]

Apache Spark commented on SPARK-4826:
-------------------------------------

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/3695

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4826
>                 URL: https://issues.apache.org/jira/browse/SPARK-4826
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Josh Rosen
>            Assignee: Tathagata Das
>              Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite where four tests failed with the same exception.
> [Link to test result (this will eventually break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
> java.lang.IllegalStateException: File exists and there is no append support!
>     at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>     at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>     at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>     at org.apache.spark.streaming.util.WriteAheadLogWriter.<init>(WriteAheadLogWriter.scala:42)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>     at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>     at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>     at org.scalatest.Transformer.apply(Transformer.scala:22)
>     at org.scalatest.Transformer.apply(Transformer.scala:20)
>     at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>     at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>     at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>     at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>     at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>     at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>     at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>     at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>     at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>     at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>     at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>     at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>     at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>     at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>     at org.scalatest.SuperEngine.run
[jira] [Updated] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangfei updated SPARK-4845:
---------------------------
    Description:
Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is:

Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD)

The ratio is 1.0 by default to keep the behavior compatible with older versions. Once we have good experience with it, we can change this.

  was: [identical to the description above]

> Adding a parallelismRatio to control the partitions num of shuffledRDD
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4845
>                 URL: https://issues.apache.org/jira/browse/SPARK-4845
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: wangfei
>             Fix For: 1.3.0
>
> Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is:
> Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD)
> The ratio is 1.0 by default to keep the behavior compatible with older versions. Once we have good experience with it, we can change this.
[jira] [Commented] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246286#comment-14246286 ]

Apache Spark commented on SPARK-4845:
-------------------------------------

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3694

> Adding a parallelismRatio to control the partitions num of shuffledRDD
> ----------------------------------------------------------------------
>
>                 Key: SPARK-4845
>                 URL: https://issues.apache.org/jira/browse/SPARK-4845
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: wangfei
>             Fix For: 1.3.0
>
> Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is:
> Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD)
> The ratio is 1.0 by default to keep the behavior compatible with older versions. Once we have good experience with it, we can change this.
[jira] [Created] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD
wangfei created SPARK-4845:
---------------------------

             Summary: Adding a parallelismRatio to control the partitions num of shuffledRDD
                 Key: SPARK-4845
                 URL: https://issues.apache.org/jira/browse/SPARK-4845
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.1.0
            Reporter: wangfei
             Fix For: 1.3.0

Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is:

Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD)

The ratio is 1.0 by default to keep the behavior compatible with older versions. Once we have good experience with it, we can change this.
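The proposed rule, written out as a hedged one-liner (the method name is illustrative; only the Math.max expression comes from the issue):

```java
public class ParallelismRatio {
    /** Partitions for a shuffled RDD: scale the largest upstream count, floored at 1. */
    static int shufflePartitions(double parallelismRatio, int largestUpstreamPartitions) {
        return Math.max(1, (int) (parallelismRatio * largestUpstreamPartitions));
    }
}
```

The floor at 1 matters for small ratios: without it, a ratio like 0.1 applied to a 5-partition upstream RDD would request zero partitions.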
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246270#comment-14246270 ]

Hari Shreedharan commented on SPARK-4826:
-----------------------------------------

I suspect that nextString is producing file names that collide (createTempDir itself is atomic, but the names generated within it are not guaranteed unique). Using monotonically increasing names from a file counter will likely fix the issue.

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4826
>                 URL: https://issues.apache.org/jira/browse/SPARK-4826
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Josh Rosen
>            Assignee: Tathagata Das
>              Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite where four tests failed with the same exception.
> [Link to test result (this will eventually break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
> java.lang.IllegalStateException: File exists and there is no append support!
>     at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>     at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>     at org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>     at org.apache.spark.streaming.util.WriteAheadLogWriter.<init>(WriteAheadLogWriter.scala:42)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>     at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>     at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>     at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>     at org.scalatest.Transformer.apply(Transformer.scala:22)
>     at org.scalatest.Transformer.apply(Transformer.scala:20)
>     at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>     at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>     at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>     at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>     at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>     at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>     at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>     at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>     at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>     at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>     at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>     at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>     at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>     at org.scalatest
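A hedged sketch of the fix proposed in the comment above (names are illustrative, not the test suite's actual code): deriving log-file names from a monotonically increasing counter makes collisions impossible within a process, unlike randomly generated strings.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LogFileNames {
    private final AtomicInteger counter = new AtomicInteger(0);

    /** Each call yields a distinct name, even under concurrent use. */
    String next(String dir) {
        return dir + "/log-" + counter.incrementAndGet();
    }
}
```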
[jira] [Created] (SPARK-4844) SGD should support custom sampling.
Guoqiang Li created SPARK-4844:
-------------------------------

             Summary: SGD should support custom sampling.
                 Key: SPARK-4844
                 URL: https://issues.apache.org/jira/browse/SPARK-4844
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Guoqiang Li
             Fix For: 1.3.0
[jira] [Commented] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
[ https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246254#comment-14246254 ]

Kostas Sakellis commented on SPARK-4843:
----------------------------------------

I'm working on this.

> Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
> -------------------------------------------------------------------------
>
>                 Key: SPARK-4843
>                 URL: https://issues.apache.org/jira/browse/SPARK-4843
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Kostas Sakellis
>
> ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash the unnecessary hierarchy.
[jira] [Created] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
Kostas Sakellis created SPARK-4843:
-----------------------------------

             Summary: Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
                 Key: SPARK-4843
                 URL: https://issues.apache.org/jira/browse/SPARK-4843
             Project: Spark
          Issue Type: Improvement
            Reporter: Kostas Sakellis

ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash the unnecessary hierarchy.
[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246135#comment-14246135 ]

Apache Spark commented on SPARK-4814:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3692

> Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-4814
>                 URL: https://issues.apache.org/jira/browse/SPARK-4814
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.1.0
>            Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running in Maven, in part because a Java test actually fails with {{AssertionError}}. That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, localhost): java.lang.AssertionError
> [info]   at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]   at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]   at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]   at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]   at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]   at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]   at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]   at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]   at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions explicitly turned on too.
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246122#comment-14246122 ] Nicholas Chammas commented on SPARK-4826: - I just cooked up a quick way of invoking this test multiple times in parallel using [GNU parallel|http://www.gnu.org/software/parallel/]: {code} parallel 'sbt/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive -Phive-thriftserver "testOnly org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite"' ::: '' '' '' '' {code} This will fire up 4 copies of that one test in parallel. I ran it a couple of times on my laptop without reproducing the failure, but that appears to be because some sbt locking prevents the tests from actually running in parallel. I've [posted a question on Stack Overflow|http://stackoverflow.com/questions/27474000/how-can-i-run-multiple-copies-of-the-same-test-in-parallel] about this. > Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: > "java.lang.IllegalStateException: File exists and there is no append support!" > > > Key: SPARK-4826 > URL: https://issues.apache.org/jira/browse/SPARK-4826 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0, 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite > where four tests failed with the same exception. > [Link to test result (this will eventually > break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. 
> In case that link breaks: > The failed tests: > {code} > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in block manager, not in write ahead log > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, not in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, and test storing in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > with partially available in block manager, and rest in write ahead log > {code} > The error messages are all (essentially) the same: > {code} > java.lang.IllegalStateException: File exists and there is no append > support! > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalate
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246091#comment-14246091 ] Nicholas Chammas commented on SPARK-4826: - This raises an interesting test infrastructure question: Do we have a way of invoking multiple copies of the same test (in the same JVM or across multiple JVMs) to check a test's level of isolation? If not, that might be a good thing to look into. > Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: > "java.lang.IllegalStateException: File exists and there is no append support!" > > > Key: SPARK-4826 > URL: https://issues.apache.org/jira/browse/SPARK-4826 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0, 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite > where four tests failed with the same exception. > [Link to test result (this will eventually > break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. > In case that link breaks: > The failed tests: > {code} > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in block manager, not in write ahead log > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, not in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, and test storing in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > with partially available in block manager, and rest in write ahead log > {code} > The error messages are all (essentially) the same: > {code} > java.lang.IllegalStateException: File exists and there is no append > support! 
> at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:
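Nicholas's isolation question above can be sketched directly. The harness below is hypothetical (it is not existing Spark test infrastructure, and the {{WriteAheadLogBackedBlockRDDSuite}} call site is shown only for illustration):

{code}
// Rough sketch: run the same test body several times concurrently in one JVM
// to probe its isolation. Await.result re-throws the first failure from any copy.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrentTestProbe {
  def runConcurrently(copies: Int)(body: => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(copies)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      Await.result(Future.sequence((1 to copies).map(_ => Future(body))), 10.minutes)
    } finally pool.shutdown()
  }
}

// Hypothetical usage:
// ConcurrentTestProbe.runConcurrently(4) { new WriteAheadLogBackedBlockRDDSuite().execute() }
{code}

Under such a harness, tests that collide on a shared fixed file path would surface the same kind of "File exists" failure, while a well-isolated suite would pass.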
[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value
[ https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4812: Assignee: Shixiong Zhu > SparkPlan.codegenEnabled may be initialized to a wrong value > > > Key: SPARK-4812 > URL: https://issues.apache.org/jira/browse/SPARK-4812 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The problem is that `codegenEnabled` is a strict `val`, but it reads the `val` `sqlContext`, > which can be overridden by subclasses. Here is a simple example to show this > issue. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > abstract class Foo { > protected val sqlContext = "Foo" > val codegenEnabled: Boolean = { > println(sqlContext) // it will call the subclass's `sqlContext`, which has not > yet been initialized. > if (sqlContext != null) { > true > } else { > false > } > } > } > class Bar extends Foo { > override val sqlContext = "Bar" > } > println(new Bar().codegenEnabled) > // Exiting paste mode, now interpreting. > null > false > defined class Foo > defined class Bar > scala> > {code} > To fix it, we should override `codegenEnabled` in `InMemoryColumnarTableScan`.
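For reference, one common fix for this class of initialization-order bug (an alternative to the override suggested in the ticket, sketched only against the toy example above) is to make the derived value lazy, so it is evaluated after the subclass's fields are initialized:

{code}
// Variant of the example above: with `lazy val`, `codegenEnabled` is computed
// on first access, i.e. after `Bar` is fully constructed, so it sees "Bar".
abstract class Foo {
  protected val sqlContext = "Foo"
  lazy val codegenEnabled: Boolean = sqlContext != null
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled) // true (the strict-`val` version prints false)
{code}

Compiling with scalac's `-Xcheckinit` flag also turns this kind of access to an uninitialized field into an early runtime error rather than a silent `null`.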
[jira] [Closed] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows
[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Boesch closed SPARK-4775. - Resolution: Not a Problem > Possible problem in a simple join? Getting duplicate rows and missing rows > --- > > Key: SPARK-4775 > URL: https://issues.apache.org/jira/browse/SPARK-4775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Run on Mac but should be agnostic >Reporter: Stephen Boesch >Assignee: Michael Armbrust > > I am working on testing of HBase joins. As part of this work some simple > vanilla SparkSQL tests were created. Some of the results are surprising: > here are the details: > > Consider the following schema that includes two columns: > {code} > case class JoinTable2Cols(intcol: Int, strcol: String) > {code} > Let us register two temp tables using this schema and insert 2 rows and 4 > rows respectively: > {code} > val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, > s"valA$ix")}) > rdd1.registerTempTable("SparkJoinTable1") > val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4)) > val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, > s"valB$is")}) > val table2 = rdd2.registerTempTable("SparkJoinTable2") > {code} > Here is the data in both tables: > {code} > Table1 Contents: > [1,valA1] > [2,valA2] > Table2 Contents: > [1,valB1] > [1,valB2] > [2,valB3] > [2,valB4] > {code} > Now let us join the tables on the first column: > {code} > select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, > t2.strcol t2strcol from SparkJoinTable1 t1 JOIN > SparkJoinTable2 t2 on t1.intcol = t2.intcol > {code} > What results do we get: > came back with 4 results > {code} > Results > [1,1,valA1,valB2] > [1,1,valA1,valB2] > [2,2,valA2,valB4] > [2,2,valA2,valB4] > {code} > Huh?? > Where did valB1 and valB3 go? Why do we have duplicate rows? 
> Note: the expected results were: > {code} > Seq(1, 1, "valA1", "valB1"), > Seq(1, 1, "valA1", "valB2"), > Seq(2, 2, "valA2", "valB3"), > Seq(2, 2, "valA2", "valB4")) > {code} > A standalone testing program, SparkSQLJoinSuite, is attached. An abridged > version of the actual output is also attached.
[jira] [Commented] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows
[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246084#comment-14246084 ] Stephen Boesch commented on SPARK-4775: --- Thanks v much Michael. You hit the nail on the head. I will update our internal code here to remove that antipattern. Issue is being closed. > Possible problem in a simple join? Getting duplicate rows and missing rows > --- > > Key: SPARK-4775 > URL: https://issues.apache.org/jira/browse/SPARK-4775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Run on Mac but should be agnostic >Reporter: Stephen Boesch >Assignee: Michael Armbrust > > I am working on testing of HBase joins. As part of this work some simple > vanilla SparkSQL tests were created. Some of the results are surprising: > here are the details: > > Consider the following schema that includes two columns: > {code} > case class JoinTable2Cols(intcol: Int, strcol: String) > {code} > Let us register two temp tables using this schema and insert 2 rows and 4 > rows respectively: > {code} > val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, > s"valA$ix")}) > rdd1.registerTempTable("SparkJoinTable1") > val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4)) > val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, > s"valB$is")}) > val table2 = rdd2.registerTempTable("SparkJoinTable2") > {code} > Here is the data in both tables: > {code} > Table1 Contents: > [1,valA1] > [2,valA2] > Table2 Contents: > [1,valB1] > [1,valB2] > [2,valB3] > [2,valB4] > {code} > Now let us join the tables on the first column: > {code} > select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, > t2.strcol t2strcol from SparkJoinTable1 t1 JOIN > SparkJoinTable2 t2 on t1.intcol = t2.intcol > {code} > What results do we get: > came back with 4 results > {code} > Results > [1,1,valA1,valB2] > [1,1,valA1,valB2] > [2,2,valA2,valB4] > [2,2,valA2,valB4] > {code} > 
Huh?? > Where did valB1 and valB3 go? Why do we have duplicate rows? > Note: the expected results were: > {code} > Seq(1, 1, "valA1", "valB1"), > Seq(1, 1, "valA1", "valB2"), > Seq(2, 2, "valA2", "valB3"), > Seq(2, 2, "valA2", "valB4")) > {code} > A standalone testing program, SparkSQLJoinSuite, is attached. An abridged > version of the actual output is also attached.
[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value
[ https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4812: Target Version/s: 1.3.0 > SparkPlan.codegenEnabled may be initialized to a wrong value > > > Key: SPARK-4812 > URL: https://issues.apache.org/jira/browse/SPARK-4812 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu > > The problem is that `codegenEnabled` is a strict `val`, but it reads the `val` `sqlContext`, > which can be overridden by subclasses. Here is a simple example to show this > issue. > {code} > scala> :paste > // Entering paste mode (ctrl-D to finish) > abstract class Foo { > protected val sqlContext = "Foo" > val codegenEnabled: Boolean = { > println(sqlContext) // it will call the subclass's `sqlContext`, which has not > yet been initialized. > if (sqlContext != null) { > true > } else { > false > } > } > } > class Bar extends Foo { > override val sqlContext = "Bar" > } > println(new Bar().codegenEnabled) > // Exiting paste mode, now interpreting. > null > false > defined class Foo > defined class Bar > scala> > {code} > To fix it, we should override `codegenEnabled` in `InMemoryColumnarTableScan`.
[jira] [Updated] (SPARK-4684) Add a script to run JDBC server on Windows
[ https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4684: Target Version/s: 1.3.0 > Add a script to run JDBC server on Windows > -- > > Key: SPARK-4684 > URL: https://issues.apache.org/jira/browse/SPARK-4684 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Matei Zaharia >Assignee: Cheng Lian >Priority: Minor >
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246041#comment-14246041 ] Hari Shreedharan commented on SPARK-4826: - It looks like there is some issue with the directories/files already existing (though we use random names for files/dirs). I will try to get something ready later today. > Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: > "java.lang.IllegalStateException: File exists and there is no append support!" > > > Key: SPARK-4826 > URL: https://issues.apache.org/jira/browse/SPARK-4826 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0, 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite > where four tests failed with the same exception. > [Link to test result (this will eventually > break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/]. > In case that link breaks: > The failed tests: > {code} > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in block manager, not in write ahead log > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, not in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, and test storing in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > with partially available in block manager, and rest in write ahead log > {code} > The error messages are all (essentially) the same: > {code} > java.lang.IllegalStateException: File exists and there is no append > support! 
> at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTests
[jira] [Commented] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows
[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246029#comment-14246029 ] Michael Armbrust commented on SPARK-4775: - I only scanned over the code quickly, but I think the problem is likely that you are calling "toRDD". This function is a developer API not intended for users, and it is documented as "Internal version of the RDD. Avoids copies and has no schema". If you use it directly without defensively copying rows, you'll see weird repeated rows. Instead, just use the SchemaRDD as an RDD and we'll do the copying for you. > Possible problem in a simple join? Getting duplicate rows and missing rows > --- > > Key: SPARK-4775 > URL: https://issues.apache.org/jira/browse/SPARK-4775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Run on Mac but should be agnostic >Reporter: Stephen Boesch >Assignee: Michael Armbrust > > I am working on testing of HBase joins. As part of this work some simple > vanilla SparkSQL tests were created. 
Some of the results are surprising: > here are the details: > > Consider the following schema that includes two columns: > {code} > case class JoinTable2Cols(intcol: Int, strcol: String) > {code} > Let us register two temp tables using this schema and insert 2 rows and 4 > rows respectively: > {code} > val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, > s"valA$ix")}) > rdd1.registerTempTable("SparkJoinTable1") > val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4)) > val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, > s"valB$is")}) > val table2 = rdd2.registerTempTable("SparkJoinTable2") > {code} > Here is the data in both tables: > {code} > Table1 Contents: > [1,valA1] > [2,valA2] > Table2 Contents: > [1,valB1] > [1,valB2] > [2,valB3] > [2,valB4] > {code} > Now let us join the tables on the first column: > {code} > select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, > t2.strcol t2strcol from SparkJoinTable1 t1 JOIN > SparkJoinTable2 t2 on t1.intcol = t2.intcol > {code} > What results do we get: > came back with 4 results > {code} > Results > [1,1,valA1,valB2] > [1,1,valA1,valB2] > [2,2,valA2,valB4] > [2,2,valA2,valB4] > {code} > Huh?? > Where did valB1 and valB3 go? Why do we have duplicate rows? > Note: the expected results were: > {code} > Seq(1, 1, "valA1", "valB1"), > Seq(1, 1, "valA1", "valB2"), > Seq(2, 2, "valA2", "valB3"), > Seq(2, 2, "valA2", "valB4")) > {code} > A standalone testing program is attached SparkSQLJoinSuite. An abridged > version of the actual output is also attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
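The copy-avoidance effect Michael describes can be reproduced without Spark at all. In this sketch, `Record` is a hypothetical stand-in for a reused internal row, not a Spark class:

{code}
// One mutable object is handed back for every element, so collecting the
// iterator without copying yields N references to the same, last-written value:
// the "duplicate rows" effect of consuming the internal, copy-avoiding RDD.
class Record(var value: Int)

val reused = new Record(0)
val noCopy = Iterator(1, 2, 3).map { v => reused.value = v; reused }
println(noCopy.toArray.map(_.value).mkString(",")) // prints 3,3,3

// Defensive copying (what the public SchemaRDD interface does for you)
// restores the expected distinct rows.
val copied = Iterator(1, 2, 3).map(v => new Record(v))
println(copied.toArray.map(_.value).mkString(",")) // prints 1,2,3
{code}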
[jira] [Assigned] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows
[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-4775: --- Assignee: Michael Armbrust (was: Cheng Lian) > Possible problem in a simple join? Getting duplicate rows and missing rows > --- > > Key: SPARK-4775 > URL: https://issues.apache.org/jira/browse/SPARK-4775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Run on Mac but should be agnostic >Reporter: Stephen Boesch >Assignee: Michael Armbrust > > I am working on testing of HBase joins. As part of this work some simple > vanilla SparkSQL tests were created. Some of the results are surprising: > here are the details: > > Consider the following schema that includes two columns: > {code} > case class JoinTable2Cols(intcol: Int, strcol: String) > {code} > Let us register two temp tables using this schema and insert 2 rows and 4 > rows respectively: > {code} > val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, > s"valA$ix")}) > rdd1.registerTempTable("SparkJoinTable1") > val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4)) > val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, > s"valB$is")}) > val table2 = rdd2.registerTempTable("SparkJoinTable2") > {code} > Here is the data in both tables: > {code} > Table1 Contents: > [1,valA1] > [2,valA2] > Table2 Contents: > [1,valB1] > [1,valB2] > [2,valB3] > [2,valB4] > {code} > Now let us join the tables on the first column: > {code} > select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, > t2.strcol t2strcol from SparkJoinTable1 t1 JOIN > SparkJoinTable2 t2 on t1.intcol = t2.intcol > {code} > What results do we get: > came back with 4 results > {code} > Results > [1,1,valA1,valB2] > [1,1,valA1,valB2] > [2,2,valA2,valB4] > [2,2,valA2,valB4] > {code} > Huh?? > Where did valB1 and valB3 go? Why do we have duplicate rows? 
> Note: the expected results were: > {code} > Seq(1, 1, "valA1", "valB1"), > Seq(1, 1, "valA1", "valB2"), > Seq(2, 2, "valA2", "valB3"), > Seq(2, 2, "valA2", "valB4")) > {code} > A standalone testing program, SparkSQLJoinSuite, is attached. An abridged > version of the actual output is also attached.
[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4814: Target Version/s: 1.3.0 > Enable assertions in SBT, Maven tests / AssertionError from Hive's > LazyBinaryInteger > > > Key: SPARK-4814 > URL: https://issues.apache.org/jira/browse/SPARK-4814 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.1.0 >Reporter: Sean Owen > > Follow up to SPARK-4159, wherein we noticed that Java tests weren't running > in Maven, in part because a Java test actually fails with {{AssertionError}}. > That code/test was fixed in SPARK-4850. > The reason it wasn't caught by SBT tests was that they don't run with > assertions on, and Maven's surefire does. > Turning on assertions in the SBT build is trivial, adding one line: > {code} > javaOptions in Test += "-ea", > {code} > This reveals a test failure in Scala test suites though: > {code} > [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) > [info] Failed to execute query using catalyst: > [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 > failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, > localhost): java.lang.AssertionError > [info]at > org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51) > [info]at > org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) > [info]at > org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) > [info]at > org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) > [info]at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) > [info]at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) > [info]at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > [info]at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) > [info]at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) > [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) > [info]at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > [info]at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > [info]at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > [info]at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) > [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > [info]at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > [info]at org.apache.spark.scheduler.Task.run(Task.scala:56) > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [info]at java.lang.Thread.run(Thread.java:745) > {code} > The items for this JIRA are therefore: > - Enable assertions in SBT > - Fix this failure > - Figure out why Maven scalatest didn't trigger it - may need assertions > explicitly turned on too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246024#comment-14246024 ] Michael Armbrust commented on SPARK-4814: - Either way, I don't think we should block turning on assertions for the rest of Spark.
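The root cause above is that JVM `assert` checks are disabled unless the process runs with `-ea`, so SBT tests silently skipped the failing check while Maven's surefire did not. As an illustrative aside (not Spark code), Python has the mirror-image behavior: `assert` is active by default and is stripped only under `python -O`. A minimal sketch for detecting the current assertion status:

```python
# Sketch: report whether `assert` statements are executed in this process.
# Under a normal `python` invocation this returns True; under `python -O`
# the assert below is elided at compile time and it returns False, which
# always matches the builtin __debug__ flag.

def assertions_active() -> bool:
    """Return True iff `assert` statements are executed in this process."""
    active = False
    try:
        assert False  # stripped entirely under `python -O`
    except AssertionError:
        active = True
    return active

if __name__ == "__main__":
    print(assertions_active())
```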
[jira] [Updated] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4782: Affects Version/s: (was: 1.3.0) > Add inferSchema support for RDD[Map[String, Any]] > - > > Key: SPARK-4782 > URL: https://issues.apache.org/jira/browse/SPARK-4782 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jianshi Huang >Priority: Minor > > The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to > be converting each Map to a JSON String first and using JsonRDD.inferSchema on it. > It's very inefficient. > Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for > schemaless data, as adding a Map-like interface to any serialization format is > easy. > So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new > serialization format we want to support, we just need to add a Map interface > wrapper to it.* > Jianshi
[jira] [Updated] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4782: Target Version/s: 1.3.0
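The kind of schema inference the reporter asks for can be sketched directly over dictionaries, without the JSON round-trip. This is a hedged illustration of the idea, not Spark's `JsonRDD.inferSchema` implementation; the function name and widening rule are assumptions:

```python
# Sketch: infer a column -> type-name mapping from a collection of dicts,
# merging the types observed for each key across rows. A key with
# conflicting types widens to "string", loosely mirroring how JSON-based
# inference resolves type conflicts.

def infer_schema(rows):
    """Return {column: type name} inferred from an iterable of dicts."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {k: v.pop() if len(v) == 1 else "string" for k, v in seen.items()}

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b", "score": 1.5}]
print(infer_schema(rows))  # {'id': 'int', 'name': 'str', 'score': 'float'}
```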
[jira] [Commented] (SPARK-4838) StackOverflowError when serializing task
[ https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246023#comment-14246023 ] Michael Armbrust commented on SPARK-4838: - Yeah, any more detail you can provide would be very helpful. I have successfully run queries with 20,000+ partitions using HadoopRDD on Spark 1.1. > StackOverflowError when serializing task > -- > > Key: SPARK-4838 > URL: https://issues.apache.org/jira/browse/SPARK-4838 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.1.0 >Reporter: Hong Shen > > When running a sql with more than 2000 partitions, each partition a HadoopRDD, > it will cause a java.lang.StackOverflowError when serializing the task. > The error message from Spark is: Job aborted due to stage failure: Task > serialization failed: java.lang.StackOverflowError > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > ..
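The repeating `writeObject0` frames in the trace are the signature of a recursive serializer walking a very deep object graph, which is what a task referencing thousands of chained partitions produces. As a hedged cross-language analogy (not the Spark code path), Python's `pickle` fails the same way on deep nesting:

```python
# Sketch: recursive serializers overflow once the object graph is deep
# enough. Python raises RecursionError, the analogue of the JVM's
# java.lang.StackOverflowError from ObjectOutputStream above.
import pickle


def deeply_nested(depth):
    """Build a list nested `depth` levels deep, iteratively."""
    obj = []
    for _ in range(depth):
        obj = [obj]
    return obj


try:
    pickle.dumps(deeply_nested(100_000))
    outcome = "serialized"
except RecursionError:
    outcome = "stack exhausted"
print(outcome)  # stack exhausted
```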
[jira] [Commented] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip
[ https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245996#comment-14245996 ] Xiangrui Meng commented on SPARK-4841: -- This is the commit that caused the bug: 786e75b33f0bc1445bfc289fe4b62407cb79026e (SPARK-3886) > Batch serializer bug in PySpark's RDD.zip > - > > Key: SPARK-4841 > URL: https://issues.apache.org/jira/browse/SPARK-4841 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Xiangrui Meng > > {code} > t = sc.textFile("README.md") > t.zip(t).count() > {code} > {code} > Py4JJavaError Traceback (most recent call last) > in () > > 1 readme.zip(readme).count() > /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self) > 817 3 > 818 """ > --> 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 820 > 821 def stats(self): > /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self) > 808 6.0 > 809 """ > --> 810 return self.mapPartitions(lambda x: > [sum(x)]).reduce(operator.add) > 811 > 812 def count(self): > /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f) > 713 yield reduce(f, iterator, initial) > 714 > --> 715 vals = self.mapPartitions(func).collect() > 716 if vals: > 717 return reduce(f, vals) > /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 298 raise 
Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o69.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback > (most recent call last): > File "/Users/meng/src/spark/python/pyspark/worker.py", line 107, in main > process() > File "/Users/meng/src/spark/python/pyspark/worker.py", line 98, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/meng/src/spark/python/pyspark/serializers.py", line 198, in > dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/Users/meng/src/spark/python/pyspark/serializers.py", line 81, in > dump_stream > raise NotImplementedError > NotImplementedError > at > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137) > at > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:174) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242) > at > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) > at > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460) > at > 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$a
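Behind the `NotImplementedError` above: PySpark moves elements between the JVM and the Python worker in serialized batches, and `zip` pairs the two streams up batch by batch, so both sides must use compatible batch serializers. A hedged sketch of the alignment requirement (illustrative only, not PySpark's serializer classes):

```python
# Sketch: zipping two batched streams is elementwise only when both sides
# were batched the same way. With equal batch sizes the batches line up
# and flattening the per-batch zips reproduces a plain elementwise zip.

def batched(items, size):
    """Yield consecutive slices of `items` of length `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


left = list(range(6))
right = [x * 10 for x in left]

aligned = [
    pair
    for lb, rb in zip(batched(left, 2), batched(right, 2))
    for pair in zip(lb, rb)
]
print(aligned)  # [(0, 0), (1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
```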
[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"
[ https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245978#comment-14245978 ] Josh Rosen commented on SPARK-4826: --- I committed a documentation typo fix to {{master}} and {{branch-1.2}} at the same time, which caused a huge number of Maven builds to kick off simultaneously in Jenkins (since it was otherwise idle), and all of these builds failed due to tests in WriteAheadLogBackedBlockRDDSuite; it also broke the master SBT build. I wonder if there's some kind of sharing / contention where multiple copies of the test are attempting to write to the same directory. [~hshreedharan], it would be great to get your help with this to see if you can spot any potential problems in that test suite. > Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: > "java.lang.IllegalStateException: File exists and there is no append support!" > > > Key: SPARK-4826 > URL: https://issues.apache.org/jira/browse/SPARK-4826 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0, 1.3.0 >Reporter: Josh Rosen >Assignee: Tathagata Das > Labels: flaky-test > > I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite > where four tests failed with the same exception. > [Link to test result (this will eventually > break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
> In case that link breaks: > The failed tests: > {code} > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in block manager, not in write ahead log > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, not in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > available only in write ahead log, and test storing in block manager > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data > with partially available in block manager, and rest in write ahead log > {code} > The error messages are all (essentially) the same: > {code} > java.lang.IllegalStateException: File exists and there is no append > support! > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34) > at > org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67) > at > 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.sca
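One common fix for the contention Josh suspects is to never use a fixed path in tests: give each test run a unique, race-free directory so concurrent builds cannot collide on the same write-ahead-log file. A minimal sketch of that pattern (illustrative; the Spark suite is Scala, and `fresh_log_dir` is a made-up name):

```python
# Sketch: per-test unique directories via tempfile.mkdtemp, which creates
# the directory atomically and guarantees a fresh name, so two concurrent
# test processes can never see "File exists" for each other's logs.
import os
import tempfile


def fresh_log_dir(prefix="wal-test-"):
    """Create and return a unique directory for one test's log files."""
    return tempfile.mkdtemp(prefix=prefix)


a, b = fresh_log_dir(), fresh_log_dir()
print(a != b and os.path.isdir(a) and os.path.isdir(b))  # True
```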
[jira] [Closed] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider
[ https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar closed SPARK-3640. --- Resolution: Not a Problem Tested; Chris's suggestion of using an EC2 IAM instance profile works fine. > KinesisUtils should accept a credentials object instead of forcing > DefaultCredentialsProvider > - > > Key: SPARK-3640 > URL: https://issues.apache.org/jira/browse/SPARK-3640 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Aniket Bhatnagar > Labels: kinesis > > KinesisUtils should accept AWS Credentials as a parameter and should default > to DefaultCredentialsProvider if no credentials are provided. Currently, the > implementation forces usage of DefaultCredentialsProvider, which can be a pain, > especially when jobs are run by multiple unix users.
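The API shape requested here is a standard pattern: take credentials as an optional parameter and fall back to the default provider chain only when none are supplied. A hedged sketch of that design (all names here are illustrative, not the actual KinesisUtils or AWS SDK API):

```python
# Sketch: explicit credentials win; otherwise fall back to a default
# provider. This is the parameter shape the issue asks KinesisUtils to
# adopt, expressed with made-up classes.

class DefaultCredentialsProvider:
    """Stand-in for an environment/instance-profile credential chain."""
    def resolve(self):
        return ("from-environment", "from-environment-secret")


class StaticCredentials:
    """Explicit credentials supplied by the caller."""
    def __init__(self, access_key, secret_key):
        self._creds = (access_key, secret_key)

    def resolve(self):
        return self._creds


def create_receiver(stream, credentials=None):
    """Build a (stream, resolved-credentials) pair, defaulting sensibly."""
    provider = credentials if credentials is not None else DefaultCredentialsProvider()
    return (stream, provider.resolve())


print(create_receiver("events"))
print(create_receiver("events", StaticCredentials("my-key", "my-secret")))
```

The key point is that the default is chosen inside the function, so callers who are happy with the environment-based chain pass nothing, while multi-user deployments can inject their own credentials.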
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245894#comment-14245894 ] Vincenzo Selvaggio commented on SPARK-1406: --- Scala examples on usage of ModelExporter.toPMML(model,path): https://github.com/selvinsource/spark-pmml-exporter-validator/tree/master/src/main/resources/spark_shell_exporter Exported PMML xml files: https://github.com/selvinsource/spark-pmml-exporter-validator/tree/master/src/main/resources/exported_pmml_models Evaluation using JPMML of the exported files: https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/java/org/selvinsource/spark_pmml_exporter_validator/SparkPMMLExporterValidator.java > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont >Assignee: Vincenzo Selvaggio > Attachments: MyJPMMLEval.java, SPARK-1406.pdf, SPARK-1406_v2.pdf, > kmeans.xml > > > It would be useful if Spark supported the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow analytical models created with a > statistical modeling tool like R, SAS, SPSS, etc. to be used with Spark (MLlib), which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245893#comment-14245893 ] Vincenzo Selvaggio commented on SPARK-1406: --- Find at https://github.com/selvinsource/spark-pmml-exporter-validator a simple validator project showing that the predictions made by Apache Spark and by the JPMML Evaluator (loading the PMML exported from Spark) are comparable, proving that the PMML export from Apache Spark works as expected.
[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincenzo Selvaggio updated SPARK-1406: -- Attachment: SPARK-1406_v2.pdf Updated the document with the models supported so far: KMeansModel, LogisticRegressionModel, SVMModel, LinearRegressionModel, RidgeRegressionModel, LassoModel
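Since PMML "just contains the parameterization of an analytical model", the exporter's output is a plain XML document describing the fitted parameters. A hedged sketch of the shape such a document takes for a linear model, built with the standard library (this is a simplified illustration, not the output of the actual ModelExporter.toPMML):

```python
# Sketch: serialize a fitted linear model (intercept + coefficients) as a
# minimal PMML-style RegressionModel. Real PMML documents carry extra
# metadata (Header, DataDictionary, MiningSchema) omitted here.
import xml.etree.ElementTree as ET


def linear_model_to_pmml(intercept, coefficients):
    """Return a minimal PMML RegressionModel document as a string."""
    pmml = ET.Element("PMML", version="4.2")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=str(coef))
    return ET.tostring(pmml, encoding="unicode")


doc = linear_model_to_pmml(0.5, {"x1": 1.25, "x2": -3.0})
print(doc)
```

A consumer like JPMML-Evaluator then reads the parameters back out of the XML and applies the model to input tuples, which is exactly the round trip the validator project above exercises.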
[jira] [Created] (SPARK-4842) Use WeakTypeTags in ScalaReflection
Michael Armbrust created SPARK-4842: --- Summary: Use WeakTypeTags in ScalaReflection Key: SPARK-4842 URL: https://issues.apache.org/jira/browse/SPARK-4842 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Priority: Critical Right now we can't create SchemaRDDs from RDDs of case classes that are defined inside functions, because only WeakTypeTags are available in that scope. This is pretty confusing to users: http://apache-spark-user-list.1001560.n3.nabble.com/parquet-file-not-loading-spark-v-1-1-0-td20618.html#a20628 https://issues.scala-lang.org/browse/SI-6649