[jira] [Commented] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246391#comment-14246391
 ] 

Apache Spark commented on SPARK-4848:
-

User 'nkronenfeld' has created a pull request for this issue:
https://github.com/apache/spark/pull/3699

> On a stand-alone cluster, several worker-specific variables are read only on 
> the master
> ---
>
> Key: SPARK-4848
> URL: https://issues.apache.org/jira/browse/SPARK-4848
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
> Environment: stand-alone spark cluster
>Reporter: Nathan Kronenfeld
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> On a stand-alone spark cluster, much of the determination of worker 
> specifics, especially when one has multiple instances per node, is done only 
> on the master.
> The master loops over instances, and starts one worker per instance on each 
> node.
> This means that if your workers have different values of 
> SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from 
> the master), all values are ignored except the one on the master.
> SPARK_WORKER_PORT looks like it is unread in scripts but read in code - I'm 
> not sure how it will behave, since all instances will read the same value 
> from the environment.






[jira] [Created] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master

2014-12-14 Thread Nathan Kronenfeld (JIRA)
Nathan Kronenfeld created SPARK-4848:


 Summary: On a stand-alone cluster, several worker-specific 
variables are read only on the master
 Key: SPARK-4848
 URL: https://issues.apache.org/jira/browse/SPARK-4848
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
 Environment: stand-alone spark cluster
Reporter: Nathan Kronenfeld


On a stand-alone spark cluster, much of the determination of worker specifics, 
especially when one has multiple instances per node, is done only on the master.

The master loops over instances, and starts one worker per instance on each node.

This means that if your workers have different values of SPARK_WORKER_INSTANCES 
or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are 
ignored except the one on the master.

SPARK_WORKER_PORT looks like it is unread in scripts but read in code - I'm not 
sure how it will behave, since all instances will read the same value from the 
environment.
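
As a purely hypothetical sketch of the difference (the object and method names 
below are made up for illustration; they are not Spark's actual classes or 
scripts):

{code}
// Hypothetical sketch only -- the real logic lives in the sbin scripts and the
// Master/Worker launch path, not in these made-up names.
object WorkerEnv {
  // Read from the environment of whichever machine evaluates this.
  def instances: Int = sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
  def webUiPort: Int = sys.env.getOrElse("SPARK_WORKER_WEBUI_PORT", "8081").toInt
}

object LaunchSketch {
  // Current behaviour (simplified): the master resolves the values once and
  // reuses them for every node, so per-worker settings never take effect.
  def launchFromMaster(hosts: Seq[String]): Unit =
    for (host <- hosts; i <- 0 until WorkerEnv.instances)
      println(s"start worker $i on $host, webui port ${WorkerEnv.webUiPort + i}")

  // What this report asks for (simplified): run the same loop on each worker,
  // so WorkerEnv picks up that worker's own environment.
  def launchOnWorker(): Unit =
    for (i <- 0 until WorkerEnv.instances)
      println(s"start worker $i locally, webui port ${WorkerEnv.webUiPort + i}")
}
{code}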






[jira] [Commented] (SPARK-4847) extraStrategies cannot take effect in SQLContext

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246380#comment-14246380
 ] 

Apache Spark commented on SPARK-4847:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/3698

> extraStrategies cannot take effect in SQLContext
> 
>
> Key: SPARK-4847
> URL: https://issues.apache.org/jira/browse/SPARK-4847
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
>
> Because strategies is initialized when SparkPlanner is created, 
> extraStrategies added later cannot be added into strategies.






[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value

2014-12-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-4812:

Description: 
The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
`sqlContext`, which can be overridden by subclasses. Here is a simple example to 
show this issue.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call subclass's `sqlContext` which has not
                        // yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar

scala> 
{code}

We should make `sqlContext` `final` to prevent subclasses from overriding it 
incorrectly.

  was:
The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
`sqlContext`, which can be overridden by subclasses. Here is a simple example to 
show this issue.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call subclass's `sqlContext` which has not
                        // yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar

scala> 
{code}

To fix it, we should override codegenEnabled in `InMemoryColumnarTableScan`.


> SparkPlan.codegenEnabled may be initialized to a wrong value
> 
>
> Key: SPARK-4812
> URL: https://issues.apache.org/jira/browse/SPARK-4812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
> `sqlContext`, which can be overridden by subclasses. Here is a simple example 
> to show this issue.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> abstract class Foo {
>   protected val sqlContext = "Foo"
>   val codegenEnabled: Boolean = {
> println(sqlContext) // it will call subclass's `sqlContext` which has not 
> yet been initialized.
> if (sqlContext != null) {
>   true
> } else {
>   false
> }
>   }
> }
> class Bar extends Foo {
>   override val sqlContext = "Bar"
> }
> println(new Bar().codegenEnabled)
> // Exiting paste mode, now interpreting.
> null
> false
> defined class Foo
> defined class Bar
> scala> 
> {code}
> We should make `sqlContext` `final` to prevent subclasses from overriding it 
> incorrectly.






[jira] [Created] (SPARK-4847) extraStrategies cannot take effect in SQLContext

2014-12-14 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-4847:
--

 Summary: extraStrategies cannot take effect in SQLContext
 Key: SPARK-4847
 URL: https://issues.apache.org/jira/browse/SPARK-4847
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Saisai Shao


Because strategies is initialized when SparkPlanner is created, extraStrategies 
added later cannot be added into strategies.
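
A minimal, self-contained way to see the initialization-order issue (Planner, 
Demo and the string "strategies" below are made-up stand-ins, not the actual 
SQLContext/SparkPlanner code):

{code}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only.
class Planner(extra: ArrayBuffer[String]) {
  // A `val` is computed once, when the planner is constructed...
  val strategies: Seq[String] = extra.toList ++ List("DefaultStrategy")
  // ...while a `def` (re-evaluated on each access) would see later additions.
  def strategiesLive: Seq[String] = extra.toList ++ List("DefaultStrategy")
}

object Demo extends App {
  val extraStrategies = ArrayBuffer.empty[String]
  val planner = new Planner(extraStrategies) // planner created first
  extraStrategies += "MyStrategy"            // strategy added afterwards
  println(planner.strategies)     // List(DefaultStrategy)             <- addition lost
  println(planner.strategiesLive) // List(MyStrategy, DefaultStrategy) <- addition seen
}
{code}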






[jira] [Updated] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module

2014-12-14 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4843:
--
Assignee: Kostas Sakellis

> Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
> -
>
> Key: SPARK-4843
> URL: https://issues.apache.org/jira/browse/SPARK-4843
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kostas Sakellis
>Assignee: Kostas Sakellis
>
> ExecutorRunnableUtil is a parent of ExecutorRunnable because of the 
> yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash 
> the unnecessary hierarchy.  






[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246337#comment-14246337
 ] 

Apache Spark commented on SPARK-4846:
-

User 'jinntrance' has created a pull request for this issue:
https://github.com/apache/spark/pull/3697

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: 
> Requested array size exceeds VM limit"
> ---
>
> Key: SPARK-4846
> URL: https://issues.apache.org/jira/browse/SPARK-4846
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
> Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one 
> partition.
> The corpus contains about 300 million words and its vocabulary size is about 
> 10 million.
>Reporter: Joseph Tang
>Priority: Critical
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at 
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)






[jira] [Closed] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement

2014-12-14 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva closed SPARK-2604.
---
Resolution: Fixed

> Spark Application hangs on yarn in edge case scenario of executor memory 
> requirement
> 
>
> Key: SPARK-2604
> URL: https://issues.apache.org/jira/browse/SPARK-2604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Twinkle Sachdeva
>
> In a yarn environment, let's say:
> MaxAM = maximum allocatable memory
> ExecMem = executor memory
> if (MaxAM > ExecMem && (MaxAM - ExecMem) > 384m)
>   then maximum resource validation fails w.r.t. executor memory, and the 
> application master gets launched, but when the resource is allocated and 
> validated again, it is returned and the application appears to hang.
> The typical use case is to ask for executor memory = maximum allowed memory 
> as per the yarn config.
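
To make the edge case concrete, a rough worked example (the 8192m maximum below 
is an assumed value, and 384m is the default memory overhead):

{code}
// Back-of-the-envelope illustration only; real values come from the YARN config.
val maxAllocatableMb = 8192          // e.g. yarn.scheduler.maximum-allocation-mb
val executorMemoryMb = 8192          // typical case: ask for the maximum allowed
val overheadMb       = 384           // default overhead added on top of executor memory
val requestedMb      = executorMemoryMb + overheadMb
println(s"requested ${requestedMb}m per container, maximum allocatable ${maxAllocatableMb}m")
// 8576m > 8192m: the container request can never be satisfied, so the
// application sits waiting for executors and appears to hang.
{code}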






[jira] [Created] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"

2014-12-14 Thread Joseph Tang (JIRA)
Joseph Tang created SPARK-4846:
--

 Summary: When the vocabulary size is large, Word2Vec may yield 
"OutOfMemoryError: Requested array size exceeds VM limit"
 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one 
partition.
The corpus contains about 300 million words and its vocabulary size is about 10 
million.
Reporter: Joseph Tang
Priority: Critical


Exception in thread "Driver" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
at 
org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
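
A rough size estimate that seems consistent with this stack trace (the vector 
size of 100 is only MLlib's default and is assumed here):

{code}
// Back-of-the-envelope estimate only, assuming the default vectorSize of 100.
val vocabSize  = 10L * 1000 * 1000        // ~10 million words, per the environment above
val vectorSize = 100L                     // assumed default
val floats     = vocabSize * vectorSize   // elements in one Array[Float] weight matrix
val bytes      = floats * 4L
println(f"~${bytes / 1e9}%.1f GB per weight matrix")   // ~4.0 GB
// Serializing such an array inside a task closure needs one contiguous byte
// array bigger than the JVM's ~2 GB array limit, which matches the
// "Requested array size exceeds VM limit" thrown from ByteArrayOutputStream above.
{code}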






[jira] [Commented] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246318#comment-14246318
 ] 

Apache Spark commented on SPARK-4843:
-

User 'ksakellis' has created a pull request for this issue:
https://github.com/apache/spark/pull/3696

> Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
> -
>
> Key: SPARK-4843
> URL: https://issues.apache.org/jira/browse/SPARK-4843
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kostas Sakellis
>
> ExecutorRunnableUtil is a parent of ExecutorRunnable because of the 
> yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash 
> the unnecessary hierarchy.  






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246289#comment-14246289
 ] 

Apache Spark commented on SPARK-4826:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/3695

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.run

[jira] [Updated] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD

2014-12-14 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-4845:
---
Description: 
Add a parallelismRatio to control the number of partitions of a ShuffledRDD; the 
rule is:

 Math.max(1, parallelismRatio * number of partitions of the largest upstream 
RDD)
The ratio is 1.0 by default to stay compatible with the old behavior. 
Once we have more experience with it, we can change this.

  was:
Add a parallelismRatio to control the number of partitions of a ShuffledRDD; the 
rule is:

 Math.max(1, parallelismRatio * number of partitions of the largest upstream 
RDD)
The ratio is 1.0 by default to stay compatible with the old behavior. Once we 
have more experience with it, we can change this.


> Adding a parallelismRatio to control the partitions num of shuffledRDD
> --
>
> Key: SPARK-4845
> URL: https://issues.apache.org/jira/browse/SPARK-4845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Add a parallelismRatio to control the number of partitions of a ShuffledRDD; 
> the rule is:
>  Math.max(1, parallelismRatio * number of partitions of the largest upstream 
> RDD)
> The ratio is 1.0 by default to stay compatible with the old behavior. 
> Once we have more experience with it, we can change this.






[jira] [Commented] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246286#comment-14246286
 ] 

Apache Spark commented on SPARK-4845:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3694

> Adding a parallelismRatio to control the partitions num of shuffledRDD
> --
>
> Key: SPARK-4845
> URL: https://issues.apache.org/jira/browse/SPARK-4845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Add a parallelismRatio to control the number of partitions of a ShuffledRDD; 
> the rule is:
>  Math.max(1, parallelismRatio * number of partitions of the largest upstream 
> RDD)
> The ratio is 1.0 by default to stay compatible with the old behavior. Once 
> we have more experience with it, we can change this.






[jira] [Created] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD

2014-12-14 Thread wangfei (JIRA)
wangfei created SPARK-4845:
--

 Summary: Adding a parallelismRatio to control the partitions num 
of shuffledRDD
 Key: SPARK-4845
 URL: https://issues.apache.org/jira/browse/SPARK-4845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


Add a parallelismRatio to control the number of partitions of a ShuffledRDD; the 
rule is:

 Math.max(1, parallelismRatio * number of partitions of the largest upstream 
RDD)
The ratio is 1.0 by default to stay compatible with the old behavior. Once we 
have more experience with it, we can change this.
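
As a sketch of the rule (the function name and signature below are illustrative, 
not the actual pull request):

{code}
import org.apache.spark.rdd.RDD

// Illustrative only: one way the proposed rule could be computed for a shuffle.
def shufflePartitions(upstream: Seq[RDD[_]], parallelismRatio: Double = 1.0): Int = {
  val largest = upstream.map(_.partitions.length).max
  math.max(1, (parallelismRatio * largest).toInt)
}
// With the default ratio of 1.0 this matches the existing behaviour of using
// as many partitions as the largest upstream RDD.
{code}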






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-14 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246270#comment-14246270
 ] 

Hari Shreedharan commented on SPARK-4826:
-

I suspect that nextString is producing names that are likely to conflict (since 
createTempDir is atomic). Using a monotonically increasing counter for the file 
names will likely fix the issue.
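
Roughly what that would look like (a sketch under that assumption, not the test 
suite's actual code):

{code}
import java.util.concurrent.atomic.AtomicInteger

// Sketch of the suggested fix: derive each log file name from a monotonically
// increasing counter instead of a random string, so two writers created in the
// same directory can never pick the same name.
object LogFileNames {
  private val counter = new AtomicInteger(0)
  def next(dir: String): String = s"$dir/log-${counter.getAndIncrement()}"
}
{code}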

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest

[jira] [Created] (SPARK-4844) SGD should support custom sampling.

2014-12-14 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-4844:
--

 Summary: SGD should support custom sampling.
 Key: SPARK-4844
 URL: https://issues.apache.org/jira/browse/SPARK-4844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
 Fix For: 1.3.0









[jira] [Commented] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module

2014-12-14 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246254#comment-14246254
 ] 

Kostas Sakellis commented on SPARK-4843:


I'm working on this.

> Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
> -
>
> Key: SPARK-4843
> URL: https://issues.apache.org/jira/browse/SPARK-4843
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kostas Sakellis
>
> ExecutorRunnableUtil is a parent of ExecutorRunnable because of the 
> yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash 
> the unnecessary hierarchy.  






[jira] [Created] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module

2014-12-14 Thread Kostas Sakellis (JIRA)
Kostas Sakellis created SPARK-4843:
--

 Summary: Squash ExecutorRunnable and ExecutorRunnableUtil 
hierarchy in yarn module
 Key: SPARK-4843
 URL: https://issues.apache.org/jira/browse/SPARK-4843
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis


ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha 
and yarn-stable split. Now that yarn-alpha is gone, we can squash the 
unnecessary hierarchy.  
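
In miniature, this is the usual trait-squashing refactoring (the names and 
method bodies below are placeholders, not the real YARN launch code):

{code}
// Before: the helper trait existed only to be shared across yarn-alpha and
// yarn-stable.
trait ExecutorRunnableUtilSketch {
  def prepareEnvironment: Map[String, String] = Map.empty
}
class ExecutorRunnableBefore extends ExecutorRunnableUtilSketch {
  def run(): Unit = println(prepareEnvironment)
}

// After: with yarn-alpha gone, the helpers can live directly in the class.
class ExecutorRunnableAfter {
  def prepareEnvironment: Map[String, String] = Map.empty
  def run(): Unit = println(prepareEnvironment)
}
{code}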






[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246135#comment-14246135
 ] 

Apache Spark commented on SPARK-4814:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3692

> Enable assertions in SBT, Maven tests / AssertionError from Hive's 
> LazyBinaryInteger
> 
>
> Key: SPARK-4814
> URL: https://issues.apache.org/jira/browse/SPARK-4814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
> in Maven, in part because a Java test actually fails with {{AssertionError}}. 
> That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with 
> assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
> failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
> localhost): java.lang.AssertionError
> [info]at 
> org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]at 
> org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions 
> explicitly turned on too.
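
One note on the {{javaOptions in Test += "-ea"}} line above: sbt only passes 
javaOptions to forked test JVMs, so the flag assumes forking is enabled for 
tests (a settings sketch, not necessarily how Spark's build spells it):

{code}
// sbt settings sketch: javaOptions are only applied to forked JVMs, so the
// assertion flag assumes tests are forked.
fork in Test := true,
javaOptions in Test += "-ea",
{code}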






[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246122#comment-14246122
 ] 

Nicholas Chammas commented on SPARK-4826:
-

I just cooked up a quick way of invoking this test multiple times in parallel 
using [GNU parallel|http://www.gnu.org/software/parallel/]:

{code}
parallel 'sbt/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl 
-Phive -Phive-thriftserver "testOnly 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite"' ::: '' '' '' 
''
{code}

This will fire up 4 copies of that one test in parallel. I ran it a couple of 
times on my laptop without issue, but that appears to be due to some sbt locking 
that prevents the tests from actually running in parallel.

I've [posted a question on Stack 
Overflow|http://stackoverflow.com/questions/27474000/how-can-i-run-multiple-copies-of-the-same-test-in-parallel]
 about this.

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalate

[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246091#comment-14246091
 ] 

Nicholas Chammas commented on SPARK-4826:
-

This raises an interesting test infrastructure question: Do we have a way of 
invoking multiple copies of the same test (in the same JVM or across multiple 
JVMs) to check a test's level of isolation? If not, that might be a good thing 
to look into.

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:

[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4812:

Assignee: Shixiong Zhu

> SparkPlan.codegenEnabled may be initialized to a wrong value
> 
>
> Key: SPARK-4812
> URL: https://issues.apache.org/jira/browse/SPARK-4812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
> `sqlContext`, which can be overridden by subclasses. Here is a simple example 
> to show this issue.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> abstract class Foo {
>   protected val sqlContext = "Foo"
>   val codegenEnabled: Boolean = {
> println(sqlContext) // it will call subclass's `sqlContext` which has not 
> yet been initialized.
> if (sqlContext != null) {
>   true
> } else {
>   false
> }
>   }
> }
> class Bar extends Foo {
>   override val sqlContext = "Bar"
> }
> println(new Bar().codegenEnabled)
> // Exiting paste mode, now interpreting.
> null
> false
> defined class Foo
> defined class Bar
> scala> 
> {code}
> To fix it, we should override codegenEnabled in `InMemoryColumnarTableScan`.






[jira] [Closed] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows

2014-12-14 Thread Stephen Boesch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Boesch closed SPARK-4775.
-
Resolution: Not a Problem

> Possible problem in a simple join?  Getting duplicate rows and missing rows
> ---
>
> Key: SPARK-4775
> URL: https://issues.apache.org/jira/browse/SPARK-4775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Run on Mac but should be agnostic
>Reporter: Stephen Boesch
>Assignee: Michael Armbrust
>
> I am working on testing of HBase joins. As part of this work some simple 
> vanilla SparkSQL tests were created.  Some of the results are surprising: 
> here are the details:
> 
> Consider the following schema that includes two columns:
> {code}
> case class JoinTable2Cols(intcol: Int, strcol: String)
> {code}
> Let us register two temp tables using this schema and insert 2 rows and 4 
> rows respectively:
> {code}
> val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, 
> s"valA$ix")})
> rdd1.registerTempTable("SparkJoinTable1")
> val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
> val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, 
> s"valB$is")})
> val table2 = rdd2.registerTempTable("SparkJoinTable2")
> {code}
> Here is the data in both tables:
> {code}
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> {code}
> Now let us join the tables on the first column:
> {code}
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
> t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
> SparkJoinTable2 t2 on t1.intcol = t2.intcol
> {code}
> What results do we get? The query came back with 4 results:
> {code}
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> {code}
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> {code}
>   Seq(1, 1, "valA1", "valB1"),
>   Seq(1, 1, "valA1", "valB2"),
>   Seq(2, 2, "valA2", "valB3"),
>   Seq(2, 2, "valA2", "valB4"))
> {code}
> A standalone testing program is attached  SparkSQLJoinSuite. An abridged 
> version of the actual output is also attached.






[jira] [Commented] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows

2014-12-14 Thread Stephen Boesch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246084#comment-14246084
 ] 

Stephen Boesch commented on SPARK-4775:
---

Thanks v much Michael. You hit the nail on the head.   I will update our 
internal code here to remove that antipattern.  Issue is being closed.

> Possible problem in a simple join?  Getting duplicate rows and missing rows
> ---
>
> Key: SPARK-4775
> URL: https://issues.apache.org/jira/browse/SPARK-4775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Run on Mac but should be agnostic
>Reporter: Stephen Boesch
>Assignee: Michael Armbrust
>
> I am working on testing of HBase joins. As part of this work some simple 
> vanilla SparkSQL tests were created.  Some of the results are surprising: 
> here are the details:
> 
> Consider the following schema that includes two columns:
> {code}
> case class JoinTable2Cols(intcol: Int, strcol: String)
> {code}
> Let us register two temp tables using this schema and insert 2 rows and 4 
> rows respectively:
> {code}
> val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, 
> s"valA$ix")})
> rdd1.registerTempTable("SparkJoinTable1")
> val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
> val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, 
> s"valB$is")})
> val table2 = rdd2.registerTempTable("SparkJoinTable2")
> {code}
> Here is the data in both tables:
> {code}
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> {code}
> Now let us join the tables on the first column:
> {code}
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
> t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
> SparkJoinTable2 t2 on t1.intcol = t2.intcol
> {code}
> What results do we get? The query came back with 4 results:
> {code}
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> {code}
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> {code}
>   Seq(1, 1, "valA1", "valB1"),
>   Seq(1, 1, "valA1", "valB2"),
>   Seq(2, 2, "valA2", "valB3"),
>   Seq(2, 2, "valA2", "valB4"))
> {code}
> A standalone testing program is attached  SparkSQLJoinSuite. An abridged 
> version of the actual output is also attached.






[jira] [Updated] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4812:

Target Version/s: 1.3.0

> SparkPlan.codegenEnabled may be initialized to a wrong value
> 
>
> Key: SPARK-4812
> URL: https://issues.apache.org/jira/browse/SPARK-4812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>
> The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
> `sqlContext`, which can be overridden by subclasses. Here is a simple example 
> to show this issue.
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> abstract class Foo {
>   protected val sqlContext = "Foo"
>   val codegenEnabled: Boolean = {
> println(sqlContext) // it will call subclass's `sqlContext` which has not 
> yet been initialized.
> if (sqlContext != null) {
>   true
> } else {
>   false
> }
>   }
> }
> class Bar extends Foo {
>   override val sqlContext = "Bar"
> }
> println(new Bar().codegenEnabled)
> // Exiting paste mode, now interpreting.
> null
> false
> defined class Foo
> defined class Bar
> scala> 
> {code}
> To fix it, we should override codegenEnabled in `InMemoryColumnarTableScan`.






[jira] [Updated] (SPARK-4684) Add a script to run JDBC server on Windows

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4684:

Target Version/s: 1.3.0

> Add a script to run JDBC server on Windows
> --
>
> Key: SPARK-4684
> URL: https://issues.apache.org/jira/browse/SPARK-4684
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Matei Zaharia
>Assignee: Cheng Lian
>Priority: Minor
>







[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-14 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246041#comment-14246041
 ] 

Hari Shreedharan commented on SPARK-4826:
-

It looks like there is some issue with the directories/files existing (though 
we use random names for files/dirs). I will try to get something ready later 
today.

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTests

[jira] [Commented] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows

2014-12-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246029#comment-14246029
 ] 

Michael Armbrust commented on SPARK-4775:
-

I only scanned over the code quickly, but I think the problem is likely that 
you are calling "toRDD".  This function is a developer API not intended for 
users and is documented as "Internal version of the RDD. Avoids copies and has no 
schema".  If you use it directly without defensively copying, you'll see weird 
repeated rows.  Instead, just use the SchemaRDD as an RDD and we'll do the 
copying for you.
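
To illustrate the point (a hedged sketch, not code from this ticket; the table and column names are taken from the description below), stay on the public SchemaRDD API and let collect() hand back independent rows:

{code}
import org.apache.spark.sql.{Row, SQLContext}

// Sketch: run the join through the public API; collect() returns independent Row
// objects, so no defensive copying is needed on the caller's side.
def collectJoin(sqlContext: SQLContext): Array[Row] = {
  val joined = sqlContext.sql(
    """SELECT t1.intcol, t2.intcol, t1.strcol, t2.strcol
      |FROM SparkJoinTable1 t1 JOIN SparkJoinTable2 t2
      |ON t1.intcol = t2.intcol""".stripMargin)
  joined.collect()
}
{code}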

> Possible problem in a simple join?  Getting duplicate rows and missing rows
> ---
>
> Key: SPARK-4775
> URL: https://issues.apache.org/jira/browse/SPARK-4775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Run on Mac but should be agnostic
>Reporter: Stephen Boesch
>Assignee: Michael Armbrust
>
> I am working on testing of HBase joins. As part of this work some simple 
> vanilla SparkSQL tests were created.  Some of the results are surprising: 
> here are the details:
> 
> Consider the following schema that includes two columns:
> {code}
> case class JoinTable2Cols(intcol: Int, strcol: String)
> {code}
> Let us register two temp tables using this schema and insert 2 rows and 4 
> rows respectively:
> {code}
> val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, 
> s"valA$ix")})
> rdd1.registerTempTable("SparkJoinTable1")
> val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
> val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, 
> s"valB$is")})
> val table2 = rdd2.registerTempTable("SparkJoinTable2")
> {code}
> Here is the data in both tables:
> {code}
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> {code}
> Now let us join the tables on the first column:
> {code}
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
> t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
> SparkJoinTable2 t2 on t1.intcol = t2.intcol
> {code}
> What results do we get? The query came back with 4 results:
> {code}
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> {code}
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> {code}
> Seq(
>   Seq(1, 1, "valA1", "valB1"),
>   Seq(1, 1, "valA1", "valB2"),
>   Seq(2, 2, "valA2", "valB3"),
>   Seq(2, 2, "valA2", "valB4"))
> {code}
> A standalone testing program is attached  SparkSQLJoinSuite. An abridged 
> version of the actual output is also attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4775) Possible problem in a simple join? Getting duplicate rows and missing rows

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-4775:
---

Assignee: Michael Armbrust  (was: Cheng Lian)

> Possible problem in a simple join?  Getting duplicate rows and missing rows
> ---
>
> Key: SPARK-4775
> URL: https://issues.apache.org/jira/browse/SPARK-4775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Run on Mac but should be agnostic
>Reporter: Stephen Boesch
>Assignee: Michael Armbrust
>
> I am working on testing of HBase joins. As part of this work some simple 
> vanilla SparkSQL tests were created.  Some of the results are surprising: 
> here are the details:
> 
> Consider the following schema that includes two columns:
> {code}
> case class JoinTable2Cols(intcol: Int, strcol: String)
> {code}
> Let us register two temp tables using this schema and insert 2 rows and 4 
> rows respectively:
> {code}
> val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, 
> s"valA$ix")})
> rdd1.registerTempTable("SparkJoinTable1")
> val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
> val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, 
> s"valB$is")})
> val table2 = rdd2.registerTempTable("SparkJoinTable2")
> {code}
> Here is the data in both tables:
> {code}
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> {code}
> Now let us join the tables on the first column:
> {code}
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
> t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
> SparkJoinTable2 t2 on t1.intcol = t2.intcol
> {code}
> What results do we get? The query came back with 4 results:
> {code}
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> {code}
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> {code}
> Seq(
>   Seq(1, 1, "valA1", "valB1"),
>   Seq(1, 1, "valA1", "valB2"),
>   Seq(2, 2, "valA2", "valB3"),
>   Seq(2, 2, "valA2", "valB4"))
> {code}
> A standalone testing program is attached  SparkSQLJoinSuite. An abridged 
> version of the actual output is also attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4814:

Target Version/s: 1.3.0

> Enable assertions in SBT, Maven tests / AssertionError from Hive's 
> LazyBinaryInteger
> 
>
> Key: SPARK-4814
> URL: https://issues.apache.org/jira/browse/SPARK-4814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
> in Maven, in part because a Java test actually fails with {{AssertionError}}. 
> That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with 
> assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
> failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
> localhost): java.lang.AssertionError
> [info]at 
> org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]at 
> org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions 
> explicitly turned on too.
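
As a hedged aside (an illustrative sbt fragment, not Spark's actual build definition): sbt only applies javaOptions to forked test JVMs, so the one-liner assumes tests are forked:

{code}
// Illustrative sbt 0.13-style settings fragment (assumption: tests run in a forked JVM)
lazy val testSettings = Seq(
  fork in Test := true,          // javaOptions are only honoured when the test JVM is forked
  javaOptions in Test += "-ea"   // enable JVM assertions (java assert / AssertionError) in tests
)
{code}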



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger

2014-12-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246024#comment-14246024
 ] 

Michael Armbrust commented on SPARK-4814:
-

Either way, I don't think we should block turning on assertions for the rest of 
Spark.

> Enable assertions in SBT, Maven tests / AssertionError from Hive's 
> LazyBinaryInteger
> 
>
> Key: SPARK-4814
> URL: https://issues.apache.org/jira/browse/SPARK-4814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>
> Follow up to SPARK-4159, wherein we noticed that Java tests weren't running 
> in Maven, in part because a Java test actually fails with {{AssertionError}}. 
> That code/test was fixed in SPARK-4850.
> The reason it wasn't caught by SBT tests was that they don't run with 
> assertions on, and Maven's surefire does.
> Turning on assertions in the SBT build is trivial, adding one line:
> {code}
> javaOptions in Test += "-ea",
> {code}
> This reveals a test failure in Scala test suites though:
> {code}
> [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds)
> [info]   Failed to execute query using catalyst:
> [info]   Error: Job aborted due to stage failure: Task 1 in stage 551.0 
> failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, 
> localhost): java.lang.AssertionError
> [info]at 
> org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.init(LazyBinaryInteger.java:51)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
> [info]at 
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
> [info]at 
> org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318)
> [info]at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132)
> [info]at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> [info]at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
> [info]at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> [info]at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> [info]at org.apache.spark.scheduler.Task.run(Task.scala:56)
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [info]at java.lang.Thread.run(Thread.java:745)
> {code}
> The items for this JIRA are therefore:
> - Enable assertions in SBT
> - Fix this failure
> - Figure out why Maven scalatest didn't trigger it - may need assertions 
> explicitly turned on too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4782:

Affects Version/s: (was: 1.3.0)

> Add inferSchema support for RDD[Map[String, Any]]
> -
>
> Key: SPARK-4782
> URL: https://issues.apache.org/jira/browse/SPARK-4782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jianshi Huang
>Priority: Minor
>
> The best way to convert an RDD[Map[String, Any]] to a SchemaRDD currently seems to 
> be converting each Map to a JSON string first and using JsonRDD.inferSchema on it.
> That is very inefficient.
> Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for 
> schemaless data, as adding a Map-like interface to any serialization format is 
> easy.
> So please add inferSchema support for RDD[Map[String, Any]]. *Then, for any new 
> serialization format we want to support, we just need to add a Map interface 
> wrapper for it.*
> Jianshi
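
For context, a rough sketch of the JSON detour described above (hand-rolled JSON for flat maps of strings and numbers only, purely illustrative):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Sketch of the current workaround: render each Map as a JSON string and let
// jsonRDD / JsonRDD.inferSchema derive the schema. Every record round-trips
// through text, which is the inefficiency this request is about.
def mapsToSchemaRDD(sqlContext: SQLContext, rdd: RDD[Map[String, Any]]) = {
  val json = rdd.map { m =>
    m.map {
      case (k, v: String) => s"\"$k\":\"$v\""
      case (k, v)         => s"\"$k\":$v"
    }.mkString("{", ",", "}")
  }
  sqlContext.jsonRDD(json)
}
{code}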



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2014-12-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4782:

Target Version/s: 1.3.0

> Add inferSchema support for RDD[Map[String, Any]]
> -
>
> Key: SPARK-4782
> URL: https://issues.apache.org/jira/browse/SPARK-4782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jianshi Huang
>Priority: Minor
>
> The best way to convert an RDD[Map[String, Any]] to a SchemaRDD currently seems to 
> be converting each Map to a JSON string first and using JsonRDD.inferSchema on it.
> That is very inefficient.
> Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for 
> schemaless data, as adding a Map-like interface to any serialization format is 
> easy.
> So please add inferSchema support for RDD[Map[String, Any]]. *Then, for any new 
> serialization format we want to support, we just need to add a Map interface 
> wrapper for it.*
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4838) StackOverflowError when serializing a task

2014-12-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246023#comment-14246023
 ] 

Michael Armbrust commented on SPARK-4838:
-

Yeah, any more detail you can provide would be very helpful.  I have 
successfully run queries with 20,000+ partitions using HadoopRDD on Spark 1.1.

> StackOverflowError when serializing a task
> --
>
> Key: SPARK-4838
> URL: https://issues.apache.org/jira/browse/SPARK-4838
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
>Reporter: Hong Shen
>
> When running a SQL query with more than 2000 partitions, each partition backed by 
> its own HadoopRDD, it causes a java.lang.StackOverflowError when serializing the task.
> The error message from Spark is: Job aborted due to stage failure: Task 
> serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4841) Batch serializer bug in PySpark's RDD.zip

2014-12-14 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245996#comment-14245996
 ] 

Xiangrui Meng commented on SPARK-4841:
--

This is the commit that caused the bug: 
786e75b33f0bc1445bfc289fe4b62407cb79026e (SPARK-3886)

> Batch serializer bug in PySpark's RDD.zip
> -
>
> Key: SPARK-4841
> URL: https://issues.apache.org/jira/browse/SPARK-4841
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>
> {code}
> t = sc.textFile("README.md")
> t.zip(t).count()
> {code}
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 readme.zip(readme).count()
> /Users/meng/src/spark/python/pyspark/rdd.pyc in count(self)
> 817 3
> 818 """
> --> 819 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
> 820
> 821 def stats(self):
> /Users/meng/src/spark/python/pyspark/rdd.pyc in sum(self)
> 808 6.0
> 809 """
> --> 810 return self.mapPartitions(lambda x: 
> [sum(x)]).reduce(operator.add)
> 811
> 812 def count(self):
> /Users/meng/src/spark/python/pyspark/rdd.pyc in reduce(self, f)
> 713 yield reduce(f, iterator, initial)
> 714
> --> 715 vals = self.mapPartitions(func).collect()
> 716 if vals:
> 717 return reduce(f, vals)
> /Users/meng/src/spark/python/pyspark/rdd.pyc in collect(self)
> 674 """
> 675 with SCCallSiteSync(self.context) as css:
> --> 676 bytesInJava = self._jrdd.collect().iterator()
> 677 return list(self._collect_iterator_through_file(bytesInJava))
> 678
> /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539
> 540 for temp_arg in temp_args:
> /Users/meng/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o69.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback 
> (most recent call last):
>   File "/Users/meng/src/spark/python/pyspark/worker.py", line 107, in main
> process()
>   File "/Users/meng/src/spark/python/pyspark/worker.py", line 98, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/meng/src/spark/python/pyspark/serializers.py", line 198, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/meng/src/spark/python/pyspark/serializers.py", line 81, in 
> dump_stream
> raise NotImplementedError
> NotImplementedError
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:174)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$a

[jira] [Commented] (SPARK-4826) Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: "java.lang.IllegalStateException: File exists and there is no append support!"

2014-12-14 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245978#comment-14245978
 ] 

Josh Rosen commented on SPARK-4826:
---

I committed a documentation typo fix to {{master}} and {{branch-1.2}} at the 
same time, which caused a huge number of Maven builds to kick off 
simultaneously in Jenkins (since it was otherwise idle), and all of these 
builds failed due to tests in WriteAheadLogBackedBlockRDDSuite; it also broke 
the master SBT build.

I wonder if there's some kind of sharing / contention where multiple copies of 
the test are attempting to write to the same directory.

[~hshreedharan], it would be great to get your help with this to see if you can 
spot any potential problems in that test suite.
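
Not from the suite itself, but a generic sketch of the usual fix for this kind of collision: give each test run a directory that is unique at the filesystem level (rather than a randomly named child of a shared path), so concurrent JVMs cannot step on each other:

{code}
import java.nio.file.{Files, Path}

// Generic sketch (names are illustrative): createTempDirectory guarantees a fresh,
// unique directory per call, even across test JVMs running concurrently on one host.
def withTempDir[T](prefix: String)(body: Path => T): T = {
  val dir = Files.createTempDirectory(prefix)
  try body(dir)
  finally {
    // best-effort, non-recursive cleanup for the sketch
    Option(dir.toFile.listFiles()).foreach(_.foreach(_.delete()))
    dir.toFile.delete()
  }
}
{code}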

> Possible flaky tests in WriteAheadLogBackedBlockRDDSuite: 
> "java.lang.IllegalStateException: File exists and there is no append support!"
> 
>
> Key: SPARK-4826
> URL: https://issues.apache.org/jira/browse/SPARK-4826
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Tathagata Das
>  Labels: flaky-test
>
> I saw a recent master Maven build failure in WriteAheadLogBackedBlockRDDSuite 
> where four tests failed with the same exception.
> [Link to test result (this will eventually 
> break)|https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/1156/].
>   In case that link breaks:
> The failed tests:
> {code}
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in block manager, not in write ahead log
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, not in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> available only in write ahead log, and test storing in block manager
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.Read data 
> with partially available in block manager, and rest in write ahead log
> {code}
> The error messages are all (essentially) the same:
> {code}
>  java.lang.IllegalStateException: File exists and there is no append 
> support!
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream$lzycompute(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.org$apache$spark$streaming$util$WriteAheadLogWriter$$stream(WriteAheadLogWriter.scala:34)
>   at 
> org.apache.spark.streaming.util.WriteAheadLogWriter.(WriteAheadLogWriter.scala:42)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.writeLogSegments(WriteAheadLogBackedBlockRDDSuite.scala:140)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDDSuite$$testRDD(WriteAheadLogBackedBlockRDDSuite.scala:95)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply$mcV$sp(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite$$anonfun$4.apply(WriteAheadLogBackedBlockRDDSuite.scala:67)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.sca

[jira] [Closed] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-12-14 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar closed SPARK-3640.
---
Resolution: Not a Problem

Tested, and Chris's suggestion of using an EC2 IAM instance profile works fine.

> KinesisUtils should accept a credentials object instead of forcing 
> DefaultCredentialsProvider
> -
>
> Key: SPARK-3640
> URL: https://issues.apache.org/jira/browse/SPARK-3640
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Aniket Bhatnagar
>  Labels: kinesis
>
> KinesisUtils should accept AWS Credentials as a parameter and should default 
> to DefaultCredentialsProvider if no credentials are provided. Currently, the 
> implementation forces the usage of DefaultCredentialsProvider, which can be a pain, 
> especially when jobs are run by multiple Unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-12-14 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245894#comment-14245894
 ] 

Vincenzo Selvaggio commented on SPARK-1406:
---

Scala examples on usage of ModelExporter.toPMML(model,path):
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/master/src/main/resources/spark_shell_exporter

Exported PMML xml files:
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/master/src/main/resources/exported_pmml_models

Evaluation using JPMML of the exported files:
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/java/org/selvinsource/spark_pmml_exporter_validator/SparkPMMLExporterValidator.java
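
Based on those linked spark-shell examples, a minimal usage sketch (ModelExporter comes from this exporter work, not from a released MLlib API, and the output path is illustrative):

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Train a tiny KMeansModel and export it to PMML; assumes a spark-shell `sc`
// and the ModelExporter from the linked project on the classpath.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(data, k = 2, maxIterations = 20)

ModelExporter.toPMML(model, "/tmp/kmeans.xml")
{code}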

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>Assignee: Vincenzo Selvaggio
> Attachments: MyJPMMLEval.java, SPARK-1406.pdf, SPARK-1406_v2.pdf, 
> kmeans.xml
>
>
> It would be useful if Spark provided support for evaluating PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow analytical models created with a statistical modeling tool 
> like R, SAS, or SPSS to be used with Spark (MLlib), which would perform the 
> actual model evaluation for a given input tuple. The PMML model would then 
> just contain the "parameterization" of an analytical model.
> Other projects, like JPMML-Evaluator, do a similar thing:
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-12-14 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245893#comment-14245893
 ] 

Vincenzo Selvaggio commented on SPARK-1406:
---

At
https://github.com/selvinsource/spark-pmml-exporter-validator
you can find a simple validator project showing that Apache Spark and the JPMML 
Evaluator (loading the PMML exported from Spark) produce comparable predictions, 
which demonstrates that the PMML export from Apache Spark works as expected.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>Assignee: Vincenzo Selvaggio
> Attachments: MyJPMMLEval.java, SPARK-1406.pdf, SPARK-1406_v2.pdf, 
> kmeans.xml
>
>
> It would be useful if Spark provided support for evaluating PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow analytical models created with a statistical modeling tool 
> like R, SAS, or SPSS to be used with Spark (MLlib), which would perform the 
> actual model evaluation for a given input tuple. The PMML model would then 
> just contain the "parameterization" of an analytical model.
> Other projects, like JPMML-Evaluator, do a similar thing:
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib

2014-12-14 Thread Vincenzo Selvaggio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincenzo Selvaggio updated SPARK-1406:
--
Attachment: SPARK-1406_v2.pdf

Updated the document with the models supported so far: KMeansModel, 
LogisticRegressionModel, SVMModel, LinearRegressionModel, RidgeRegressionModel, 
and LassoModel.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>Assignee: Vincenzo Selvaggio
> Attachments: MyJPMMLEval.java, SPARK-1406.pdf, SPARK-1406_v2.pdf, 
> kmeans.xml
>
>
> It would be useful if Spark provided support for evaluating PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow analytical models created with a statistical modeling tool 
> like R, SAS, or SPSS to be used with Spark (MLlib), which would perform the 
> actual model evaluation for a given input tuple. The PMML model would then 
> just contain the "parameterization" of an analytical model.
> Other projects, like JPMML-Evaluator, do a similar thing:
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4842) Use WeakTypeTags in ScalaReflection

2014-12-14 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-4842:
---

 Summary: Use WeakTypeTags in ScalaReflection
 Key: SPARK-4842
 URL: https://issues.apache.org/jira/browse/SPARK-4842
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust
Priority: Critical


Right now we can't create SchemaRDDs from RDDs of case classes that are defined 
inside of functions.  This is because only WeakTypeTags are available in this 
scope.  This is pretty confusing to users:

http://apache-spark-user-list.1001560.n3.nabble.com/parquet-file-not-loading-spark-v-1-1-0-td20618.html#a20628

https://issues.scala-lang.org/browse/SI-6649
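
A minimal sketch of the pattern in question (illustrative names; the nested-case-class lines are commented out because they do not compile today):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WeakTypeTagRepro {
  // Works: the case class is defined at top level, so a full TypeTag is available.
  case class TopLevel(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repro").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    sc.parallelize(Seq(TopLevel("a", 1))).registerTempTable("top_level") // fine

    // Fails today: a case class defined inside a method only gets a WeakTypeTag,
    // so the implicit createSchemaRDD conversion cannot be materialized.
    // case class Nested(name: String, age: Int)
    // sc.parallelize(Seq(Nested("b", 2))).registerTempTable("nested")

    sc.stop()
  }
}
{code}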



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org