[jira] [Created] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark classpath

2015-09-24 Thread Jonathan Kelly (JIRA)
Jonathan Kelly created SPARK-10789:
--

 Summary: Cluster mode SparkSubmit classpath only includes Spark 
classpath
 Key: SPARK-10789
 URL: https://issues.apache.org/jira/browse/SPARK-10789
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Jonathan Kelly


When using cluster deploy mode, the classpath of the SparkSubmit process that 
gets launched only includes the Spark assembly and not 
spark.driver.extraClassPath. This is of course by design, since the driver 
actually runs on the cluster and not inside the SparkSubmit process.

However, if the SparkSubmit process, minimal as it may be, needs any extra 
libraries that are not part of the Spark assembly, there is no good way to 
include them. (I say "no good way" because including them in the 
SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
include them, but this is not acceptable because this environment variable has 
long been deprecated, and it prevents the use of spark.driver.extraClassPath.)

An example of when this matters is on Amazon EMR when using an S3 path for the 
application JAR and running in yarn-cluster mode. The SparkSubmit process needs 
the EmrFileSystem implementation and its dependencies in the classpath in order 
to download the application JAR from S3, so it fails with a 
ClassNotFoundException. (EMR currently gets around this by setting 
SPARK_CLASSPATH, but as mentioned above this is less than ideal.)

I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
classpath whether it's client mode or cluster mode, and this seems to work, but 
I don't know if there is any downside to this.
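For reference, the failure mechanism is visible in the stack trace below: the YARN Client resolves and instantiates the S3 filesystem class inside the SparkSubmit JVM. A minimal sketch of that resolution, assuming a Hadoop configuration that maps the s3 scheme to EmrFileSystem the way EMR does:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Assumption: on EMR, fs.s3.impl points at the EMR filesystem implementation.
hadoopConf.set("fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")

// Client.copyFileToRemote ends up doing the equivalent of this, inside the
// SparkSubmit JVM; EmrFileSystem must be on *that* classpath, otherwise this
// throws the ClassNotFoundException shown below.
val fs = new Path("s3://my-bucket/spark-examples.jar").getFileSystem(hadoopConf)
{code}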

Example that fails on emr-4.0.0 (if you switch to setting 
spark.{driver,executor}.extraClassPath instead of SPARK_CLASSPATH): 
spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
s3://my-bucket/word-count-input.txt

Resulting Exception:
Exception in thread "main" java.lang.RuntimeException: 
java.lang.ClassNotFoundException: Class 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
at 
org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 27 more




[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark classpath

2015-09-24 Thread Jonathan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Kelly updated SPARK-10789:
---
Component/s: Spark Submit

> Cluster mode SparkSubmit classpath only includes Spark classpath
> 
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.{driver,executor}.extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache

[jira] [Created] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Jonathan Kelly (JIRA)
Jonathan Kelly created SPARK-10790:
--

 Summary: Dynamic Allocation does not request any executors if 
first stage needs less than or equal to spark.dynamicAllocation.initialExecutors
 Key: SPARK-10790
 URL: https://issues.apache.org/jira/browse/SPARK-10790
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.5.0
Reporter: Jonathan Kelly
Priority: Critical


If you set spark.dynamicAllocation.initialExecutors > 0 (or 
spark.dynamicAllocation.minExecutors, since 
spark.dynamicAllocation.initialExecutors defaults to 
spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
stage of your job is less than or equal to this min/init number of executors, 
dynamic allocation won't actually request any executors and will just hang 
indefinitely with the warning "Initial job has not accepted any resources; 
check your cluster UI to ensure that workers are registered and have sufficient 
resources".

The cause appears to be that ExecutorAllocationManager does not request any 
executors while the application is still initializing, but it still sets the 
initial value of numExecutorsTarget to 
spark.dynamicAllocation.initialExecutors. Once the job is running and has 
submitted its first task, if the first task does not need more than 
spark.dynamicAllocation.initialExecutors, 
ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think that 
it needs to request any executors, so it doesn't.
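A minimal sketch that, if the analysis above is right, should reproduce the hang. The configuration values are illustrative only, and dynamic allocation is assumed to be otherwise set up (e.g. the external shuffle service enabled on the cluster):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-hang-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "4")
val sc = new SparkContext(conf)

// The first stage has 4 tasks, i.e. no more than initialExecutors, so
// updateAndSyncNumExecutorsTarget() never requests executors and the job
// should hang with the "Initial job has not accepted any resources" warning.
sc.parallelize(1 to 100, 4).count()
{code}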






[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-09-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905940#comment-14905940
 ] 

Liang-Chi Hsieh commented on SPARK-8386:


[~phaumer] I can't reproduce this problem. Can you give me a code snippet that 
triggers it? Thanks.

> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends to a JDBC table new results found at each run. 
>  In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite to true.  I also tried this 
> now:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> getting the same error. It works the first time, creating the new table and 
> adding data successfully, but when running it a second time the JDBC driver 
> tells me that the table already exists. Even SaveMode.Overwrite gives me the 
> same error.
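For reference, a minimal, self-contained sketch of the append pattern described above (Scala API; the connection URL and table name are placeholders, and sqlContext is assumed to be in scope as in spark-shell):

{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

// Placeholder connection details; substitute a real JDBC URL and table.
val connectionUrl = "jdbc:postgresql://localhost/testdb"
val tableName = "test_results"

val testResultsDF = sqlContext
  .createDataFrame(Seq((1, "a"), (2, "b")))
  .toDF("run", "result")

// Expected behavior: create the table on the first run, append on later runs.
testResultsDF.write.mode(SaveMode.Append).jdbc(connectionUrl, tableName, new Properties())
{code}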






[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark classpath

2015-09-24 Thread Jonathan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Kelly updated SPARK-10789:
---
Description: 
When using cluster deploy mode, the classpath of the SparkSubmit process that 
gets launched only includes the Spark assembly and not 
spark.driver.extraClassPath. This is of course by design, since the driver 
actually runs on the cluster and not inside the SparkSubmit process.

However, if the SparkSubmit process, minimal as it may be, needs any extra 
libraries that are not part of the Spark assembly, there is no good way to 
include them. (I say "no good way" because including them in the 
SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
include them, but this is not acceptable because this environment variable has 
long been deprecated, and it prevents the use of spark.driver.extraClassPath.)

An example of when this matters is on Amazon EMR when using an S3 path for the 
application JAR and running in yarn-cluster mode. The SparkSubmit process needs 
the EmrFileSystem implementation and its dependencies in the classpath in order 
to download the application JAR from S3, so it fails with a 
ClassNotFoundException. (EMR currently gets around this by setting 
SPARK_CLASSPATH, but as mentioned above this is less than ideal.)

I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
classpath whether it's client mode or cluster mode, and this seems to work, but 
I don't know if there is any downside to this.

Example that fails on emr-4.0.0 (if you switch to setting 
spark.{driver,executor}.extraClassPath instead of SPARK_CLASSPATH): 
spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
s3://my-bucket/word-count-input.txt

Resulting Exception:
Exception in thread "main" java.lang.RuntimeException: 
java.lang.ClassNotFoundException: Class 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
at 
org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 27 more

  was:
When using cluster deploy mode, the classpath of the SparkSubmit process that 
gets launched only includes the Spark assembly and not 
spark.driver.extraClassPath. This is of course by design, since the driver 
actually runs on the cluster and not inside the SparkSubmit process.

However, if the SparkSubmit process, minimal as i

[jira] [Commented] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905989#comment-14905989
 ] 

Sean Owen commented on SPARK-10790:
---

[~jonathak] A number of quite similar sounding things were fixed in 1.5. Can 
you check vs master before opening a JIRA? I suspect it's a duplicate.

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Priority: Critical
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.






[jira] [Created] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-24 Thread Marko Asplund (JIRA)
Marko Asplund created SPARK-10791:
-

 Summary: Optimize MLlib LDA topic distribution query performance
 Key: SPARK-10791
 URL: https://issues.apache.org/jira/browse/SPARK-10791
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
 Environment: Ubuntu 13.10, Oracle Java 8
Reporter: Marko Asplund


I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size and 
~3.4 M documents using EMLDAOptimizer.

Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit training 
with the same data on the same setup took ~5 minutes. Loading the persisted 
model from disk (~2 minutes), as well as querying LDA model topic distributions 
(~4 seconds for one document), are also quite slow operations.

Our application is querying LDA model topic distribution (for one doc at a 
time) as part of end-user operation execution flow, so a ~4 second execution 
time is very problematic.

The log includes the following message, which, AFAIK, should mean that 
netlib-java is using a machine-optimised native implementation: 
"com.github.fommil.jni.JniLoader - successfully loaded 
/tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"

My test code can be found here:
https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57

I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable change 
in training performance. Model loading time was reduced from ~2 minutes to 
~5 seconds (the model is now persisted as a LocalLDAModel), but query / 
prediction time was unchanged. Unfortunately, this is the critical performance 
characteristic in our case.

I did some profiling of my LDA prototype code that requests topic 
distributions from a model. According to Java Mission Control, more than 80% of 
the execution time during the sampling interval is spent in the following 
methods:

- org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
- org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
- org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
- java.lang.Double.valueOf(double); count: 31; 4.33%

Is there any way of using the API more optimally?
Are there any opportunities for optimising the "topicDistributions" code
path in MLlib?

My query test code looks like this essentially:

// executed once
val model = LocalLDAModel.load(ctx, ModelFileName)

// executed four times
val samples = Transformers.toSparseVectors(vocabularySize,
  ctx.parallelize(Seq(input))) // fast
model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this seems to take about 4 seconds to execute







[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Zsolt Tóth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906008#comment-14906008
 ] 

Zsolt Tóth commented on SPARK-10487:


Increasing the perm size on the driver fixes the OOM: 
spark.driver.extraJavaOptions="-XX:MaxPermSize=128m"
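For reference, one way to pass that option at submit time in cluster deploy mode (the application script name here is a placeholder):

spark-submit --deploy-mode cluster \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=128m \
  my_job.py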

> MLlib model fitting causes DataFrame write to break with OutOfMemory exception
> --
>
> Key: SPARK-10487
> URL: https://issues.apache.org/jira/browse/SPARK-10487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tried in a centos-based 1-node YARN in docker and on a 
> real-world CDH5 cluster
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest 
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
> -DzincPort=3034
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor 
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: <memory:1408, vCores:1>)
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: <memory:1408, vCores:1>)
>Reporter: Zoltan Toth
>
> After fitting a _spark.ml_ or _mllib model_ in *cluster* deploy mode, no 
> dataframes can be written to hdfs. The driver receives an OutOfMemory 
> exception during the writing. It seems, however, that the file gets written 
> successfully.
>  * This happens both in SparkR and pyspark
>  * Only happens in cluster deploy mode
>  * The write fails regardless of the size of the dataframe and whether the 
> dataframe is associated with the ml model.
> REPRO:
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
> from pyspark.ml.classification import LogisticRegression
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
> conf = SparkConf().setAppName("LogRegTest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
> df = training.toDF()
> reg = LogisticRegression().setMaxIter(10).setRegParam(0.01)
> model = reg.fit(df)
> # Note that this is a brand new dataframe:
> one_df = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF()
> one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
> {code}






[jira] [Commented] (SPARK-10773) Repartition operation failing on RDD with "argument type mismatch" error

2015-09-24 Thread Bo soon Park (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906022#comment-14906022
 ] 

Bo soon Park commented on SPARK-10773:
--

I also saw this error in mapr-spark-1.4.1.

[Code]
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Justtest {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
      .setMaster("spark://maprdemo:7077")
      .setAppName("LogNormalizer")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(sparkConf)
    val a = sc.parallelize(List(1, 2, 10, 4, 5, 2, 1, 1, 1), 3).repartition(6)
  }
}



> Repartition operation failing on RDD with "argument type mismatch" error
> 
>
> Key: SPARK-10773
> URL: https://issues.apache.org/jira/browse/SPARK-10773
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Da Fox
>
> Hello,
> Error occurs in the following Spark application:
> {code}
> object RunSpark {
> def main(args: Array[String]) {
> val sparkContext: SparkContext = new SparkContext()
> val data: RDD[String] = sparkContext.textFile("banana-big.tsv")
> val repartitioned: RDD[String] = data.repartition(5)
> val mean: Double = repartitioned
> .groupBy((s: String) => s.split("\t")(1))
> .mapValues((strings: Iterable[String]) =>strings.size)
> .values.mean()
> println(mean)
> }
> }
> {code}
> The exception:
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: argument type 
> mismatch
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$instantiateClass(ClosureCleaner.scala:330)
>   at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$22.apply(ClosureCleaner.scala:268)
>   at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$22.apply(ClosureCleaner.scala:262)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:262)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:700)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:699)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:699)
>   at org.apache.spark.rdd.RDD$$anonfun$coalesce$1.apply(RDD.scala:381)
>   at org.apache.spark.rdd.RDD$$anonfun$coalesce$1.apply(RDD.scala:367)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.coalesce(RDD.scala:366)
>   at org.apache.spark.rdd.RDD$$anonfun$repartition$1.apply(RDD.scala:342)
>   at org.apache.spark.rdd.RDD$$anonfun$repartition$1.apply(RDD.scala:342)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.repartition(RDD.scala:341)
>   at repartitionissue.RunSpark$.main(RunSpark.scala:10)
>   at repartitionissue.RunSpark.main(RunSpark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces

[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906030#comment-14906030
 ] 

Sean Owen commented on SPARK-10644:
---

How many cores per executor? I'm assuming you mean 1 and have configured 
accordingly. I assume you do see 63 executors run successfully. What about 
memory? It could have enough cores but not enough memory.

On a side note, why have 3 executors per worker instead of 1 with 3 cores? I 
get the over-allocating-cores thing, although I wonder out loud whether Spark 
would just let a worker use "10" cores on a 4-core machine if you set it that 
way.
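For context, a sketch of the two standalone-mode knobs under discussion, with illustrative values only (not the reporter's actual configuration; the master URL is assumed to come from spark-submit):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sizing-sketch")
  .set("spark.cores.max", "10")      // per-application cap on total cores
  .set("spark.executor.cores", "3")  // 3 cores per executor => 1 executor per
                                     // 3-core worker instead of 3 x 1-core
val sc = new SparkContext(conf)
{code}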

> Applications wait even if free executors are available
> --
>
> Key: SPARK-10644
> URL: https://issues.apache.org/jira/browse/SPARK-10644
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
> Environment: RHEL 6.5 64 bit
>Reporter: Balagopal Nair
>Priority: Minor
>
> Number of workers: 21
> Number of executors: 63
> Steps to reproduce:
> 1. Run 4 jobs each with max cores set to 10
> 2. The first 3 jobs run with 10 each. (30 executors consumed so far)
> 3. The 4 th job waits even though there are 33 idle executors.
> The reason is that a job will not get executors unless 
> the total number of EXECUTORS in use < the number of WORKERS
> If there are executors available, resources should be allocated to the 
> pending job.






[jira] [Commented] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906047#comment-14906047
 ] 

Jonathan Kelly commented on SPARK-10790:


I did search through all dynamicAllocation-related JIRAs targeted for 1.5.1+ 
before cutting this one, and I didn't find anything. Also, I don't see any diff 
at all in core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala 
between v1.5.0 and the current master (d91967e), though it's possible this 
issue was fixed elsewhere. I can try the latest master, but it doesn't seem 
likely that it has been fixed. Thanks for the response though.

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Priority: Critical
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.






[jira] [Commented] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906055#comment-14906055
 ] 

Sean Owen commented on SPARK-10790:
---

In your email I think you said you were using 1.4.1; just to be clear, are you 
already using 1.5.0?

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Priority: Critical
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.






[jira] [Created] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
Adrian Tanase created SPARK-10792:
-

 Summary: Spark streaming + YARN – executor is not re-created on 
machine restart
 Key: SPARK-10792
 URL: https://issues.apache.org/jira/browse/SPARK-10792
 Project: Spark
  Issue Type: Bug
  Components: Streaming, YARN
Affects Versions: 1.4.0
 Environment: - centos7 deployed on AWS
- yarn / hadoop 2.6.0-cdh5.4.2
- spark 1.4.0 compiled with hadoop 2.6
Reporter: Adrian Tanase


We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
stateful app that reads from kafka (with the new direct API) and we’re 
checkpointing to HDFS.

During some resilience testing, we restarted one of the machines and brought it 
back online. During the offline period, the Yarn cluster would not have 
resources to re-create the missing executor.
After starting all the services on the machine, it correctly joined the Yarn 
cluster; however, the spark streaming app does not seem to notice that the 
resources are back and has not re-created the missing executor.

The app is correctly running with 6 out of 7 executors, but it's running 
under capacity.
If we manually kill the driver and re-submit the app to yarn, all the state is 
correctly recreated from checkpoint and all 7 executors are now online – 
however, this seems like a brutal workaround.

Scenarios tested to isolate the issue:

The expected outcome after a machine reboot + services coming back is that 
processing continues on it. *FAILED* below means that processing continues at 
reduced capacity, as the lost machine rarely re-joins as a container/executor 
even though YARN sees it as a healthy node.

|| No || Failure scenario || test result || data loss || Notes ||
| 1  | Single node restart | FAILED | NO | Executor NOT redeployed when machine 
comes back and services are restarted |
| 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
restoring services on machines that are down, the app OR kafka OR zookeeper 
metadata gets corrupted, app crashes and can't be restarted w/o clearing 
checkpoint -> dataloss. Root cause is unhealthy cluster when too many machines 
are lost. |
| 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
restart, driver does not crash |
| 4  | Graceful services restart | FAILED | NO | Behaves just like single node 
restart even if we take the time to manually stop services before machine 
reboot. |
| 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
will usually start even if YARN can't fulfill all the resource requests (e.g. 
5 out of 7 nodes are up when app is started). However, when the nodes are added 
to YARN, we see that Spark deploys executors on them, as expected in all the 
scenarios. |
| 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts it 
behaves like machine restart - the rest work as expected, container/executor 
are redeployed in a matter of seconds |
| 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
validate whether the behavior is caused by maxing out the cluster and having no 
slack to redeploy a crashed node. We still see the single-node-restart behavior 
even with lots of extra capacity in YARN - nodes, cores and RAM. |

*Logs for Scenario 6 – correct behavior on process restart*
{noformat}
2015-09-21 11:00:11,193 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
2015-09-21 11:00:11,193 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal

..
(logical continuation from earlier restart attempt)

2015-09-21 10:33:20,658 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
2015-09-21 10:33:20,658 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
capability: <memory:18022, vCores:14>)

..

2015-09-21 10:33:25,663 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
container_1442827158253_0004_01_12 for on host ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. 
driverUrl: akka.tcp://sparkDriver@10.0.1.14:32938/user/CoarseGrainedScheduler,  
executorHostname: ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,6

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10792:
--
Priority: Minor  (was: Major)

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster; however, the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, but it's running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however, this seems like a brutal workaround.
> Scenarios tested to isolate the issue:
> The expected outcome after a machine reboot + services coming back is that 
> processing continues on it. *FAILED* below means that processing continues at 
> reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even though YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate whether the behavior is caused by maxing out the cluster and having 
> no slack to redeploy a crashed node. We still see the single-node-restart 
> behavior even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: <memory:18022, vCores:14>)
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yar

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906068#comment-14906068
 ] 

Adrian Tanase commented on SPARK-10792:
---

https://issues.apache.org/jira/browse/SPARK-8297 seems to be related; however, 
I can't upgrade to 1.5 yet because of 
https://issues.apache.org/jira/browse/SPARK-8630, which didn't seem to make it 
into the 1.5.0 release.

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster; however, the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, but it's running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however, this seems like a brutal workaround.
> Scenarios tested to isolate the issue:
> The expected outcome after a machine reboot + services coming back is that 
> processing continues on it. *FAILED* below means that processing continues at 
> reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even though YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate whether the behavior is caused by maxing out the cluster and having 
> no slack to redeploy a crashed node. We still see the single-node-restart 
> behavior even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906081#comment-14906081
 ] 

Sean Owen commented on SPARK-10792:
---

I wonder if this is interacting with a blacklist mechanism? Sort of a guess 
here.

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster; however, the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, but it's running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however, this seems like a brutal workaround.
> Scenarios tested to isolate the issue:
> The expected outcome after a machine reboot + services coming back is that 
> processing continues on it. *FAILED* below means that processing continues at 
> reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even though YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate whether the behavior is caused by maxing out the cluster and having 
> no slack to redeploy a crashed node. We still see the single-node-restart 
> behavior even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request 

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Tanase updated SPARK-10792:
--
Description: 
We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
stateful app that reads from kafka (with the new direct API) and we’re 
checkpointing to HDFS.

During some resilience testing, we restarted one of the machines and brought it 
back online. During the offline period, the Yarn cluster would not have 
resources to re-create the missing executor.
After starting all the services on the machine, it correctly joined the Yarn 
cluster, however the spark streaming app does not seem to notice that the 
resources are back and has not re-created the missing executor.

The app is correctly running with 6 out of 7 executors, however it’s running 
under capacity.
If we manually kill the driver and re-submit the app to yarn, all the state is 
correctly recreated from checkpoint and all 7 executors are now online – 
however this seems like a brutal workaround.
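
(For reference, this kill-and-resubmit recovery works because checkpointed 
streaming apps are built through StreamingContext.getOrCreate; a minimal 
sketch of the pattern, where the checkpoint path and batch interval are 
hypothetical:)

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // hypothetical path

// Invoked only when no checkpoint exists; after killing the driver and
// re-submitting, the DStream graph and offsets are restored from the
// checkpoint instead of being rebuilt here.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... create the direct Kafka stream and stateful transformations here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}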

*Scenarios tested to isolate the issue:*

The expected outcome after a machine reboot + services back is that processing 
continues on it. *FAILED* below means that processing continues at reduced 
capacity, as the lost machine rarely re-joins as a container/executor even if 
YARN sees it as a healthy node.

|| No || Failure scenario || test result || data loss || Notes ||
| 1  | Single node restart | FAILED | NO | Executor NOT redeployed when machine 
comes back and services are restarted |
| 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
restoring services on machines that are down, the app OR kafka OR zookeeper 
metadata gets corrupted, app crashes and can't be restarted w/o clearing 
checkpoint -> dataloss. Root cause is unhealthy cluster when too many machines 
are lost. |
| 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
restart, driver does not crash |
| 4  | Graceful services restart | FAILED | NO | Behaves just like single node 
restart even if we take the time to manually stop services before machine 
reboot. |
| 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
will usually start even if YARN can't fulfill all the resource requests (e.g. 
5 out of 7 nodes are up when app is started). However, when the nodes are added 
to YARN, we see that Spark deploys executors on them, as expected in all the 
scenarios. |
| 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts it 
behaves like machine restart - the rest work as expected, container/executor 
are redeployed in a matter of seconds |
| 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
validate if the behavior is caused by maxing out the cluster and having no 
slack to redeploy a crashed node. We are still behaving like single node 
restart even with lots of extra capacity in YARN - nodes, cores and RAM. |

*Logs for Scenario 6 – correct behavior on process restart*
{noformat}
2015-09-21 11:00:11,193 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
2015-09-21 11:00:11,193 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal

..
(logical continuation from earlier restart attempt)

2015-09-21 10:33:20,658 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
2015-09-21 10:33:20,658 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
capability: <memory:18022, vCores:14>)

..

2015-09-21 10:33:25,663 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
container_1442827158253_0004_01_12 for on host ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. 
driverUrl: akka.tcp://sparkDriver@10.0.1.14:32938/user/CoarseGrainedScheduler,  
executorHostname: ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO  
org.apache.spark.deploy.yarn.YarnAllocator - Received 1 containers from YARN, 
launching executors on 1 of them.
{noformat}


*Logs for Scenario 1 – weird resource requests / behavior on node restart*

{noformat}
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO  
org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated 
or disconnected! Shutting down. ip-10-0-1-16.ec2.internal:34741
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-24] ERROR 
org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906109#comment-14906109
 ] 

Adrian Tanase commented on SPARK-10792:
---

Yarn side or Spark side? If it does, shouldn't that also apply to scenario #6, 
where we're just killing the executor process? That one gets redeployed in 
tens of seconds.

My assumption from the logs is that YarnAllocator is not tracking resource 
requests correctly: instead of always trying to get back to 7 containers, it 
remains satisfied with 6, 5, 4 and so on.
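
One way to probe that assumption from the driver is Spark's developer API for 
requesting executors (a hedged sketch; YARN mode only, and the count below is 
hypothetical):

{code}
// Hedged sketch: SparkContext.requestExecutors is a @DeveloperApi that asks
// the cluster manager for additional executors. If the allocator has simply
// stopped asking, this call should bring the missing executor back.
val missing = 1  // hypothetical: executors lost relative to the target of 7
val accepted = sc.requestExecutors(missing)
if (!accepted) {
  println("Scheduler backend did not accept the executor request")
}
{code}

If the executor comes back only after such an explicit request, that would 
support the theory that the allocator stops asking rather than being unable 
to allocate.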

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues 
> at reduced capacity, as the lost machine rarely re-joins as a 
> container/executor even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.Yarn

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Tanase updated SPARK-10792:
--
Priority: Minor  (was: Major)

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues 
> at reduced capacity, as the lost machine rarely re-joins as a 
> container/executor even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: <memory:18022, vCores:14>)
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> container_1442827158253_0004_01_12 for on host ip-10-0-1-16.ec2.internal
> 2015-09-21 10:33:25,664 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. 
> driverUrl: 
> akka.tcp:

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Tanase updated SPARK-10792:
--
Priority: Major  (was: Minor)

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues 
> at reduced capacity, as the lost machine rarely re-joins as a 
> container/executor even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: <memory:18022, vCores:14>)
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> container_1442827158253_0004_01_12 for on host ip-10-0-1-16.ec2.internal
> 2015-09-21 10:33:25,664 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. 
> driverUrl: 
> akka.tcp://sparkDriver@10.0.1.14:32938

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906116#comment-14906116
 ] 

Sean Owen commented on SPARK-10792:
---

Both potentially, though I mean the Spark side. In scenario 6 it is 
successfully restarting the executor quickly, right? As opposed to being 
unable to start an executor on the host for an extended period.

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues 
> at reduced capacity, as the lost machine rarely re-joins as a 
> container/executor even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: <memory:18022, vCores:14>)
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> container

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906133#comment-14906133
 ] 

Adrian Tanase commented on SPARK-10792:
---

Correct - I forgot to attach a screenshot where this is obvious:
https://issues.apache.org/jira/secure/attachment/12762103/Screen%20Shot%202015-09-21%20at%201.58.28%20PM.png

You can see that when executor 7 dies, executor 8 is created right away, as 
opposed to what happens when the subsequent nodes are restarted.

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
> Attachments: Screen Shot 2015-09-21 at 1.58.28 PM.png
>
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues 
> at reduced capacity, as the lost machine rarely re-joins as a 
> container/executor even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request

[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-09-24 Thread Konstantinos Kougios (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906126#comment-14906126
 ] 

Konstantinos Kougios commented on SPARK-5928:
-

Same issue here with spark 1.5.0, probably caused by

rdd.keyBy(... key...).reduceByKey {...}
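
A minimal sketch of how that pattern can hit the 2 GB limit (sizes, the key 
function and the accumulating reduce below are hypothetical, chosen to mirror 
the repro in the issue description): when the reduce function accumulates 
values rather than shrinking them, map-side combine does not reduce the 
shuffled bytes, so a hot key can push a single shuffle block past 2 GB.

{code}
// Hedged sketch: a single hot key plus an accumulating reduce function can
// produce one shuffle block over 2 GB, triggering "Adjusted frame length
// exceeds 2147483647" on the remote fetch. Needs a large executor heap.
val rdd = sc.parallelize(1 to 1000000, 1).map { _ =>
  val payload = new Array[Byte](3000)
  scala.util.Random.nextBytes(payload)  // keep the data incompressible
  payload
}
rdd.keyBy(_ => 1)          // everything maps to one key
   .mapValues(List(_))
   .reduceByKey(_ ::: _)   // accumulating reduce defeats map-side combine
   .count()
{code}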

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Tanase updated SPARK-10792:
--
Attachment: Screen Shot 2015-09-21 at 1.58.28 PM.png

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
> Attachments: Screen Shot 2015-09-21 at 1.58.28 PM.png
>
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues 
> at reduced capacity, as the lost machine rarely re-joins as a 
> container/executor even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: <memory:18022, vCores:14>)
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> container_1442827158253_0004_01_12 for on host ip-10-0-1-16.ec2.internal
> 2015-09-21 10:33:25,664 [Reporter] INFO  
> org.apache

[jira] [Created] (SPARK-10793) Make sparks use/subclassing of hive more maintainable

2015-09-24 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-10793:
--

 Summary: Make sparks use/subclassing of hive more maintainable
 Key: SPARK-10793
 URL: https://issues.apache.org/jira/browse/SPARK-10793
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Steve Loughran


The latest spark/hive integration round has closed the gap with Hive versions, 
but the integration is still pretty complex:
# SparkSQL has deep hooks into the parser
# hivethriftserver uses "aggressive reflection" to inject spark classes into 
the Hive base classes.
# there's a separate org.sparkproject.hive JAR to isolate Kryo versions while 
avoiding the hive uberjar, with all its dependencies, getting into the spark 
uberjar.

We can improve this with some assistance from the other projects, even though 
guarantees of stability for things like the parser and thrift server APIs are 
unlikely in the near future.






[jira] [Commented] (SPARK-10793) Make sparks use/subclassing of hive more maintainable

2015-09-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906172#comment-14906172
 ] 

Steve Loughran commented on SPARK-10793:


Leaving the SQL/hive parser integration alone, this is what can be done by 
making some basic changes upstream:

# move spark up to the same version of kryo that hive uses, currently 2.22. 
This has to be done via an upgrade to chill: 
[https://github.com/steveloughran/chill/tree/feature/support-kryo-2.22]
# have the hive thrift server services call protected methods to create 
various classes (e.g. the cli), calls which can be overridden in subclasses 
(sketched below): 
[https://github.com/steveloughran/hive/tree/stevel/1.2.2-SNAPSHOT-managed-thriftserver]
# have spark code directly import hive artifacts, and the hive thriftserver 
subclasses override the creator methods to build up their service without any 
reflection. 
[https://github.com/steveloughran/spark/tree/stevel/feature/SPARK-10793-managed-hive]
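
To illustrate the pattern behind points 2 and 3, a minimal sketch of the 
protected-creator-method approach (all class and method names here are 
hypothetical, not the actual Hive or Spark APIs):

{code}
// Upstream (Hive side): the service exposes a protected creator method
// instead of hard-wiring the concrete class.
class CLIService
class ThriftServerService {
  protected def createCLIService(): CLIService = new CLIService()
  def start(): Unit = {
    val cli = createCLIService()  // subclass hook, no reflection needed
    // ... wire cli into the rest of the service ...
  }
}

// Downstream (Spark side): subclass and override the creator method,
// replacing today's reflection-based injection.
class SparkCLIService extends CLIService
class SparkThriftServerService extends ThriftServerService {
  override protected def createCLIService(): CLIService =
    new SparkCLIService()  // Spark-specific behavior plugged in directly
}
{code}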

> Make sparks use/subclassing of hive more maintainable
> -
>
> Key: SPARK-10793
> URL: https://issues.apache.org/jira/browse/SPARK-10793
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Steve Loughran
>
> The latest spark/hive integration round has closed the gap with Hive 
> versions, but the integration is still pretty complex:
> # SparkSQL has deep hooks into the parser
> # hivethriftserver uses "aggressive reflection" to inject spark classes into 
> the Hive base classes.
> # there's a separate org.sparkproject.hive JAR to isolate Kryo versions while 
> avoiding the hive uberjar, with all its dependencies, getting into the spark 
> uberjar.
> We can improve this with some assistance from the other projects, even though 
> guarantees of stability for things like the parser and thrift server APIs 
> are unlikely in the near future.






[jira] [Assigned] (SPARK-9346) Conversion is applied three times on partitioned data sources that require conversion

2015-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9346:
---

Assignee: Apache Spark

> Conversion is applied three times on partitioned data sources that require 
> conversion
> -
>
> Key: SPARK-9346
> URL: https://issues.apache.org/jira/browse/SPARK-9346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See https://github.com/apache/spark/pull/7649






[jira] [Commented] (SPARK-9346) Conversion is applied three times on partitioned data sources that require conversion

2015-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906197#comment-14906197
 ] 

Apache Spark commented on SPARK-9346:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8903

> Conversion is applied three times on partitioned data sources that require 
> conversion
> -
>
> Key: SPARK-9346
> URL: https://issues.apache.org/jira/browse/SPARK-9346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> See https://github.com/apache/spark/pull/7649






[jira] [Assigned] (SPARK-9346) Conversion is applied three times on partitioned data sources that require conversion

2015-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9346:
---

Assignee: (was: Apache Spark)

> Conversion is applied three times on partitioned data sources that require 
> conversion
> -
>
> Key: SPARK-9346
> URL: https://issues.apache.org/jira/browse/SPARK-9346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> See https://github.com/apache/spark/pull/7649






[jira] [Created] (SPARK-10794) Spark-SQL- select query on table column with binary Data Type displays error message- java.lang.ClassCastException: java.lang.String cannot be cast to [B

2015-09-24 Thread Anilkumar Kalshetti (JIRA)
Anilkumar Kalshetti created SPARK-10794:
---

 Summary: Spark-SQL- select query on table column with binary Data 
Type displays error message- java.lang.ClassCastException: java.lang.String 
cannot be cast to [B
 Key: SPARK-10794
 URL: https://issues.apache.org/jira/browse/SPARK-10794
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Spark 1.5.0 running on MapR 5.0 sandbox
Reporter: Anilkumar Kalshetti
Priority: Minor


Use beeline interface for Spark-SQL

1] Execute below query to create Table,

CREATE TABLE default.testbinary  ( 
c1 binary, 
c2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2] Copy the attachment file testbinary.txt to the VM directory /home/mapr/data/
and execute the below script to load data into the table

LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary

//testbinary.txt  contains data
1001,'russia'

3] Execute below 'Describe' command to get table information, and select 
command to get table data
describe  testbinary;

SELECT c1 FROM testbinary;

4] Select query displays error message:
 java.lang.ClassCastException: java.lang.String cannot be cast to [B 

Info: for the same table, a select query on column c2 (string datatype) works 
properly
SELECT c2 FROM testbinary;

Please refer to screenshot binaryDataType.png






[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-09-24 Thread simon.lou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906222#comment-14906222
 ] 

simon.lou commented on SPARK-7483:
--

Kryo does not support ListBuffer because ListBuffer doesn't have a "zero 
argument constructor".
Refer to: 
https://github.com/EsotericSoftware/kryo#using-standard-java-serialization

"By default, if a class has a zero argument constructor then it is invoked via 
ReflectASM or reflection, otherwise an exception is thrown. "

Is that the reason?

When using Kryo, use ArrayBuffer instead of ListBuffer.
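
A hedged sketch of the configuration involved and the suggested substitution 
(the Summary class below is a hypothetical stand-in for user code, not the 
MLlib internals):

{code}
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer

// Hypothetical user class: holding an ArrayBuffer rather than a ListBuffer
// sidesteps the serialization failure described in this issue.
case class Summary(nodes: ArrayBuffer[String])

val conf = new SparkConf()
  .setAppName("kryo-fpgrowth-repro")
  // Same setting as in the issue description; with Kryo enabled,
  // the ListBuffer-typed field is what fails to deserialize.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering user classes is optional but avoids serializing class names.
  .registerKryoClasses(Array(classOf[Summary]))
{code}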

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.






[jira] [Updated] (SPARK-10794) Spark-SQL- select query on table column with binary Data Type displays error message- java.lang.ClassCastException: java.lang.String cannot be cast to [B

2015-09-24 Thread Anilkumar Kalshetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anilkumar Kalshetti updated SPARK-10794:

Description: 
Spark-SQL connected to Hive Metastore-- MapR5.0 has Hive 1.0.0
Use beeline interface for Spark-SQL

1] Execute below query to create Table,

CREATE TABLE default.testbinary  ( 
c1 binary, 
c2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2] Copy the attachment file testbinary.txt to the VM directory /home/mapr/data/
and execute the below script to load data into the table

LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary

//testbinary.txt  contains data
1001,'russia'

3] Execute below 'Describe' command to get table information, and select 
command to get table data
describe  testbinary;

SELECT c1 FROM testbinary;

4] Select query displays error message:
 java.lang.ClassCastException: java.lang.String cannot be cast to [B 

Info: for the same table, a select query on column c2 (string datatype) works 
properly
SELECT c2 FROM testbinary;

Please refer to screenshot binaryDataType.png

  was:
Use beeline interface for Spark-SQL

1] Execute below query to create Table,

CREATE TABLE default.testbinary  ( 
c1 binary, 
c2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2] Copy the attachment file: testbinary.txt in VM directory - /home/mapr/data/
and execute below script to load data in table

LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary

//testbinary.txt  contains data
1001,'russia'

3] Execute below 'Describe' command to get table information, and select 
command to get table data
describe  testbinary;

SELECT c1 FROM testbinary;

4] Select query displays error message:
 java.lang.ClassCastException: java.lang.String cannot be cast to [B 

Info:  for same table - select query on column c2 - string datatype works 
properly
SELECT c2 FROM testbinary;

Please refer screenshot- binaryDataType.png


> Spark-SQL- select query on table column with binary Data Type displays error 
> message- java.lang.ClassCastException: java.lang.String cannot be cast to [B
> -
>
> Key: SPARK-10794
> URL: https://issues.apache.org/jira/browse/SPARK-10794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Spark 1.5.0 running on MapR 5.0 sandbox
>Reporter: Anilkumar Kalshetti
>Priority: Minor
> Attachments: binaryDataType.png, testbinary.txt
>
>
> Spark-SQL connected to Hive Metastore-- MapR5.0 has Hive 1.0.0
> Use beeline interface for Spark-SQL
> 1] Execute below query to create Table,
> CREATE TABLE default.testbinary  ( 
> c1 binary, 
> c2 string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
> 2] Copy the attachment file testbinary.txt to the VM directory 
> /home/mapr/data/ and execute the below script to load data into the table
> LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary
> //testbinary.txt  contains data
> 1001,'russia'
> 3] Execute below 'Describe' command to get table information, and select 
> command to get table data
> describe  testbinary;
> SELECT c1 FROM testbinary;
> 4] Select query displays error message:
>  java.lang.ClassCastException: java.lang.String cannot be cast to [B 
> Info: for the same table, a select query on column c2 (string datatype) 
> works properly
> SELECT c2 FROM testbinary;
> Please refer to screenshot binaryDataType.png






[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-09-24 Thread simon.lou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906221#comment-14906221
 ] 

simon.lou commented on SPARK-7483:
--

Kryo does not support ListBuffer because ListBuffer doesn't have a "zero 
argument constructor".
Refer to: 
https://github.com/EsotericSoftware/kryo#using-standard-java-serialization

"By default, if a class has a zero argument constructor then it is invoked via 
ReflectASM or reflection, otherwise an exception is thrown. "

Is that the reason?

When using Kryo, use ArrayBuffer instead of ListBuffer.

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.






[jira] [Issue Comment Deleted] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-09-24 Thread simon.lou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

simon.lou updated SPARK-7483:
-
Comment: was deleted

(was: kyro not support ListBuffer because ListBuffer don't have any "zero 
argument constructor".
refer to : 
https://github.com/EsotericSoftware/kryo#using-standard-java-serialization

"By default, if a class has a zero argument constructor then it is invoked via 
ReflectASM or reflection, otherwise an exception is thrown. "

Is that the reason?

When using kyro , use ArrayBuffer instead of ListBuffer)

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10794) Spark-SQL- select query on table column with binary Data Type displays error message- java.lang.ClassCastException: java.lang.String cannot be cast to [B

2015-09-24 Thread Anilkumar Kalshetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anilkumar Kalshetti updated SPARK-10794:

Attachment: binaryDataType.png
testbinary.txt

> Spark-SQL- select query on table column with binary Data Type displays error 
> message- java.lang.ClassCastException: java.lang.String cannot be cast to [B
> -
>
> Key: SPARK-10794
> URL: https://issues.apache.org/jira/browse/SPARK-10794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Spark 1.5.0 running on MapR 5.0 sandbox
>Reporter: Anilkumar Kalshetti
>Priority: Minor
> Attachments: binaryDataType.png, testbinary.txt
>
>
> Use the Beeline interface for Spark SQL.
> 1] Execute the query below to create the table:
> CREATE TABLE default.testbinary (
> c1 binary,
> c2 string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
> 2] Copy the attached file testbinary.txt into the VM directory 
> /home/mapr/data/ and execute the statement below to load data into the table:
> LOAD DATA LOCAL INPATH '/home/mapr/data/testbinary.txt' INTO TABLE testbinary
> //testbinary.txt contains the data:
> 1001,'russia'
> 3] Execute the 'describe' command below to get the table information, and 
> the select query to get the table data:
> describe testbinary;
> SELECT c1 FROM testbinary;
> 4] The select query displays the error message:
> java.lang.ClassCastException: java.lang.String cannot be cast to [B
> Note: for the same table, a select query on column c2 (string data type) 
> works properly:
> SELECT c2 FROM testbinary;
> Please refer to the screenshot binaryDataType.png.
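
A minimal cross-check sketch outside Beeline (assuming a Spark 1.5 shell; the 
table name comes from the report, and decoding the bytes as UTF-8 is purely 
illustrative): reading the binary column through the DataFrame API should 
yield Array[Byte].

{code}
// In spark-shell, where sqlContext is provided by the shell.
val df = sqlContext.table("default.testbinary")

// For a binary column, each row should carry Array[Byte], not String.
df.select("c1").collect().foreach { row =>
  val bytes = row.getAs[Array[Byte]](0)
  println(new String(bytes, "UTF-8"))
}
{code}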



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2015-09-24 Thread Harshit (JIRA)
Harshit created SPARK-10795:
---

 Summary: FileNotFoundException while deploying pyspark job on 
cluster
 Key: SPARK-10795
 URL: https://issues.apache.org/jira/browse/SPARK-10795
 Project: Spark
  Issue Type: Bug
  Components: PySpark
 Environment: EMR 
Reporter: Harshit


I am trying to run a simple Spark job using PySpark. It works standalone, but 
it fails when I deploy it over the cluster.

Events:

2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) - 
Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

The resource file upload above is successful, and I manually checked that the 
file is present at the specified path, but after a while I face the following 
error:

Diagnostics: File does not exist: 
hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
java.io.FileNotFoundException: File does not exist: 
hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
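
For reference, a hedged sketch of the manual existence check described above, 
using the Hadoop FileSystem API (it assumes execution as the hadoop user on 
the cluster; the application ID is copied from the log):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A relative path resolves against the default filesystem and the current
// user's home directory, i.e. /user/hadoop in the log above.
val fs = FileSystem.get(new Configuration())
val staged = new Path(".sparkStaging/application_1439967440341_0461/pyspark.zip")
println(s"exists: ${fs.exists(staged)}")
{code}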



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2015-09-24 Thread Harshit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harshit updated SPARK-10795:

Description: 
I am trying to run a simple Spark job using PySpark. It works standalone, but 
it fails when I deploy it over the cluster.

Events:

2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) - 
Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

The resource file upload above is successful, and I manually checked that the 
file is present at the specified path, but after a while I face the following 
error:

Diagnostics: File does not exist: 
hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
java.io.FileNotFoundException: File does not exist: 
hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

  was:
I am trying to run a simple Spark job using PySpark. It works standalone, but 
it fails when I deploy it over the cluster.

Events:

2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) - 
Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

The resource file upload above is successful, and I manually checked that the 
file is present at the specified path, but after a while I face the following 
error:

Diagnostics: File does not exist: 
hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
java.io.FileNotFoundException: File does not exist: 
hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip


> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run a simple Spark job using PySpark. It works standalone, 
> but it fails when I deploy it over the cluster.
> Events:
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> The resource file upload above is successful, and I manually checked that 
> the file is present at the specified path, but after a while I face the 
> following error:
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906272#comment-14906272
 ] 

Apache Spark commented on SPARK-10778:
--

User 'y-shimizu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8904

> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Trivial
>  Labels: starter
>
> pretty print for association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}
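
A rough sketch of what such a toString could look like (illustrative only, not 
the merged patch; it assumes Rule's existing antecedent, consequent, and 
confidence members):

{code}
// Sketch of a toString inside mllib's AssociationRules.Rule.
override def toString: String = {
  s"${antecedent.mkString("{", ", ", "}")} => " +
    s"${consequent.mkString("{", ", ", "}")}: $confidence"
}
{code}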



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10778:


Assignee: Apache Spark

> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> pretty print for association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10778:


Assignee: (was: Apache Spark)

> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Trivial
>  Labels: starter
>
> pretty print for association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906282#comment-14906282
 ] 

Apache Spark commented on SPARK-6028:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/8905

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-24 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906291#comment-14906291
 ] 

Yi Zhou commented on SPARK-10474:
-

Hi [~andrewor14] [~yhuai]. It's OK for me and I get no errors. Thanks!

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note: store_sales is a big fact table and item is a small dimension table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

--

[jira] [Created] (SPARK-10796) A stage's TaskSets may all be removed while the stage still has pending partitions after losing some executors

2015-09-24 Thread SuYan (JIRA)
SuYan created SPARK-10796:
-

 Summary: A stage's TaskSets may all be removed while the stage 
still has pending partitions after losing some executors
 Key: SPARK-10796
 URL: https://issues.apache.org/jira/browse/SPARK-10796
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: SuYan


We hit this problem in Spark 1.3.0, and I have also checked the latest Spark 
code; I think the problem still exists.

1. When a stage hits a FetchFailed, the running stage is resubmitted, and the 
previous task set is marked as a zombie.

2. If an executor is then lost, the zombie task set may lose the results of 
its already-successful tasks. The current code will resubmit them, but that 
is useless: the task set is a zombie, so the tasks will never be scheduled 
again.

So if the active task set and the zombie task set finish all of the tasks in 
their running-task lists, Spark will consider them finished, but the running 
stage still has pending partitions. The job then hangs, because there is no 
logic to re-run those pending partitions.

The driver logic is complicated, so it would be helpful if someone could 
check this.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-24 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906372#comment-14906372
 ] 

Kai Jiang commented on SPARK-10688:
---

Working on it~

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.
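
For context, a minimal sketch of the Scala API (added by SPARK-10686) that a 
Python wrapper would mirror; the column names and the {{training}} DataFrame 
are assumptions for illustration:

{code}
import org.apache.spark.ml.regression.AFTSurvivalRegression

// training: a DataFrame with features, label, and censor columns (assumed).
val aft = new AFTSurvivalRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setCensorCol("censor")
val model = aft.fit(training)
{code}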



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10797) RDD's coalesce should not write out the temporary key

2015-09-24 Thread JIRA
Zoltán Zvara created SPARK-10797:


 Summary: RDD's coalesce should not write out the temporary key
 Key: SPARK-10797
 URL: https://issues.apache.org/jira/browse/SPARK-10797
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zoltán Zvara


It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself. 
The HashPartitioner
  // will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using the {{position}}s as keys, as generated in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to 
the specified partition. In the next stage, after reading, we take only the 
values via {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.
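
A minimal sketch of the code path in question, runnable in a Spark shell 
(partition counts and data are arbitrary): with {{shuffle = true}}, every 
element takes the {{(position, t)}} detour quoted above before only the 
values are kept.

{code}
// In spark-shell, where sc is provided by the shell.
val rdd = sc.parallelize(1 to 1000000, 100)

// shuffle = true takes the quoted branch: each element is paired with a
// temporary Int key, shuffled with HashPartitioner, and projected back to
// values on the read side.
val coalesced = rdd.coalesce(10, shuffle = true)
println(coalesced.partitions.length) // 10
{code}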



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10797) RDD's coalesce should not write out the temporary key

2015-09-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906397#comment-14906397
 ] 

Zoltán Zvara commented on SPARK-10797:
--

I have prepared a solution for this, because I had to extract keys from the 
original values written out in order to visualize key distributions. 
Basically, I tell the {{ShuffleHandle}} set up at the {{ShuffledRDD}}'s 
{{ShuffleDependency}} to write and read only values (essentially the bare 
objects), so the {{DiskBlockObjectWriter}} writes out objects instead of a 
key and a value for each record.

I'm willing to raise a pull request for this if you think it is a good 
approach.

> RDD's coalesce should not write out the temporary key
> -
>
> Key: SPARK-10797
> URL: https://issues.apache.org/jira/browse/SPARK-10797
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Zoltán Zvara
>
> It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle 
> files) temporary keys used on the shuffle code path. Consider the following 
> code:
> {code:title=RDD.scala|borderStyle=solid}
> if (shuffle) {
>   /** Distributes elements evenly across output partitions, starting from 
> a random partition. */
>   val distributePartition = (index: Int, items: Iterator[T]) => {
> var position = (new Random(index)).nextInt(numPartitions)
> items.map { t =>
>   // Note that the hash code of the key will just be the key itself. 
> The HashPartitioner
>   // will mod it with the number of total partitions.
>   position = position + 1
>   (position, t)
> }
>   } : Iterator[(Int, T)]
>   // include a shuffle step so that our upstream tasks are still 
> distributed
>   new CoalescedRDD(
> new ShuffledRDD[Int, T, 
> T](mapPartitionsWithIndex(distributePartition),
> new HashPartitioner(numPartitions)),
> numPartitions).values
> } else {
> {code}
> {{ShuffledRDD}} will hash using the {{position}}s as keys, as generated in 
> the {{distributePartition}} function. After the bucket has been chosen by 
> the sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
> {{DiskBlockObjectWriter}} writes out both the (temporary) key and the value 
> to the specified partition. In the next stage, after reading, we take only 
> the values via {{PairRDDFunctions}}.
> This certainly has a performance impact, as we unnecessarily write/read 
> {{Int}}s and transform the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10798) JsonMappingException with Spark Context Parallelize

2015-09-24 Thread Dev Lakhani (JIRA)
Dev Lakhani created SPARK-10798:
---

 Summary: JsonMappingException with Spark Context Parallelize
 Key: SPARK-10798
 URL: https://issues.apache.org/jira/browse/SPARK-10798
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
 Environment: Linux, Java 1.8.40
Reporter: Dev Lakhani


When trying to create an RDD of Rows using a Java Spark Context:

List<Row> rows = new Vector<>();
rows.add(RowFactory.create("test"));
javaSparkContext.parallelize(rows);

I get:

com.fasterxml.jackson.databind.JsonMappingException: (None,None) (of class 
scala.Tuple2) (through reference chain: 
org.apache.spark.rdd.RDDOperationScope["parent"])
   at 
com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:210)
   at 
com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:177)
   at 
com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:187)
   at 
com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:647)
   at 
com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152)
   at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
   at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
   at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
   at 
org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:50)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:141)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
   at 
org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
   at 
org.apache.spark.SparkContext.parallelize(SparkContext.scala:714)
   at 
org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:145)
   at 
org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:157)
   ...
Caused by: scala.MatchError: (None,None) (of class scala.Tuple2)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply$mcV$sp(OptionSerializerModule.scala:32)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
   at scala.Option.getOrElse(Option.scala:120)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:31)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:22)
   at 
com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505)
   at 
com.fasterxml.jackson.module.scala.ser.OptionPropertyWriter.serializeAsField(OptionSerializerModule.scala:128)
   at 
com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639)
   ... 19 more

I've tried updating jackson-module-scala to 2.6.1, but I hit the same issue. 
This happens in local mode with Java 1.8.0_40.
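
If the root cause turns out to be mixed Jackson versions on the classpath (a 
common source of this kind of OptionSerializer MatchError, though not 
confirmed in this report), a build-side sketch; the versions below are 
illustrative:

{code}
// build.sbt fragment (sketch): keep jackson-databind and the Scala module
// on the same release line so their serializers agree.
libraryDependencies ++= Seq(
  "com.fasterxml.jackson.core"   %  "jackson-databind"      % "2.6.1",
  "com.fasterxml.jackson.module" %% "jackson-module-scala"  % "2.6.1"
)
{code}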
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10798) JsonMappingException with Spark Context Parallelize

2015-09-24 Thread Dev Lakhani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dev Lakhani updated SPARK-10798:

Description: 
When trying to create an RDD of Rows using a Java Spark Context:

List<Row> rows = new Vector<>();
rows.add(RowFactory.create("test"));
javaSparkContext.parallelize(rows);

I get:

com.fasterxml.jackson.databind.JsonMappingException: (None,None) (of class 
scala.Tuple2) (through reference chain: 
org.apache.spark.rdd.RDDOperationScope["parent"])
   at 
com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:210)
   at 
com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:177)
   at 
com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:187)
   at 
com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:647)
   at 
com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152)
   at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
   at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
   at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
   at 
org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:50)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:141)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
   at 
org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
   at 
org.apache.spark.SparkContext.parallelize(SparkContext.scala:714)
   at 
org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:145)
   at 
org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:157)
   ...
Caused by: scala.MatchError: (None,None) (of class scala.Tuple2)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply$mcV$sp(OptionSerializerModule.scala:32)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
   at scala.Option.getOrElse(Option.scala:120)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:31)
   at 
com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:22)
   at 
com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505)
   at 
com.fasterxml.jackson.module.scala.ser.OptionPropertyWriter.serializeAsField(OptionSerializerModule.scala:128)
   at 
com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639)
   ... 19 more

I've tried updating jackson-module-scala to 2.6.1, but I hit the same issue. 
This happens in local mode with Java 1.8.0_40. I searched the web and this 
JIRA for similar issues but found nothing of interest.
 

  was:
When trying to create an RDD of Rows using a Java Spark Context:

List<Row> rows = new Vector<>();
rows.add(RowFactory.create("test"));
javaSparkContext.parallelize(rows);

I get:

com.fasterxml.jackson.databind.JsonMappingException: (None,None) (of class 
scala.Tuple2) (through reference chain: 
org.apache.spark.rdd.RDDOperationScope["parent"])
   at 
com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:210)
   at 
com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:177)
   at 
com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:187)
   at 
com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:647)
   at 
com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152)
   at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
   at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
   at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
   at 
org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:50)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:141)
   at 
org.apache.spark.

[jira] [Created] (SPARK-10799) Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads

2015-09-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10799:
-

 Summary: Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: 
multiple threads
 Key: SPARK-10799
 URL: https://issues.apache.org/jira/browse/SPARK-10799
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Reporter: Xiangrui Meng
Assignee: Shixiong Zhu


Saw test failures after PR #6457, e.g.,

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3596/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark.rpc.netty/InboxSuite/post__multiple_threads/

{code}
org.apache.spark.rpc.netty.InboxSuite.post: multiple threads

Failing for the past 1 build (Since Failed#3596 )
Took 9 ms.
add description
Error Message

org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not empty
Stacktrace

sbt.ForkMain$ForkError: 
org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not empty
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply$mcV$sp(InboxSuite.scala:94)
at 
org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
at 
org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at org.scalatest.FunSuite.run(FunSuite.scala:1555)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10799) Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads

2015-09-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10799:
--
Labels: flaky-test  (was: )

> Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> 
>
> Key: SPARK-10799
> URL: https://issues.apache.org/jira/browse/SPARK-10799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Xiangrui Meng
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> Saw test failures after PR #6457, e.g.,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3596/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark.rpc.netty/InboxSuite/post__multiple_threads/
> {code}
> org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> Failing for the past 1 build (Since Failed#3596 )
> Took 9 ms.
> add description
> Error Message
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply$mcV$sp(InboxSuite.scala:94)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.

[jira] [Reopened] (SPARK-10651) Flaky test: BroadcastSuite

2015-09-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-10651:
---

I think the timeout doesn't work well. Saw more failures:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3596/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/testReport/junit/org.apache.spark.broadcast/BroadcastSuite/Unpersisting_TorrentBroadcast_on_executors_only_in_distributed_mode/

{code}
org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast on 
executors only in distributed mode

Failing for the past 1 build (Since Failed#3596 )
Took 1 min 0 sec.
add description
Error Message

Can't find 2 executors before 6 milliseconds elapsed
Stacktrace

sbt.ForkMain$ForkError: Can't find 2 executors before 6 milliseconds elapsed
at 
org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
at 
org.apache.spark.broadcast.BroadcastSuite.liftedTree1$1(BroadcastSuite.scala:314)
at 
org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
at 
org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$15.apply$mcV$sp(BroadcastSuite.scala:161)
at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$15.apply(BroadcastSuite.scala:161)
at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$15.apply(BroadcastSuite.scala:161)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.broadcast.BroadcastSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(BroadcastSuite.scala:46)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.broadcast.BroadcastSuite.runTest(BroadcastSuite.scala:46)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at 
org.apache.spark.broadcast.BroadcastSuite.org$scalatest$BeforeAndAfterAll$$super$run(BroadcastSuite.scala:46)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
at 
org.apache.spark.broadcast.BroadcastSuite.run(BroadcastSuite.scala:46)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.Thr

[jira] [Created] (SPARK-10800) Flaky test: org.apache.spark.deploy.StandaloneDynamicAllocationSuite

2015-09-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10800:
-

 Summary: Flaky test: 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite
 Key: SPARK-10800
 URL: https://issues.apache.org/jira/browse/SPARK-10800
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 1.6.0
Reporter: Xiangrui Meng
Assignee: Shixiong Zhu


Saw several failures on master:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3622/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/junit/org.apache.spark.deploy/

{code}
org.apache.spark.deploy.StandaloneDynamicAllocationSuite.dynamic allocation 
default behavior

Failing for the past 1 build (Since Failed#3622 )
Took 0.12 sec.
add description
Error Message

1 did not equal 2
Stacktrace

  org.scalatest.exceptions.TestFailedException: 1 did not equal 2
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:78)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:73)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:73)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneDynamicAllocationSuite.scala:33)
  at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite.runTest(StandaloneDynamicAllocationSuite.scala:33)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  at org.scalatest.Suite$class.run(Suite.scala:1424)
  at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
  at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneDynamicAllocationSuite.scala:33)
  at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite.run(StandaloneDynamicAllocationSuite.scala:33)
  at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at org.scalatest.Suite$class.runNestedSuites(Suite.sc

[jira] [Updated] (SPARK-10651) Flaky test: BroadcastSuite

2015-09-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10651:
--
Fix Version/s: (was: 1.6.0)

> Flaky test: BroadcastSuite
> --
>
> Key: SPARK-10651
> URL: https://issues.apache.org/jira/browse/SPARK-10651
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: flaky-test
> Attachments: BroadcastSuiteFailures.csv
>
>
> Saw many failures recently in master build. See attached CSV for a full list. 
> Most of the error messages are:
> {code}
> Can't find 2 executors before 1 milliseconds elapsed
> {code}
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10799) Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads

2015-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10799:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> 
>
> Key: SPARK-10799
> URL: https://issues.apache.org/jira/browse/SPARK-10799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: flaky-test
>
> Saw test failures after PR #6457, e.g.,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3596/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark.rpc.netty/InboxSuite/post__multiple_threads/
> {code}
> org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> Failing for the past 1 build (Since Failed#3596 )
> Took 9 ms.
> add description
> Error Message
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply$mcV$sp(InboxSuite.scala:94)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issu

[jira] [Commented] (SPARK-10799) Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads

2015-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906490#comment-14906490
 ] 

Apache Spark commented on SPARK-10799:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/8905

> Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> 
>
> Key: SPARK-10799
> URL: https://issues.apache.org/jira/browse/SPARK-10799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Xiangrui Meng
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> Saw test failures after PR #6457, e.g.,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3596/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark.rpc.netty/InboxSuite/post__multiple_threads/
> {code}
> org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> Failing for the past 1 build (Since Failed#3596 )
> Took 9 ms.
> add description
> Error Message
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply$mcV$sp(InboxSuite.scala:94)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}




[jira] [Assigned] (SPARK-10799) Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads

2015-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10799:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Flaky test: org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> 
>
> Key: SPARK-10799
> URL: https://issues.apache.org/jira/browse/SPARK-10799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Xiangrui Meng
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> Saw test failures after PR #6457, e.g.,
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3596/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark.rpc.netty/InboxSuite/post__multiple_threads/
> {code}
> org.apache.spark.rpc.netty.InboxSuite.post: multiple threads
> Failing for the past 1 build (Since Failed#3596 )
> Took 9 ms.
> Error Message
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3$$anon$1@73986812 was not 
> empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply$mcV$sp(InboxSuite.scala:94)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.apache.spark.rpc.netty.InboxSuite$$anonfun$3.apply(InboxSuite.scala:66)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1555)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}




[jira] [Updated] (SPARK-10801) StatCounter uses mutability and is not thread-safe

2015-09-24 Thread Gianmario Spacagna (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmario Spacagna updated SPARK-10801:
---
Summary: StatCounter uses mutability and is not thread-safe  (was: 
StatCounter uses mutability, is not thread-safe and hard to understand its 
implementation)

> StatCounter uses mutability and is not thread-safe
> --
>
> Key: SPARK-10801
> URL: https://issues.apache.org/jira/browse/SPARK-10801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Gianmario Spacagna
>
> The current implementation of StatCounter is mutable and not thread-safe.
> The API for creating it is also limiting, since it only exposes a constructor 
> taking a TraversableOnce[Double].
> Moreover, the current implementation does not offer any equality.
> My proposal is to use a case class to store the minimum set of fields 
> necessary to compute the statistics, making it easy to apply the Monoid 
> pattern to reduce an RDD or a Scala collection of StatCounters into a 
> single StatCounter.
> I have re-implemented and tested StatCounter at my work after I found a bug 
> when trying to merge multiple stat counters in parallel using a Scalaz 
> Monoid. I would like to send a pull request of that functional, clean and 
> concise re-implementation.
> This would be the declaration of the class:
> case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
> Double)
> That would also change the implementation of variance into a single line:
> def variance = (sos - n * mean * mean) / (n - 1)






[jira] [Created] (SPARK-10801) StatCounter uses mutability, is not thread-safe and hard to understand its implementation

2015-09-24 Thread Gianmario Spacagna (JIRA)
Gianmario Spacagna created SPARK-10801:
--

 Summary: StatCounter uses mutability, is not thread-safe and hard 
to understand its implementation
 Key: SPARK-10801
 URL: https://issues.apache.org/jira/browse/SPARK-10801
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Gianmario Spacagna


The current implementation of StatCounter is mutable and not thread-safe.
The API for creating it is also limiting, since it only exposes a constructor 
taking a TraversableOnce[Double].
Moreover, the current implementation does not offer any equality.

My proposal is to use a case class to store the minimum set of fields 
necessary to compute the statistics, making it easy to apply the Monoid 
pattern to reduce an RDD or a Scala collection of StatCounters into a single 
StatCounter.

I have re-implemented and tested StatCounter at my work after I found a bug 
when trying to merge multiple stat counters in parallel using a Scalaz Monoid. 
I would like to send a pull request of that functional, clean and concise 
re-implementation.

This would be the declaration of the class:

case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
Double)

That would also change the implementation of variance into a single line:
def variance = (sos - n * mean * mean) / (n - 1)
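
For illustration, here is a minimal sketch of such an immutable, 
Monoid-friendly StatCounter (the merge/empty names and the companion helpers 
are assumptions beyond the proposal above):

{code}
case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: Double) {
  def mean: Double = sum / n
  def variance: Double = (sos - n * mean * mean) / (n - 1)
  // Monoid-style combine: the statistics of the union of two samples.
  def merge(other: StatCounter): StatCounter = StatCounter(
    n + other.n,
    sum + other.sum,
    sos + other.sos,
    math.min(min, other.min),
    math.max(max, other.max))
}

object StatCounter {
  // Identity element for merge.
  val empty = StatCounter(0L, 0.0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
  // Statistics of a single sample.
  def apply(x: Double): StatCounter = StatCounter(1L, x, x * x, x, x)
}

// e.g. reducing an RDD[Double] of samples:
// rdd.map(StatCounter(_)).fold(StatCounter.empty)(_ merge _)
{code}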






[jira] [Commented] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906505#comment-14906505
 ] 

Jonathan Kelly commented on SPARK-10790:


Yes, this is on Spark 1.5.0. That's why I chose 1.5.0 for the Affected 
Version(s). In my email, I said, "I'm running into a problem with YARN 
dynamicAllocation on Spark 1.5.0 after using it successfully on an identically 
configured cluster with Spark 1.4.1." I was just stating up front that I know 
that what I was doing worked in 1.4.1 but broke after the upgrade to 1.5.0, 
rather than me never having gotten it to work.

So in your initial correspondence above when you said that there were fixes in 
1.5, you meant 1.5.0? I had assumed you knew that I was using 1.5.0 and you 
just meant that there were fixes in 1.5.1+.

When I look at the diff for this file between v1.4.1 and v1.5.0, I see that 
this is when the "initializing" check was added, so this is most likely the 
cause of this issue. See SPARK-7699.

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>Priority: Critical
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.






[jira] [Updated] (SPARK-10801) StatCounter uses mutability and is not thread-safe

2015-09-24 Thread Gianmario Spacagna (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmario Spacagna updated SPARK-10801:
---
Description: 
The current implementation of org.apache.spark.util.StatCounter is mutable and 
not thread-safe.
The API for creating it is also limiting, since it only exposes a constructor 
taking a TraversableOnce[Double].
Moreover, the current implementation does not offer any equality.

My proposal is to use a case class to store the minimum set of fields 
necessary to compute the statistics, making it easy to apply the Monoid 
pattern to reduce an RDD or a Scala collection of StatCounters into a single 
StatCounter.

I have re-implemented and tested StatCounter at my work after I found a bug 
when trying to merge multiple stat counters in parallel using a Scalaz Monoid. 
I would like to send a pull request of that functional, clean and concise 
re-implementation.

This would be the declaration of the class:

case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
Double)

That would also change the implementation of variance into a single line:
def variance = (sos - n * mean * mean) / (n - 1)

  was:
The current implementation of StatCounter is mutable and not thread-safe.
The API for creating it is also limiting, since it only exposes a constructor 
taking a TraversableOnce[Double].
Moreover, the current implementation does not offer any equality.

My proposal is to use a case class to store the minimum set of fields 
necessary to compute the statistics, making it easy to apply the Monoid 
pattern to reduce an RDD or a Scala collection of StatCounters into a single 
StatCounter.

I have re-implemented and tested StatCounter at my work after I found a bug 
when trying to merge multiple stat counters in parallel using a Scalaz Monoid. 
I would like to send a pull request of that functional, clean and concise 
re-implementation.

This would be the declaration of the class:

case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
Double)

That would also change the implementation of variance into a single line:
def variance = (sos - n * mean * mean) / (n - 1)


> StatCounter uses mutability and is not thread-safe
> --
>
> Key: SPARK-10801
> URL: https://issues.apache.org/jira/browse/SPARK-10801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Gianmario Spacagna
>
> The current implementation of org.apache.spark.util.StatCounter is mutable 
> and not thread-safe.
> The API for creating it is also limiting, since it only exposes a constructor 
> taking a TraversableOnce[Double].
> Moreover, the current implementation does not offer any equality.
> My proposal is to use a case class to store the minimum set of fields 
> necessary to compute the statistics, making it easy to apply the Monoid 
> pattern to reduce an RDD or a Scala collection of StatCounters into a single 
> StatCounter.
> I have re-implemented and tested StatCounter at my work after I found a bug 
> when trying to merge multiple stat counters in parallel using a Scalaz 
> Monoid. I would like to send a pull request of that functional, clean and 
> concise re-implementation.
> This would be the declaration of the class:
> case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
> Double)
> That would also change the implementation of variance into a single line:
> def variance = (sos - n * mean * mean) / (n - 1)






[jira] [Updated] (SPARK-10801) StatCounter uses mutability and is not thread-safe

2015-09-24 Thread Gianmario Spacagna (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmario Spacagna updated SPARK-10801:
---
Affects Version/s: 1.0.0

> StatCounter uses mutability and is not thread-safe
> --
>
> Key: SPARK-10801
> URL: https://issues.apache.org/jira/browse/SPARK-10801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Gianmario Spacagna
>
> The current implementation of org.apache.spark.util.StatCounter is mutable 
> and not thread-safe.
> The API for creating it is also limiting, since it only exposes a constructor 
> taking a TraversableOnce[Double].
> Moreover, the current implementation does not offer any equality.
> My proposal is to use a case class to store the minimum set of fields 
> necessary to compute the statistics, making it easy to apply the Monoid 
> pattern to reduce an RDD or a Scala collection of StatCounters into a single 
> StatCounter.
> I have re-implemented and tested StatCounter at my work after I found a bug 
> when trying to merge multiple stat counters in parallel using a Scalaz 
> Monoid. I would like to send a pull request of that functional, clean and 
> concise re-implementation.
> This would be the declaration of the class:
> case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
> Double)
> That would also change the implementation of variance into a single line:
> def variance = (sos - n * mean * mean) / (n - 1)






[jira] [Updated] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10778:
--
Assignee: shimizu yoshihiro

> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: shimizu yoshihiro
>Priority: Trivial
>  Labels: starter
>
> pretty print for association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}
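
A sketch of one way to format that, assuming Rule exposes antecedent, 
consequent and confidence as in the spark.mllib.fpm API (not the committed 
implementation):

{code}
def ruleToString[Item](antecedent: Array[Item],
                       consequent: Array[Item],
                       confidence: Double): String =
  s"${antecedent.mkString("{", ",", "}")} => " +
    s"${consequent.mkString("{", ",", "}")}: $confidence"

// ruleToString(Array("a", "b", "c"), Array("d"), 0.8)
// => {a,b,c} => {d}: 0.8
{code}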






[jira] [Commented] (SPARK-10801) StatCounter uses mutability and is not thread-safe

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906512#comment-14906512
 ] 

Sean Owen commented on SPARK-10801:
---

Are you suggesting it be immutable? I think that would be much slower and is 
probably a non-starter. Where is thread-safety required? The pattern you're 
using is a good default, but I think it is intentionally not used here.

> StatCounter uses mutability and is not thread-safe
> --
>
> Key: SPARK-10801
> URL: https://issues.apache.org/jira/browse/SPARK-10801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Gianmario Spacagna
>
> The current implementation of org.apache.spark.util.StatCounter is mutable 
> and not thread-safe.
> The API for creating it is also limiting, since it only exposes a constructor 
> taking a TraversableOnce[Double].
> Moreover, the current implementation does not offer any equality.
> My proposal is to use a case class to store the minimum set of fields 
> necessary to compute the statistics, making it easy to apply the Monoid 
> pattern to reduce an RDD or a Scala collection of StatCounters into a single 
> StatCounter.
> I have re-implemented and tested StatCounter at my work after I found a bug 
> when trying to merge multiple stat counters in parallel using a Scalaz 
> Monoid. I would like to send a pull request of that functional, clean and 
> concise re-implementation.
> This would be the declaration of the class:
> case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: 
> Double)
> That would also change the implementation of variance into a single line:
> def variance = (sos - n * mean * mean) / (n - 1)






[jira] [Updated] (SPARK-10670) Link to each language's API in codetabs in ML docs: spark.ml

2015-09-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10670:
--
Assignee: yuhao yang

> Link to each language's API in codetabs in ML docs: spark.ml
> 
>
> Key: SPARK-10670
> URL: https://issues.apache.org/jira/browse/SPARK-10670
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> In the Markdown docs for the spark.ml Programming Guide, we have code 
> examples with codetabs for each language. We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this. For an example of what we want to do, see the "Word2Vec" section in 
> https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md
> This JIRA is just for spark.ml, not spark.mllib






[jira] [Updated] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10790:
--
Priority: Major  (was: Critical)

Yeah, if so, then ignore most of this, since I thought you were on 1.4.1. I 
realize you say 1.5.0 here.

[~jerryshao] what do you think of this? It looks like there's a missed step in 
here: when the stage is submitted and the manager is then allowed to request 
executors, it never actually requests the initial set of executors, whereas 
before it happened to have already done that at the outset?

Of course, you probably want to set your initial number of executors lower 
anyway, and that's a workaround, but it shouldn't work this way.
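
As a sketch of that workaround (the property names are the standard dynamic 
allocation settings; the values are illustrative only):

{code}
import org.apache.spark.SparkConf

// Start from the minimum and let dynamic allocation scale up on demand.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.initialExecutors", "0")
{code}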

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.






[jira] [Updated] (SPARK-10789) Cluster mode SparkSubmit classpath only includes Spark assembly

2015-09-24 Thread Jonathan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Kelly updated SPARK-10789:
---
Summary: Cluster mode SparkSubmit classpath only includes Spark assembly  
(was: Cluster mode SparkSubmit classpath only includes Spark classpath)

> Cluster mode SparkSubmit classpath only includes Spark assembly
> ---
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: 

[jira] [Created] (SPARK-10802) Let ALS recommend for subset of data

2015-09-24 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-10802:
---

 Summary: Let ALS recommend for subset of data
 Key: SPARK-10802
 URL: https://issues.apache.org/jira/browse/SPARK-10802
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Tomasz Bartczak


Currently MatrixFactorizationModel allows getting recommendations for
- a single user
- a single product
- all users
- all products

Recommendations for all users/products do a cartesian join inside.

It would be useful in some cases to get recommendations for a subset of 
users/products by providing an RDD with which MatrixFactorizationModel could do 
an intersection before doing the cartesian join. This would make it much faster 
in situations where recommendations are needed only for a subset of 
users/products, and where the subset is still too large to make it feasible to 
recommend one-by-one.







[jira] [Updated] (SPARK-9103) Tracking spark's memory usage

2015-09-24 Thread Zhang, Liye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang, Liye updated SPARK-9103:
---
Attachment: Tracking Spark Memory Usage - Phase 1.pdf

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only a little memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea what the memory consumption 
> is when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they encounter an OOM, it's really hard to know 
> what the cause of the problem is. So it would be helpful to expose detailed 
> memory consumption information for each part of Spark, so that users can have 
> a clear picture of exactly where the memory is used. 
> The memory usage info to expose should include, but not be limited to, 
> shuffle, cache, network, serializer, etc.
> Users can optionally choose to enable this functionality, since it is mainly 
> for debugging and tuning.






[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-09-24 Thread Konstantinos Kougios (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906572#comment-14906572
 ] 

Konstantinos Kougios commented on SPARK-5928:
-

Is there a workaround for this?
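
(A commonly used mitigation, offered here only as a sketch and not an official 
fix, is to spread the data over more keys and more reduce partitions so that 
no single shuffle block approaches 2 GB; for the repro quoted below:)

{code}
// The constants are illustrative only: many distinct keys and many
// partitions keep each individual shuffle block small.
val keyed = rdd.map(x => (scala.util.Random.nextInt(1000), x))
keyed.groupByKey(1000).count()
{code}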

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.

[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-24 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906583#comment-14906583
 ] 

Gayathri Murali commented on SPARK-10688:
-

I started working on it as well. Since there isn't a way to ensure exclusivity, 
we should go with first-come, first-served.

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.






[jira] [Created] (SPARK-10803) Allow users to write and query Parquet user-defined key-value metadata directly

2015-09-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10803:
--

 Summary: Allow users to write and query Parquet user-defined 
key-value metadata directly
 Key: SPARK-10803
 URL: https://issues.apache.org/jira/browse/SPARK-10803
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0, 1.4.1, 1.3.1, 1.2.2, 1.1.1, 1.0.2
Reporter: Cheng Lian


Currently Spark SQL only allows users to set and get per-column metadata of a 
DataFrame. This metadata can then be persisted to Parquet as part of the 
Catalyst schema information contained in the user-defined key-value metadata. 
It would be nice if we could allow users to write and query Parquet 
user-defined key-value metadata directly, or perhaps offer a more general way 
to allow DataFrame-level (rather than column-level) metadata.
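
For context, a sketch of the per-column mechanism that exists today (the 
DataFrame {{df}} and the column name are assumptions):

{code}
import org.apache.spark.sql.types.MetadataBuilder

// Per-column metadata is persisted inside the Catalyst schema that Spark
// stores in Parquet's user-defined key-value metadata.
val md = new MetadataBuilder().putString("comment", "age in years").build()
val tagged = df.select(df("age").as("age", md))
tagged.write.parquet("/tmp/tagged.parquet")
{code}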






[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2015-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906612#comment-14906612
 ] 

Sean Owen commented on SPARK-10802:
---

You can already pass an RDD of (user, item) pairs. I think that's exactly what 
you're asking for? You have to do the join, but that's a feature in a way -- 
you define exactly what you want to predict.
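
A sketch of that usage (the model and the two ID RDDs are assumed to be 
defined elsewhere):

{code}
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Score only a chosen subset of (user, product) pairs with the existing
// predict(RDD[(Int, Int)]) API.
def scoreSubset(model: MatrixFactorizationModel,
                users: RDD[Int],
                products: RDD[Int]): RDD[Rating] = {
  val pairs: RDD[(Int, Int)] = users.cartesian(products)
  model.predict(pairs)
}
{code}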

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>
> Currently MatrixFactorizationModel allows getting recommendations for
> - a single user
> - a single product
> - all users
> - all products
> Recommendations for all users/products do a cartesian join inside.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make it feasible 
> to recommend one-by-one.






[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906618#comment-14906618
 ] 

Yin Huai commented on SPARK-10741:
--

[~ianlcsd] Can you try the following queries to see if you can work around this 
issue for now?

{code}
// First query
SELECT c1, avg ( c2 ) as c_avg
FROM test10741
GROUP BY c1
HAVING ( c_avg > 5)  ORDER BY c1

// Second query
SELECT c1, avg ( c2 ) c_avg
FROM test10741
GROUP BY c1
ORDER BY c_avg
{code}

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}






[jira] [Resolved] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10765.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-10765) use new aggregate interface for hive UDAF

2015-09-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906637#comment-14906637
 ] 

Yin Huai commented on SPARK-10765:
--

This issue has been resolved by https://github.com/apache/spark/pull/8874.

> use new aggregate interface for hive UDAF
> -
>
> Key: SPARK-10765
> URL: https://issues.apache.org/jira/browse/SPARK-10765
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906656#comment-14906656
 ] 

Joseph K. Bradley commented on SPARK-10487:
---

Ohh, that's very helpful.  I suspect it's because Parquet allocates large 
buffers for each column.  It's still surprising to me since there are only 2 
columns.  (I've only seen this problem before with saving decision trees, which 
creates 13+ columns.)  I'm wondering if some of the data from model fitting is 
still cached and does not get kicked out of the cache when needed.  I'll try 
running this and will monitor the Spark UI to see if some temp data are staying 
cached unnecessarily.

Note though that you're using a very small amount of memory.  In general, I try 
to use about 20GB for the driver and 8GB for executors for common jobs.

> MLlib model fitting causes DataFrame write to break with OutOfMemory exception
> --
>
> Key: SPARK-10487
> URL: https://issues.apache.org/jira/browse/SPARK-10487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tried in a centos-based 1-node YARN in docker and on a 
> real-world CDH5 cluster
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest 
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
> -DzincPort=3034
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor 
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
>Reporter: Zoltan Toth
>
> After fitting a _spark.ml_ or _mllib model_ in *cluster* deploy mode, no 
> dataframes can be written to hdfs. The driver receives an OutOfMemory 
> exception during the writing. It seems, however, that the file gets written 
> successfully.
>  * This happens both in SparkR and pyspark
>  * Only happens in cluster deploy mode
>  * The write fails regardless the size of the dataframe and whether the 
> dataframe is associated with the ml model.
> REPRO:
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
> from pyspark.ml.classification import LogisticRegression
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
> conf = SparkConf().setAppName("LogRegTest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
> df = training.toDF()
> reg = LogisticRegression().setMaxIter(10).setRegParam(0.01)
> model = reg.fit(df)
> # Note that this is a brand new dataframe:
> one_df = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF()
> one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
> {code}






[jira] [Updated] (SPARK-10797) RDD's coalesce should not write out the temporary key

2015-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10797:
-
Description: 
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself. 
The HashPartitioner
  // will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using {{position}} as keys as in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and value to the 
spacified partition. On the next stage, after reading we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.

  was:
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself. 
The HashPartitioner
  // will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using {{position}} as keys as in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the 
{{DiskBlockObjectWriter}} write out both the (temporary) key and value to the 
spacified partition. On the next stage, after reading we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.


> RDD's coalesce should not write out the temporary key
> -
>
> Key: SPARK-10797
> URL: https://issues.apache.org/jira/browse/SPARK-10797
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Zoltán Zvara
>
> It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle 
> files) temporary keys used on the shuffle code path. Consider the following 
> code:
> {code:title=RDD.scala|borderStyle=solid}
> if (shuffle) {
>   /** Distributes elements evenly across output partitions, starting from 
> a random partition. */
>   val distributePartition = (index: Int, items: Iterator[T]) => {
> var position = (new Random(index)).nextInt(numPartitions)
> items.map { t =>
>   // Note that the hash code of the key will just be the key itself. 
> The HashPartitioner
>   // will mod it with the number of total partitions.
>   position = position + 1
>   (position, t)
> }
>   } : Iterator[(Int, T)]
>   // include a shuffle step so that our upstream tasks are still 
> distributed
>   new CoalescedRDD(
> new ShuffledRDD[Int, T, 
> T](mapPartitionsWithIndex(distributePartition),
> new HashPartitioner(numPartitions)),
> numPartitions).values
> } else {
> {code}
> {{ShuffledRDD}} will hash using {{position}} as keys as in the 
> {{distributePartition}} function. After the bucket has been chosen by the 
> sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the 
> {{DiskBlockObjectWriter}} writes out both the (temporary) key and value to 
> the spacified partition. On th

[jira] [Updated] (SPARK-10797) RDD's coalesce should not write out the temporary key

2015-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10797:
-
Description: 
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself. 
The HashPartitioner
  // will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using the {{position}} values as keys, as assigned in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to the 
specified partition. In the next stage, after reading, we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.

  was:
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself.
  // The HashPartitioner will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using the {{position}} values as keys, as assigned in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to the 
specified partition. In the next stage, after reading, we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.


> RDD's coalesce should not write out the temporary key
> -
>
> Key: SPARK-10797
> URL: https://issues.apache.org/jira/browse/SPARK-10797
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Zoltán Zvara
>
> It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle 
> files) temporary keys used on the shuffle code path. Consider the following 
> code:
> {code:title=RDD.scala|borderStyle=solid}
> if (shuffle) {
>   /** Distributes elements evenly across output partitions, starting from 
> a random partition. */
>   val distributePartition = (index: Int, items: Iterator[T]) => {
> var position = (new Random(index)).nextInt(numPartitions)
> items.map { t =>
>   // Note that the hash code of the key will just be the key itself.
>   // The HashPartitioner will mod it with the number of total partitions.
>   position = position + 1
>   (position, t)
> }
>   } : Iterator[(Int, T)]
>   // include a shuffle step so that our upstream tasks are still 
> distributed
>   new CoalescedRDD(
> new ShuffledRDD[Int, T, 
> T](mapPartitionsWithIndex(distributePartition),
> new HashPartitioner(numPartitions)),
> numPartitions).values
> } else {
> {code}
> {{ShuffledRDD}} will hash using the {{position}} values as keys, as assigned in 
> the {{distributePartition}} function. After the bucket has been chosen by the 
> sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
> {{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to 
> the specified partition. In the next stage, after reading, we take only the 
> values with {{PairRDDFunctions}}.

[jira] [Updated] (SPARK-10797) RDD's coalesce should not write out the temporary key

2015-09-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Zvara updated SPARK-10797:
-
Description: 
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself.
  // The HashPartitioner will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using the {{position}} values as keys, as assigned in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to the 
specified partition. In the next stage, after reading, we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.

  was:
It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) 
temporary keys used on the shuffle code path. Consider the following code:

{code:title=RDD.scala|borderStyle=solid}
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a 
random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
var position = (new Random(index)).nextInt(numPartitions)
items.map { t =>
  // Note that the hash code of the key will just be the key itself.
  // The HashPartitioner will mod it with the number of total partitions.
  position = position + 1
  (position, t)
}
  } : Iterator[(Int, T)]

  // include a shuffle step so that our upstream tasks are still distributed
  new CoalescedRDD(
new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
new HashPartitioner(numPartitions)),
numPartitions).values
} else {
{code}

{{ShuffledRDD}} will hash using the {{position}} values as keys, as assigned in the 
{{distributePartition}} function. After the bucket has been chosen by the 
sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
{{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to the 
specified partition. In the next stage, after reading, we take only the values 
with {{PairRDDFunctions}}.

This certainly has a performance impact, as we unnecessarily write/read 
{{Int}}s and transform the data.


> RDD's coalesce should not write out the temporary key
> -
>
> Key: SPARK-10797
> URL: https://issues.apache.org/jira/browse/SPARK-10797
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Zoltán Zvara
>
> It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle 
> files) temporary keys used on the shuffle code path. Consider the following 
> code:
> {code:title=RDD.scala|borderStyle=solid}
> if (shuffle) {
>   /** Distributes elements evenly across output partitions, starting from 
> a random partition. */
>   val distributePartition = (index: Int, items: Iterator[T]) => {
> var position = (new Random(index)).nextInt(numPartitions)
> items.map { t =>
>   // Note that the hash code of the key will just be the key itself.
>   // The HashPartitioner will mod it with the number of total partitions.
>   position = position + 1
>   (position, t)
> }
>   } : Iterator[(Int, T)]
>   // include a shuffle step so that our upstream tasks are still 
> distributed
>   new CoalescedRDD(
> new ShuffledRDD[Int, T, 
> T](mapPartitionsWithIndex(distributePartition),
> new HashPartitioner(numPartitions)),
> numPartitions).values
> } else {
> {code}
> {{ShuffledRDD}} will hash using the {{position}} values as keys, as assigned in 
> the {{distributePartition}} function. After the bucket has been chosen by the 
> sorter ({{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}), the 
> {{DiskBlockObjectWriter}} writes out both the (temporary) key and the value to 
> the specified partition. In the next stage, after reading, we take only the 
> values with {{PairRDDFunctions}}.

[jira] [Commented] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906696#comment-14906696
 ] 

Saisai Shao commented on SPARK-10790:
-

Thanks [~srowen], let me check it.

> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2015-09-24 Thread Antonio Piccolboni (JIRA)
Antonio Piccolboni created SPARK-10804:
--

 Summary: "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
 Key: SPARK-10804
 URL: https://issues.apache.org/jira/browse/SPARK-10804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Antonio Piccolboni


When connecting to a remote thriftserver with a custom JDBC client or beeline, 
LOAD DATA LOCAL INPATH fails. The HiveServer2 docs explain in a quick comment 
that "local" now means local to the server. I think this is just a 
rationalization for a bug. When a user types "local":

# it needs to be local to them, not to some server; 
# failing 1., one needs to have a way to determine what "local" means and create 
a "local" item under the new definition. 

With the thriftserver, I have a host to connect to, but I don't have any way to 
create a file local to that host, at least in Spark. It may not be desirable to 
create user directories on the thriftserver host or to run file transfer 
services like scp. Moreover, it appears that this syntax is unique to Hive and 
Spark, but its origin can be traced to LOAD DATA LOCAL INFILE in Oracle and was 
adopted by MySQL. In the latter's docs we can read: "If LOCAL is specified, the 
file is read by the client program on the client host and sent to the server. 
The file can be given as a full path name to specify its exact location. If 
given as a relative path name, the name is interpreted relative to the 
directory in which the client program was started". This is not to say that the 
Spark or Hive teams are bound to what Oracle and MySQL do, but to support the 
idea that the meaning of LOCAL is settled. For instance, the Impala 
documentation says: "Currently, the Impala LOAD DATA statement only imports 
files from HDFS, not from the local filesystem. It does not support the LOCAL 
keyword of the Hive LOAD DATA statement." I think this is a better solution. 
The way things are in the thriftserver, I developed a client under the 
assumption that I could use LOAD DATA LOCAL INPATH, and all tests were passing 
in standalone mode, only to find with the first distributed test that 

# LOCAL means "local to server", a.k.a. "remote"; 
# INSERT INTO ... VALUES is not supported; 
# there is really no workaround unless one assumes access to whatever data 
store Spark is running against, like HDFS, and that the user can upload data 
to it. 


In the space of workarounds it is not terrible, but if you are trying to write 
a self-contained Spark package, that's a defeat and makes writing tests 
particularly hard.
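
A minimal sketch of the surprising behavior over JDBC (the host, path, and 
table name are all hypothetical; this assumes a remote Spark thriftserver is 
listening on thrift-host:10000 and that table t exists):

{code}
import java.sql.DriverManager

// Connect to the remote thriftserver over JDBC (HiveServer2 protocol).
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default")
val stmt = conn.createStatement()

// "LOCAL" is resolved on the server: this fails unless the file exists on
// thrift-host itself, even though the statement was issued from the client.
stmt.execute("LOAD DATA LOCAL INPATH '/home/me/data.csv' INTO TABLE t")
{code}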



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"

2015-09-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906723#comment-14906723
 ] 

Marcelo Vanzin commented on SPARK-10804:


This is really a Hive issue, which Spark just inherits since it calls the Hive 
code directly to handle that statement.

> "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
> 
>
> Key: SPARK-10804
> URL: https://issues.apache.org/jira/browse/SPARK-10804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Antonio Piccolboni
>
> When connecting to a remote thriftserver with a custom JDBC client or beeline, 
> LOAD DATA LOCAL INPATH fails. The HiveServer2 docs explain in a quick comment 
> that "local" now means local to the server. I think this is just a 
> rationalization for a bug. When a user types "local": 
> # it needs to be local to them, not to some server; 
> # failing 1., one needs to have a way to determine what "local" means and 
> create a "local" item under the new definition. 
> With the thriftserver, I have a host to connect to, but I don't have any way 
> to create a file local to that host, at least in Spark. It may not be 
> desirable to create user directories on the thriftserver host or to run file 
> transfer services like scp. Moreover, it appears that this syntax is unique 
> to Hive and Spark, but its origin can be traced to LOAD DATA LOCAL INFILE in 
> Oracle and was adopted by MySQL. In the latter's docs we can read: "If LOCAL is 
> specified, the file is read by the client program on the client host and sent 
> to the server. The file can be given as a full path name to specify its exact 
> location. If given as a relative path name, the name is interpreted relative 
> to the directory in which the client program was started". This is not to say 
> that the Spark or Hive teams are bound to what Oracle and MySQL do, but to 
> support the idea that the meaning of LOCAL is settled. For instance, the 
> Impala documentation says: "Currently, the Impala LOAD DATA statement only 
> imports files from HDFS, not from the local filesystem. It does not support 
> the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better 
> solution. The way things are in the thriftserver, I developed a client under 
> the assumption that I could use LOAD DATA LOCAL INPATH, and all tests were 
> passing in standalone mode, only to find with the first distributed test that 
> # LOCAL means "local to server", a.k.a. "remote"; 
> # INSERT INTO ... VALUES is not supported; 
> # there is really no workaround unless one assumes access to whatever data 
> store Spark is running against, like HDFS, and that the user can upload data 
> to it. 
> In the space of workarounds it is not terrible, but if you are trying to 
> write a self-contained Spark package, that's a defeat and makes writing tests 
> particularly hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10805) JSON Data Frame does not return correct string lengths

2015-09-24 Thread Jeff Li (JIRA)
Jeff Li created SPARK-10805:
---

 Summary: JSON Data Frame does not return correct string lengths
 Key: SPARK-10805
 URL: https://issues.apache.org/jira/browse/SPARK-10805
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Jeff Li
Priority: Critical


Here is the sample code to run the test:

@Test
public void runSchemaTest() throws Exception {
    DataFrame jsonDataFrame =
        sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
    jsonDataFrame.printSchema();

    StructType jsonSchema = jsonDataFrame.schema();
    StructField[] dataFields = jsonSchema.fields();
    for (int fieldIndex = 0; fieldIndex < dataFields.length; fieldIndex++) {
        StructField aField = dataFields[fieldIndex];
        DataType aType = aField.dataType();
        System.out.println("name: " + aField.name() + " type: " + aType.typeName()
            + " size: " + aType.defaultSize());
    }
}

name: _id type: string size: 4096
name: firstName type: string size: 4096
name: lastName type: string size: 4096

In my case, the _id is 1 character, the first name 4 characters, and the last 
name 7 characters.

The Spark JSON DataFrame should have a way to tell the maximum length of each 
JSON string element in the JSON document.
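
In the meantime, a workaround sketch in Scala (column names are taken from the 
sample output above; {{length}} is the Spark 1.5 SQL function): compute the 
actual maximum string length per column with an aggregation, instead of relying 
on {{defaultSize()}}, which is a fixed planning estimate (4096 for strings).

{code}
import org.apache.spark.sql.functions.{length, max}

// True maximum string length per column, computed from the data itself.
val maxLens = jsonDataFrame.agg(
  max(length(jsonDataFrame("_id"))),
  max(length(jsonDataFrame("firstName"))),
  max(length(jsonDataFrame("lastName"))))
maxLens.show()
{code}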



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10806) Following val redefinition, sometimes the old value is still visible

2015-09-24 Thread Boris Alexeev (JIRA)
Boris Alexeev created SPARK-10806:
-

 Summary: Following val redefinition, sometimes the old value is 
still visible
 Key: SPARK-10806
 URL: https://issues.apache.org/jira/browse/SPARK-10806
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.5.0
 Environment: on EC2, uname -a gives:
Linux ip-172-31-19-173 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 
UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
spark-shell itself prints:
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)

Reporter: Boris Alexeev


I am seeing odd behavior when I redefine a val in the REPL of the spark-shell 
of 1.5.0.  Here is my minimal test case:
   val a = 1
   def id(a:Int) = {a}
   val a = 2
   a
   id(a)

Specifically, if I run "~/spark/bin/spark-shell --master local" and
enter each of these five lines one-by-one (not in :paste mode, because
of the redefinition), I get the output at the end of my message below.

Expected behavior: both of the last two expressions evaluate to 2.
Observed behavior: "a" returns 2, but "id(a)" still returns 1.
Reproducible: always (for me) on Spark 1.5.0 but not 1.4.1.

I believe that the example is sensitive to how the variable names are used!  I 
can also reproduce the problem with more complicated "dependencies" among the 
variable names, e.g. if I define id using b, but val b was defined using a:
   val a = 1
   val b = a // this line is necessary for the problem!
   def id(b:Int) = {b}
   val a = 2
   a
   id(a)

I cannot reproduce this behavior in the Scala REPL directly for the few 
versions and configurations that I've tried, but I may not have been able to 
find the appropriate version (I have tried the obvious candidate).  That is, my 
Scala interactions have all had the expected behavior: they returned 2 for both 
of the last two expressions.  Similarly, I cannot reproduce this in Spark 1.4.1.

I believe this is a bug, but is this the desired behavior for some
reason?  Why does it happen in either case?

===

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a = 1
a: Int = 1

scala> def id(a:Int) = {a}
id: (a: Int)Int

scala> val a = 2
a: Int = 2

scala> a
res0: Int = 2

scala> id(a)
res1: Int = 1
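
A workaround sketch until this is resolved (purely illustrative; the renamed 
variables are my own): avoid re-binding the same name in the 1.5.0 REPL, so 
definitions capture the value you intend.

{code}
// Using distinct names sidesteps the stale-capture problem entirely.
val a1 = 1
def id(x: Int) = x
val a2 = 2
id(a2)  // returns 2, as expected
{code}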




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10807) Add as.data.frame() as a synonym for collect()

2015-09-24 Thread Oscar D. Lara Yejas (JIRA)
Oscar D. Lara Yejas created SPARK-10807:
---

 Summary: Add as.data.frame() as a synonym for collect()
 Key: SPARK-10807
 URL: https://issues.apache.org/jira/browse/SPARK-10807
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 1.5.0
Reporter: Oscar D. Lara Yejas
Priority: Minor
 Fix For: 1.5.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10807) Add as.data.frame() as a synonym for collect()

2015-09-24 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906760#comment-14906760
 ] 

Oscar D. Lara Yejas commented on SPARK-10807:
-

I'm working on this one.

Thanks,
Oscar

> Add as.data.frame() as a synonym for collect()
> --
>
> Key: SPARK-10807
> URL: https://issues.apache.org/jira/browse/SPARK-10807
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Priority: Minor
> Fix For: 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10808) LDA user guide: discuss running time of LDA

2015-09-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10808:
-

 Summary: LDA user guide: discuss running time of LDA
 Key: SPARK-10808
 URL: https://issues.apache.org/jira/browse/SPARK-10808
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Joseph K. Bradley
Priority: Minor


Based on feedback like [SPARK-10791], we should discuss the computational and 
communication complexity of LDA and its optimizers in the MLlib Programming 
Guide.  E.g.:
* Online LDA can be faster than EM.
* To make online LDA run faster, you can use a smaller miniBatchFraction.
* Communication
** For EM, communication on each iteration is on the order of # topics * 
(vocabSize + # docs).
** For online LDA, communication on each iteration is on the order of # topics 
* vocabSize.
* Decreasing vocabSize and # topics can speed things up (see the rough 
arithmetic sketch below).
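
As a rough illustration of the communication items above, using the workload 
figures reported in [SPARK-10791] (the numbers are that report's, not new 
benchmarks):

{code}
// k = 100 topics, V = 105k vocabulary, D = 3.4M documents (from SPARK-10791).
val k = 100L
val V = 105000L
val D = 3400000L
val emPerIter     = k * (V + D) // ~350.5M values communicated per EM iteration
val onlinePerIter = k * V       // ~10.5M values per online LDA iteration
{code}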




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10808) LDA user guide: discuss running time of LDA

2015-09-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10808:
--
Description: 
Based on feedback like [SPARK-10791], we should discuss the computational and 
communication complexity of LDA and its optimizers in the MLlib Programming 
Guide.  E.g.:
* Online LDA can be faster than EM.
* To make online LDA run faster, you can use a smaller miniBatchFraction.
* Communication
** For EM, communication on each iteration is on the order of # topics * 
(vocabSize + # docs).
** For online LDA, communication on each iteration is on the order of # topics 
* vocabSize.
* Decreasing vocabSize and # topics can speed things up.  It's often fine to 
eliminate uncommon words, unless you are trying to create a very large number 
of topics.


  was:
Based on feedback like [SPARK-10791], we should discuss the computational and 
communication complexity of LDA and its optimizers in the MLlib Programming 
Guide.  E.g.:
* Online LDA can be faster than EM.
* To make online LDA run faster, you can use a smaller miniBatchFraction.
* Communication
** For EM, communication on each iteration is on the order of # topics * 
(vocabSize + # docs).
** For online LDA, communication on each iteration is on the order of # topics 
* vocabSize.
* Decreasing vocabSize and # topics can speed things up.



> LDA user guide: discuss running time of LDA
> ---
>
> Key: SPARK-10808
> URL: https://issues.apache.org/jira/browse/SPARK-10808
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Based on feedback like [SPARK-10791], we should discuss the computational and 
> communication complexity of LDA and its optimizers in the MLlib Programming 
> Guide.  E.g.:
> * Online LDA can be faster than EM.
> * To make online LDA run faster, you can use a smaller miniBatchFraction.
> * Communication
> ** For EM, communication on each iteration is on the order of # topics * 
> (vocabSize + # docs).
> ** For online LDA, communication on each iteration is on the order of # 
> topics * vocabSize.
> * Decreasing vocabSize and # topics can speed things up.  It's often fine to 
> eliminate uncommon words, unless you are trying to create a very large number 
> of topics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-24 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906768#comment-14906768
 ] 

Ian commented on SPARK-10741:
-

The org.apache.spark.sql.AnalysisException is fixed, but the write path seems 
broken: the "INSERT INTO/OVERWRITE" statement does not seem to populate data now.

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

2015-09-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10809:
-

 Summary: Single-document topicDistributions method for 
LocalLDAModel
 Key: SPARK-10809
 URL: https://issues.apache.org/jira/browse/SPARK-10809
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


We could provide a single-document topicDistributions method for LocalLDAModel 
to allow for quick queries which avoid RDD operations.  Currently, the user 
must use an RDD of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906776#comment-14906776
 ] 

Joseph K. Bradley commented on SPARK-10791:
---

This sounds like a question for the user list, not JIRA, but here are some 
thoughts:

Was this run on a single machine or in parallel?  MLlib is of course optimized 
to scale with parallelism, rather than to run on a single machine.

I suspect you could speed up training some.  Check out [SPARK-10808] for some 
thoughts.

The topicDistributions method could be improved for your use case, if your 
"input" is a small set of documents.  I just made [SPARK-10809] to track that.  
If you are using a big batch of documents, then parallelization should help.

I'll close this for now since I think the JIRAs I just made should cover the 
issues.
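
For reference, here is a minimal sketch of the current RDD-based query path 
(the model path and vector contents are hypothetical), which is the overhead a 
single-document method like the one proposed in [SPARK-10809] would avoid:

{code}
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Load once, then query. Today even a one-document query goes through an RDD,
// paying job-scheduling overhead on every call.
val model = LocalLDAModel.load(sc, "hdfs:///models/lda")
val doc: Vector = Vectors.sparse(105000, Array(1, 7, 42), Array(2.0, 1.0, 3.0))
val dist = model.topicDistributions(sc.parallelize(Seq((0L, doc)))).first()._2
{code}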

> Optimize MLlib LDA topic distribution query performance
> ---
>
> Key: SPARK-10791
> URL: https://issues.apache.org/jira/browse/SPARK-10791
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
> Environment: Ubuntu 13.10, Oracle Java 8
>Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size 
> and ~3.4 M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit 
> training with the same data on the same system took ~5 minutes. 
> Loading the persisted model from disk (~2 minutes), as well as querying LDA 
> model topic distributions (~4 seconds for one document), are also quite slow 
> operations.
> Our application queries the LDA model topic distribution (for one doc at a 
> time) as part of an end-user operation execution flow, so a ~4 second 
> execution time is very problematic.
> The log includes the following message which, AFAIK, should mean that 
> netlib-java is using a machine-optimised native implementation: 
> "com.github.fommil.jni.JniLoader - successfully loaded 
> /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable 
> change in training performance. Model loading time was reduced to ~ 5 seconds 
> from ~ 2 minutes (now persisted as LocalLDAModel). However, query / 
> prediction time was unchanged.
> Unfortunately, this is the critical performance characteristic in our case.
> I did some profiling for my LDA prototype code that requests topic 
> distributions from a model. According to Java Mission Control more than 80 % 
> of execution time during sample interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50;
> 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
> My query test code looks like this essentially:
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
> ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
> seems to take about 4 seconds to execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10791.
-
Resolution: Done

> Optimize MLlib LDA topic distribution query performance
> ---
>
> Key: SPARK-10791
> URL: https://issues.apache.org/jira/browse/SPARK-10791
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
> Environment: Ubuntu 13.10, Oracle Java 8
>Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size 
> and ~3.4 M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit 
> training with the same data on the same system took ~5 minutes. 
> Loading the persisted model from disk (~2 minutes), as well as querying LDA 
> model topic distributions (~4 seconds for one document), are also quite slow 
> operations.
> Our application queries the LDA model topic distribution (for one doc at a 
> time) as part of an end-user operation execution flow, so a ~4 second 
> execution time is very problematic.
> The log includes the following message which, AFAIK, should mean that 
> netlib-java is using a machine-optimised native implementation: 
> "com.github.fommil.jni.JniLoader - successfully loaded 
> /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable 
> change in training performance. Model loading time was reduced to ~ 5 seconds 
> from ~ 2 minutes (now persisted as LocalLDAModel). However, query / 
> prediction time was unchanged.
> Unfortunately, this is the critical performance characteristic in our case.
> I did some profiling for my LDA prototype code that requests topic 
> distributions from a model. According to Java Mission Control more than 80 % 
> of execution time during sample interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50;
> 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
> My query test code looks like this essentially:
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
> ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
> seems to take about 4 seconds to execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10790) Dynamic Allocation does not request any executors if first stage needs less than or equal to spark.dynamicAllocation.initialExecutors

2015-09-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906787#comment-14906787
 ] 

Saisai Shao commented on SPARK-10790:
-

Hi [~jonathak], let me try to understand your scenario:

1. In your Spark cluster you have dynamic allocation enabled with the minimum 
and initial number of executors set, for example:

spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 3

2. You run a Spark job whose resource requirements can already be satisfied by 
the current executors (so there is no need to request new ones), for example:

sc.parallelize(1 to 100, 1).collect()

Here this job will only have ONE task, so the current 2 executors (with 2 cores 
each) can satisfy the resource requirement, and there is no need to request new 
executors.

Is that the scenario you described?

Assuming it is, I tested locally in my environment and saw no such "hang", nor 
the "Initial job has not accepted any resources; check your cluster UI to ensure 
that workers are registered and have sufficient resources" warning you mentioned.

1. Taking this as the example, Spark already has 2 executors with 4 cores in 
total, so submitting jobs will not trigger the "Initial job has not accepted 
any resources; check your cluster UI to ensure that workers are registered and 
have sufficient resources" problem, since the resources are sufficient.
2. "ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
that it needs to request any executors" is expected, since your current 
resources are sufficient.
3. "ExecutorAllocationManager does not request any executors while the 
application is still initializing": the initializing state finishes once you 
submit a job. So when you submit a job, ExecutorAllocationManager's internal 
state is actually no longer initializing, and it can ramp executors up and down 
according to load.

I'm not sure whether this is exactly your scenario; basically I cannot 
reproduce the problem. Can you describe it more specifically?
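
For anyone trying to reproduce this, a minimal sketch of the setup under 
discussion (the cluster manager and the exact values are assumptions):

{code}
// Launch with dynamic allocation enabled, e.g.:
//   spark-shell --conf spark.dynamicAllocation.enabled=true \
//     --conf spark.dynamicAllocation.minExecutors=2 \
//     --conf spark.dynamicAllocation.initialExecutors=3

// Then run a first stage with a single task, i.e. fewer tasks than
// spark.dynamicAllocation.initialExecutors:
sc.parallelize(1 to 100, 1).collect()
{code}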






> Dynamic Allocation does not request any executors if first stage needs less 
> than or equal to spark.dynamicAllocation.initialExecutors
> -
>
> Key: SPARK-10790
> URL: https://issues.apache.org/jira/browse/SPARK-10790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
>Reporter: Jonathan Kelly
>
> If you set spark.dynamicAllocation.initialExecutors > 0 (or 
> spark.dynamicAllocation.minExecutors, since 
> spark.dynamicAllocation.initialExecutors defaults to 
> spark.dynamicAllocation.minExecutors), and the number of tasks in the first 
> stage of your job is less than or equal to this min/init number of executors, 
> dynamic allocation won't actually request any executors and will just hang 
> indefinitely with the warning "Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources".
> The cause appears to be that ExecutorAllocationManager does not request any 
> executors while the application is still initializing, but it still sets the 
> initial value of numExecutorsTarget to 
> spark.dynamicAllocation.initialExecutors. Once the job is running and has 
> submitted its first task, if the first task does not need more than 
> spark.dynamicAllocation.initialExecutors, 
> ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() does not think 
> that it needs to request any executors, so it doesn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906786#comment-14906786
 ] 

Yin Huai commented on SPARK-10741:
--

Any error?

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-24 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906801#comment-14906801
 ] 

Ian commented on SPARK-10741:
-

Two of my tests failed. 
The query returns nothing. 

{code}
  test("test insert overwrite parquet ") {
val ddl= List(
  "DROP TABLE IF EXISTS tmp",
  "DROP TABLE IF EXISTS test",
  "CREATE TABLE IF NOT EXISTS tmp ( c1 string, c2 int )",
  """INSERT INTO TABLE tmp select "test1" as c1, (count(*)+1) *10 as c2 
from tmp""",
  """INSERT INTO TABLE tmp select "test1" as c1, (count(*)+1) *10 as c2 
from tmp""",
  """INSERT INTO TABLE tmp select "test2" as c1, (count(*)+1) *10 as c2 
from tmp""",
  """INSERT INTO TABLE tmp select "test2" as c1, (count(*)+1) *10 as c2 
from tmp""",
  "CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET",
  "INSERT OVERWRITE TABLE test SELECT * FROM tmp"
)

ddl.foreach{ x =>
  TestHive.sql(x).collect()
}

val tmp = TestHive.sql("select c1, c2 from tmp").collect()
val test = TestHive.sql("select c1, c2 from test").collect()
assert(tmp === test)
  }

Array([test1,10], [test1,20], [test2,30], [test2,40]) did not equal Array()
ScalaTestFailureLocation: 
org.apache.spark.sql.hive.ParquetRelationTestSuite$$anonfun$1 at 
(ParquetRelationTestSuite.scala:29)
org.scalatest.exceptions.TestFailedException: Array([test1,10], [test1,20], 
[test2,30], [test2,40]) did not equal Array()
{code}

{code}
  test("test insert into parquet ") {
val ddl= List(
  "DROP TABLE IF EXISTS tmp",
  "DROP TABLE IF EXISTS test",
  "CREATE TABLE IF NOT EXISTS tmp ( c1 string, c2 int )",
  "CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET",
  """INSERT INTO TABLE test select "test1" as c1, (count(*)+1) *10 as c2 
from test""",
  """INSERT INTO TABLE test select "test1" as c1, (count(*)+1) *10 as c2 
from test""",
  """INSERT INTO TABLE test select "test2" as c1, (count(*)+1) *10 as c2 
from test""",
  """INSERT INTO TABLE test select "test2" as c1, (count(*)+1) *10 as c2 
from test"""
)
ddl.foreach{ x =>
  TestHive.sql(x).collect()
}

val test = TestHive.sql("select c1, c2 from test").collect()
assert(test.length == 4)
  }

Array() had length 0 instead of expected length 4
ScalaTestFailureLocation: 
org.apache.spark.sql.hive.ParquetRelationTestSuite$$anonfun$2 at 
(ParquetRelationTestSuite.scala:48)
org.scalatest.exceptions.TestFailedException: Array() had length 0 instead of 
expected length 4

{code}

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.a

[jira] [Closed] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10487.
-
Resolution: Not A Problem

> MLlib model fitting causes DataFrame write to break with OutOfMemory exception
> --
>
> Key: SPARK-10487
> URL: https://issues.apache.org/jira/browse/SPARK-10487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tried in a centos-based 1-node YARN in docker and on a 
> real-world CDH5 cluster
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest 
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
> -DzincPort=3034
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor 
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
>Reporter: Zoltan Toth
>
> After fitting a _spark.ml_ or _mllib model_ in *cluster* deploy mode, no 
> dataframes can be written to hdfs. The driver receives an OutOfMemory 
> exception during the writing. It seems, however, that the file gets written 
> successfully.
>  * This happens both in SparkR and pyspark
>  * Only happens in cluster deploy mode
>  * The write fails regardless the size of the dataframe and whether the 
> dataframe is associated with the ml model.
> REPRO:
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
> from pyspark.ml.classification import LogisticRegression
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
> conf = SparkConf().setAppName("LogRegTest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> sqlContext.setConf("park.sql.parquet.compression.codec", "uncompressed")
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5
> df = training.toDF()
> reg = LogisticRegression().setMaxIter(10).setRegParam(0.01)
> model = reg.fit(df)
> # Note that this is a brand new dataframe:
> one_df = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5.toDF()
> one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10487) MLlib model fitting causes DataFrame write to break with OutOfMemory exception

2015-09-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906802#comment-14906802
 ] 

Joseph K. Bradley commented on SPARK-10487:
---

As far as I can tell, there isn't a huge change between R and Scala, or between 
fitting and not fitting.  I think it's because of the small amount of memory 
available.  I'll close this for now.
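
If limited driver memory is indeed the cause, the first knob to try is the 
driver heap size; a sketch of the submit-time setting (the 2g value and the 
application file name are arbitrary assumptions; driver memory must be set 
before the driver JVM starts):

{code}
spark-submit --deploy-mode cluster --conf spark.driver.memory=2g app.py
{code}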

> MLlib model fitting causes DataFrame write to break with OutOfMemory exception
> --
>
> Key: SPARK-10487
> URL: https://issues.apache.org/jira/browse/SPARK-10487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tried in a centos-based 1-node YARN in docker and on a 
> real-world CDH5 cluster
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest 
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
> -DzincPort=3034
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor 
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any, 
> capability: )
>Reporter: Zoltan Toth
>
> After fitting a _spark.ml_ or _mllib model_ in *cluster* deploy mode, no 
> dataframes can be written to hdfs. The driver receives an OutOfMemory 
> exception during the writing. It seems, however, that the file gets written 
> successfully.
>  * This happens both in SparkR and pyspark
>  * Only happens in cluster deploy mode
>  * The write fails regardless the size of the dataframe and whether the 
> dataframe is associated with the ml model.
> REPRO:
> {code}
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
> from pyspark.ml.classification import LogisticRegression
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
> conf = SparkConf().setAppName("LogRegTest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> sqlContext.setConf("park.sql.parquet.compression.codec", "uncompressed")
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5
> df = training.toDF()
> reg = LogisticRegression().setMaxIter(10).setRegParam(0.01)
> model = reg.fit(df)
> # Note that this is a brand new dataframe:
> one_df = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5.toDF()
> one_df.write.mode("overwrite").parquet("/tmp/df.parquet")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10773) Repartition operation failing on RDD with "argument type mismatch" error

2015-09-24 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906803#comment-14906803
 ] 

Andrew Or commented on SPARK-10773:
---

I believe this is fixed in 1.5.0:
https://issues.apache.org/jira/browse/SPARK-7527

[~dafox777] Could you verify?
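
A quick verification sketch against 1.5.0, reusing the repro from the 
description (the input file is the one named in the report):

{code}
// Should complete without the ClosureCleaner "argument type mismatch" error
// if SPARK-7527 indeed covers this path.
val data = sc.textFile("banana-big.tsv")
data.repartition(5).count()
{code}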

> Repartition operation failing on RDD with "argument type mismatch" error
> 
>
> Key: SPARK-10773
> URL: https://issues.apache.org/jira/browse/SPARK-10773
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Da Fox
>
> Hello,
> An error occurs in the following Spark application:
> {code}
> object RunSpark {
> def main(args: Array[String]) {
> val sparkContext: SparkContext = new SparkContext()
> val data: RDD[String] = sparkContext.textFile("banana-big.tsv")
> val repartitioned: RDD[String] = data.repartition(5)
> val mean: Double = repartitioned
> .groupBy((s: String) => s.split("\t")(1))
> .mapValues((strings: Iterable[String]) => strings.size)
> .values.mean()
> println(mean)
> }
> }
> {code}
> The exception:
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: argument type 
> mismatch
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$instantiateClass(ClosureCleaner.scala:330)
>   at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$22.apply(ClosureCleaner.scala:268)
>   at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$22.apply(ClosureCleaner.scala:262)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:262)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:700)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:699)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:699)
>   at org.apache.spark.rdd.RDD$$anonfun$coalesce$1.apply(RDD.scala:381)
>   at org.apache.spark.rdd.RDD$$anonfun$coalesce$1.apply(RDD.scala:367)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.coalesce(RDD.scala:366)
>   at org.apache.spark.rdd.RDD$$anonfun$repartition$1.apply(RDD.scala:342)
>   at org.apache.spark.rdd.RDD$$anonfun$repartition$1.apply(RDD.scala:342)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.repartition(RDD.scala:341)
>   at repartitionissue.RunSpark$.main(RunSpark.scala:10)
>   at repartitionissue.RunSpark.main(RunSpark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
>   at org.apache.spark.deploy.Spa

[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2015-09-24 Thread Neal Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906807#comment-14906807
 ] 

Neal Yin commented on SPARK-6028:
-

[~rxin] I am wondering why Spark wants to remove the Akka dependency. Is Akka 
a poor fit for Spark's RPC implementation?

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> The network transport module implements a low-level RPC interface. We can 
> build a new RPC implementation on top of it to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing
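
To make the discussion concrete, here is a rough, purely illustrative Scala 
sketch of the shape a transport-agnostic RPC layer could take; the trait and 
method names below are assumptions, not the interface from the design document:

{code}
import scala.concurrent.Future

// Context handed to an endpoint when a caller expects a response.
trait RpcCallContext {
  def reply(response: Any): Unit
  def sendFailure(e: Throwable): Unit
}

// An endpoint handles incoming messages; it knows nothing about the wire.
trait RpcEndpoint {
  def receive: PartialFunction[Any, Unit]
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
}

// A reference to a (possibly remote) endpoint.
trait RpcEndpointRef {
  def send(message: Any): Unit          // fire-and-forget
  def ask[T](message: Any): Future[T]   // request/reply
}

// The environment wires endpoints to the underlying transport (Akka today,
// the network transport module in the new implementation) and hands out refs.
trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
}
{code}

Keeping callers coded against an interface like this is what would allow the 
Akka-backed and transport-module-backed implementations to be swapped without 
touching the rest of Spark Core.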






[jira] [Comment Edited] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-24 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906801#comment-14906801
 ] 

Ian edited comment on SPARK-10741 at 9/24/15 6:46 PM:
--

Two of my tests failed: the query returns nothing.
Also see
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42962/testReport/

These two could be related.

{code}
  test("test insert overwrite parquet ") {
    val ddl = List(
      "DROP TABLE IF EXISTS tmp",
      "DROP TABLE IF EXISTS test",
      "CREATE TABLE IF NOT EXISTS tmp ( c1 string, c2 int )",
      """INSERT INTO TABLE tmp select "test1" as c1, (count(*)+1) *10 as c2 from tmp""",
      """INSERT INTO TABLE tmp select "test1" as c1, (count(*)+1) *10 as c2 from tmp""",
      """INSERT INTO TABLE tmp select "test2" as c1, (count(*)+1) *10 as c2 from tmp""",
      """INSERT INTO TABLE tmp select "test2" as c1, (count(*)+1) *10 as c2 from tmp""",
      "CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET",
      "INSERT OVERWRITE TABLE test SELECT * FROM tmp"
    )

    ddl.foreach { x =>
      TestHive.sql(x).collect()
    }

    val tmp = TestHive.sql("select c1, c2 from tmp").collect()
    val test = TestHive.sql("select c1, c2 from test").collect()
    assert(tmp === test)
  }

Array([test1,10], [test1,20], [test2,30], [test2,40]) did not equal Array()
ScalaTestFailureLocation:
org.apache.spark.sql.hive.ParquetRelationTestSuite$$anonfun$1 at
(ParquetRelationTestSuite.scala:29)
org.scalatest.exceptions.TestFailedException: Array([test1,10], [test1,20],
[test2,30], [test2,40]) did not equal Array()
{code}

{code}
  test("test insert into parquet ") {
    val ddl = List(
      "DROP TABLE IF EXISTS tmp",
      "DROP TABLE IF EXISTS test",
      "CREATE TABLE IF NOT EXISTS tmp ( c1 string, c2 int )",
      "CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS PARQUET",
      """INSERT INTO TABLE test select "test1" as c1, (count(*)+1) *10 as c2 from test""",
      """INSERT INTO TABLE test select "test1" as c1, (count(*)+1) *10 as c2 from test""",
      """INSERT INTO TABLE test select "test2" as c1, (count(*)+1) *10 as c2 from test""",
      """INSERT INTO TABLE test select "test2" as c1, (count(*)+1) *10 as c2 from test"""
    )

    ddl.foreach { x =>
      TestHive.sql(x).collect()
    }

    val test = TestHive.sql("select c1, c2 from test").collect()
    assert(test.length == 4)
  }

Array() had length 0 instead of expected length 4
ScalaTestFailureLocation:
org.apache.spark.sql.hive.ParquetRelationTestSuite$$anonfun$2 at
(ParquetRelationTestSuite.scala:48)
org.scalatest.exceptions.TestFailedException: Array() had length 0 instead of
expected length 4
{code}
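
Distilled down, the failing path appears to be any INSERT into a Hive table 
STORED AS PARQUET followed by a read of that table. A minimal spark-shell 
sketch (assuming a Hive-enabled {{sqlContext}}; the {{count(*)}} trick keeps 
the SELECT producing one row even over an empty table):

{code}
sqlContext.sql("DROP TABLE IF EXISTS test")
sqlContext.sql("CREATE TABLE test (c1 STRING, c2 INT) STORED AS PARQUET")
// An aggregate over an empty table still yields one row, so this inserts one row.
sqlContext.sql("""INSERT INTO TABLE test SELECT "test1" AS c1, (count(*) + 1) * 10 AS c2 FROM test""")
// Expected: one row [test1,10]; with this bug the result comes back empty.
sqlContext.sql("SELECT c1, c2 FROM test").show()
{code}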


