[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-09 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322074#comment-15322074
 ] 

Weichen Xu commented on SPARK-15086:


If we do so, should we only rename the Java APIs of this type, or rename the Scala 
APIs in the same way?

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322075#comment-15322075
 ] 

Reynold Xin commented on SPARK-15086:
-

I was suggesting renaming both, so the two would be consistent and we still 
wouldn't break the old APIs.


> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-09 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322082#comment-15322082
 ] 

Weichen Xu commented on SPARK-15086:


OK. [~srowen] What do you think about it?

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15839) Maven doc JAR generation fails when JAVA_7_HOME is set

2016-06-09 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-15839:
--

 Summary: Maven doc JAR generation fails when JAVA_7_HOME is set
 Key: SPARK-15839
 URL: https://issues.apache.org/jira/browse/SPARK-15839
 Project: Spark
  Issue Type: Bug
  Components: Build, Project Infra
Affects Versions: 2.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


It looks like the nightly Maven snapshots broke after we set JAVA_7_HOME in the 
build: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
 It seems that passing {{-javabootclasspath}} to scalac using 
scala-maven-plugin ends up preventing the Scala library classes from being 
added to scalac's internal class path, causing compilation errors while 
building doc-jars.

There might be a principled fix to this inside of the scala-maven-plugin 
itself, but for now I propose that we simply omit the -javabootclasspath option 
during Maven doc-jar generation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15839) Maven doc JAR generation fails when JAVA_7_HOME is set

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322103#comment-15322103
 ] 

Apache Spark commented on SPARK-15839:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13573

> Maven doc JAR generation fails when JAVA_7_HOME is set
> --
>
> Key: SPARK-15839
> URL: https://issues.apache.org/jira/browse/SPARK-15839
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> It looks like the nightly Maven snapshots broke after we set JAVA_7_HOME in 
> the build: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/.
>  It seems that passing {{-javabootclasspath}} to scalac using 
> scala-maven-plugin ends up preventing the Scala library classes from being 
> added to scalac's internal class path, causing compilation errors while 
> building doc-jars.
> There might be a principled fix to this inside of the scala-maven-plugin 
> itself, but for now I propose that we simply omit the -javabootclasspath 
> option during Maven doc-jar generation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12712) test-dependencies.sh script fails when run against empty .m2 cache

2016-06-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12712.

   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 13568
[https://github.com/apache/spark/pull/13568]

> test-dependencies.sh script fails when run against empty .m2 cache
> --
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Stavros Kontopoulos
>Assignee: Josh Rosen
> Fix For: 2.0.0, 1.6.2
>
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> Check the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependencies? It seems empty. Should I use --replace-manifest?
> Reproducing it locally on that Jenkins instance I get this: 
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-2.12.jar
> -joda-time-2.9.jar
> -jodd-core-3.5.2.jar
> -jpam-1.1.jar
> -json-20090211.jar
> 

[jira] [Created] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)
Ernst Sjöstrand created SPARK-15840:
---

 Summary: New csv reader does not "determine the input schema"
 Key: SPARK-15840
 URL: https://issues.apache.org/jira/browse/SPARK-15840
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.0
Reporter: Ernst Sjöstrand


When testing the new csv reader I found that it would not determine the input 
schema as is stated in the documentation.
(I used this documentation: 
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
 )

So either there is a bug in the implementation or in the documentation.

This also means that things like dateFormat seem to be ignored.

Here's a quick test in pyspark (using Python3):

a = spark.read.csv("/home/ernst/test.csv")
a.printSchema()
print(a.dtypes)
a.show()

root
 |-- _c0: string (nullable = true)
[('_c0', 'string')]
+---+
|_c0|
+---+
|  1|
|  2|
|  3|
|  4|
+---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322127#comment-15322127
 ] 

Ernst Sjöstrand commented on SPARK-15840:
-

The old Databricks CSV reader had an option called inferSchema, but that's gone now...

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> root
>  |-- _c0: string (nullable = true)
> [('_c0', 'string')]
> +---+
> |_c0|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322137#comment-15322137
 ] 

Ernst Sjöstrand commented on SPARK-15840:
-

Perhaps related to SPARK-13667 ?

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> root
>  |-- _c0: string (nullable = true)
> [('_c0', 'string')]
> +---+
> |_c0|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322140#comment-15322140
 ] 

Hyukjin Kwon commented on SPARK-15840:
--

There is an {{inferSchema}} option, but it seems it was missed by mistake. 
{{inferSchema}} should work. I will make a PR for this if you won't.
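
For illustration, a minimal PySpark sketch of what enabling the option would look like, assuming {{inferSchema}} behaves as described above (the expected output is an assumption, not a verified run):

{code:python}
# Hedged sketch: re-read the same file with inferSchema enabled.
# The option name comes from the comment above; the resulting type is what one
# would expect for a column of integers, not a verified result.
a = spark.read.option("inferSchema", "true").csv("/home/ernst/test.csv")
a.printSchema()    # _c0 should now be an inferred numeric type instead of string
print(a.dtypes)
{code}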

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> root
>  |-- _c0: string (nullable = true)
> [('_c0', 'string')]
> +---+
> |_c0|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322142#comment-15322142
 ] 

Ernst Sjöstrand commented on SPARK-15840:
-

Also, the documentation implies that an inferSchema option is not necessary... 
Please, go ahead!

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> root
>  |-- _c0: string (nullable = true)
> [('_c0', 'string')]
> +---+
> |_c0|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322140#comment-15322140
 ] 

Hyukjin Kwon edited comment on SPARK-15840 at 6/9/16 8:24 AM:
--

There is an {{inferSchema}} option, but it seems it was missed in the documentation 
by mistake. {{inferSchema}} should work. I will make a PR for this if you won't.


was (Author: hyukjin.kwon):
There is an {{inferSchema}} option, but it seems it was missed by mistake. 
{{inferSchema}} should work. I will make a PR for this if you won't.

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> root
>  |-- _c0: string (nullable = true)
> [('_c0', 'string')]
> +---+
> |_c0|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322143#comment-15322143
 ] 

Ernst Sjöstrand commented on SPARK-15840:
-

I have only tested this with Python; I'm not sure whether it also applies to Scala, etc.
Are there unit tests for type inference, custom dateFormats, etc.?

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> root
>  |-- _c0: string (nullable = true)
> [('_c0', 'string')]
> +---+
> |_c0|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ernst Sjöstrand updated SPARK-15840:

Description: 
When testing the new csv reader I found that it would not determine the input 
schema as is stated in the documentation.
(I used this documentation: 
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
 )

So either there is a bug in the implementation or in the documentation.

This also means that things like dateFormat seem to be ignored.

Here's a quick test in pyspark (using Python3):

a = spark.read.csv("/home/ernst/test.csv")
a.printSchema()
print(a.dtypes)
a.show()

 root
  |-- _c0: string (nullable = true)
 [('_c0', 'string')]
 +---+
 |_c0|
 +---+
 |  1|
 |  2|
 |  3|
 |  4|
 +---+

  was:
When testing the new csv reader I found that it would not determine the input 
schema as is stated in the documentation.
(I used this documentation: 
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
 )

So either there is a bug in the implementation or in the documentation.

This also means that things like dateFormat seem to be ignored.

Here's a quick test in pyspark (using Python3):

a = spark.read.csv("/home/ernst/test.csv")
a.printSchema()
print(a.dtypes)
a.show()

root
 |-- _c0: string (nullable = true)
[('_c0', 'string')]
+---+
|_c0|
+---+
|  1|
|  2|
|  3|
|  4|
+---+


> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Ernst Sjöstrand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ernst Sjöstrand updated SPARK-15840:

Description: 
When testing the new csv reader I found that it would not determine the input 
schema as is stated in the documentation.
(I used this documentation: 
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
 )

So either there is a bug in the implementation or in the documentation.

This also means that things like dateFormat seem to be ignored.

Here's a quick test in pyspark (using Python3):

a = spark.read.csv("/home/ernst/test.csv")
a.printSchema()
print(a.dtypes)
a.show()

{noformat}
 root
  |-- _c0: string (nullable = true)
 [('_c0', 'string')]
 +---+
 |_c0|
 +---+
 |  1|
 |  2|
 |  3|
 |  4|
 +---+
{noformat}

  was:
When testing the new csv reader I found that it would not determine the input 
schema as is stated in the documentation.
(I used this documentation: 
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
 )

So either there is a bug in the implementation or in the documentation.

This also means that things like dateFormat seem to be ignored.

Here's a quick test in pyspark (using Python3):

a = spark.read.csv("/home/ernst/test.csv")
a.printSchema()
print(a.dtypes)
a.show()

 root
  |-- _c0: string (nullable = true)
 [('_c0', 'string')]
 +---+
 |_c0|
 +---+
 |  1|
 |  2|
 |  3|
 |  4|
 +---+


> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> {noformat}
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322147#comment-15322147
 ] 

Hyukjin Kwon commented on SPARK-15840:
--

For custom dateFormat, here they are:

https://github.com/apache/spark/blob/32f2f95dbdfb21491e46d4b608fd4e8ac7ab8973/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L450-L496

and for type inference:

https://github.com/apache/spark/blob/32f2f95dbdfb21491e46d4b608fd4e8ac7ab8973/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L116-L151


> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> {noformat}
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322180#comment-15322180
 ] 

Sean Owen commented on SPARK-11765:
---

That's how it works now.

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> The Spark UI port starts at 4040, and the UI port is incremented by 1 for every 
> conflict.
> In our use case, we have several drivers running at the same time, which causes 
> the UI port to be assigned to 4045, a port treated as unsafe by Chrome 
> and Mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning the UI to these ports, or at least avoid 
> assigning the UI port to 4045, which is too close to the default port.
> If we'd like to accept this idea, I'm happy to work on it.
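
For illustration only, a small Python sketch of the port-selection idea described above (Spark's actual implementation is in Scala; the function and constant names below are invented for this example):

{code:python}
import socket

# Ports that browsers such as Chrome and Mozilla refuse to connect to;
# 4045 (lockd) is the one hit in practice. This set is illustrative only and
# could be extended with the full Chromium/Mozilla lists linked above.
BROWSER_UNSAFE_PORTS = {4045}

def find_ui_port(start=4040, max_retries=16):
    """Probe ports upward from `start`, skipping browser-unsafe ones."""
    port = start
    for _ in range(max_retries):
        if port not in BROWSER_UNSAFE_PORTS:
            try:
                with socket.socket() as s:
                    s.bind(("", port))
                return port
            except OSError:
                pass  # port already in use, try the next one
        port += 1
    raise RuntimeError("No free UI port found")
{code}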



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15841) [SPARK REPL] REPLSuite has in correct env set for a couple of tests.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15841:

Component/s: Spark Shell

> [SPARK REPL] REPLSuite has in correct env set for a couple of tests.
> 
>
> Key: SPARK-15841
> URL: https://issues.apache.org/jira/browse/SPARK-15841
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Prashant Sharma
>
> In ReplSuite, a test that can be adequately tested in local mode should not 
> really have to start a local-cluster. Similarly, a test that is actually verifying 
> a fix for a problem related to a distributed run is insufficiently exercised if it 
> only runs locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15841) [SPARK REPL] REPLSuite has in correct env set for a couple of tests.

2016-06-09 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-15841:
---

 Summary: [SPARK REPL] REPLSuite has in correct env set for a 
couple of tests.
 Key: SPARK-15841
 URL: https://issues.apache.org/jira/browse/SPARK-15841
 Project: Spark
  Issue Type: Bug
Reporter: Prashant Sharma


In ReplSuite, a test that can be adequately tested in local mode should not 
really have to start a local-cluster. Similarly, a test that is actually verifying 
a fix for a problem related to a distributed run is insufficiently exercised if it 
only runs locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15837) PySpark ML Word2Vec should support maxSentenceLength

2016-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322195#comment-15322195
 ] 

Sean Owen commented on SPARK-15837:
---

Yeah, ideally we would have suggested and done this in the first PR. 

> PySpark ML Word2Vec should support maxSentenceLength
> 
>
> Key: SPARK-15837
> URL: https://issues.apache.org/jira/browse/SPARK-15837
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-15793 adds maxSentenceLength for ML Word2Vec in Scala, we should also 
> add it in Python API.
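
As a purely hypothetical sketch, this is roughly what the Python API could look like once the parameter is added, assuming it mirrors the Scala maxSentenceLength parameter from SPARK-15793 (the keyword name and value below are assumptions):

{code:python}
from pyspark.ml.feature import Word2Vec

# maxSentenceLength does not exist in the Python API yet; it is the proposed
# addition this ticket tracks. Everything else is the existing pyspark.ml API.
w2v = Word2Vec(vectorSize=100, minCount=5,
               inputCol="text", outputCol="vectors",
               maxSentenceLength=1000)  # proposed parameter
model = w2v.fit(df)  # df: a DataFrame with an array-of-strings "text" column
{code}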



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15841) [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15841:

Summary: [SPARK REPL] REPLSuite has incorrect env set for a couple of 
tests.  (was: [SPARK REPL] REPLSuite has in correct env set for a couple of 
tests.)

> [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.
> ---
>
> Key: SPARK-15841
> URL: https://issues.apache.org/jira/browse/SPARK-15841
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Prashant Sharma
>
> In ReplSuite, a test that can be adequately tested in local mode should not 
> really have to start a local-cluster. Similarly, a test that is actually verifying 
> a fix for a problem related to a distributed run is insufficiently exercised if it 
> only runs locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15836) Spark 2.0/master maven snapshots are broken

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15836.
---
  Resolution: Duplicate
Target Version/s:   (was: 2.0.0)

> Spark 2.0/master maven snapshots are broken
> ---
>
> Key: SPARK-15836
> URL: https://issues.apache.org/jira/browse/SPARK-15836
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Yin Huai
>
> See 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.0-maven-snapshots/
>  and 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15841) [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15841:


Assignee: Apache Spark

> [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.
> ---
>
> Key: SPARK-15841
> URL: https://issues.apache.org/jira/browse/SPARK-15841
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>
> In ReplSuite, a test that can be adequately tested in local mode should not 
> really have to start a local-cluster. Similarly, a test that is actually verifying 
> a fix for a problem related to a distributed run is insufficiently exercised if it 
> only runs locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15841) [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322199#comment-15322199
 ] 

Apache Spark commented on SPARK-15841:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/13574

> [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.
> ---
>
> Key: SPARK-15841
> URL: https://issues.apache.org/jira/browse/SPARK-15841
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Prashant Sharma
>
> In ReplSuite, a test that can be adequately tested in local mode should not 
> really have to start a local-cluster. Similarly, a test that is actually verifying 
> a fix for a problem related to a distributed run is insufficiently exercised if it 
> only runs locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15841) [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15841:


Assignee: (was: Apache Spark)

> [SPARK REPL] REPLSuite has incorrect env set for a couple of tests.
> ---
>
> Key: SPARK-15841
> URL: https://issues.apache.org/jira/browse/SPARK-15841
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Prashant Sharma
>
> In ReplSuite, a test that can be adequately tested in local mode should not 
> really have to start a local-cluster. Similarly, a test that is actually verifying 
> a fix for a problem related to a distributed run is insufficiently exercised if it 
> only runs locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15697) [SPARK REPL] unblock some of the useful repl commands.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15697:

Description: 
"implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
However, they work fine in most cases. It is clear we don't support them they 
are part of the scala repl. What is the harm in unblocking them, given they are 
useful ?

In previous versions of spark we disabled these commands because it was 
difficult to support them with out customization and the associated maintenance 
burden. 

Symantics of reset are to be discussed in a separate 

  was:
"implicits", "javap", "power", "type", "kind", "reset" commands in repl are 
blocked. However, they work fine in most cases. It is clear we don't support 
them they are part of the scala repl. What is the harm in unblocking them, 
given they are useful ?

In previous versions of spark we disabled these commands because it was 
difficult to support them with out customization.


> [SPARK REPL] unblock some of the useful repl commands.
> --
>
> Key: SPARK-15697
> URL: https://issues.apache.org/jira/browse/SPARK-15697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Prashant Sharma
>Priority: Trivial
>
> "implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
> However, they work fine in most cases. It is clear we don't support them they 
> are part of the scala repl. What is the harm in unblocking them, given they 
> are useful ?
> In previous versions of spark we disabled these commands because it was 
> difficult to support them with out customization and the associated 
> maintenance burden. 
> Symantics of reset are to be discussed in a separate 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15697) [SPARK REPL] unblock some of the useful repl commands.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15697:

Description: 
"implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
However, they work fine in most cases. It is clear we don't support them they 
are part of the scala repl. What is the harm in unblocking them, given they are 
useful ?

In previous versions of spark we disabled these commands because it was 
difficult to support them with out customization and the associated maintenance 
burden. 

Symantics of reset are to be discussed in a separate issue.

  was:
"implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
However, they work fine in most cases. It is clear we don't support them they 
are part of the scala repl. What is the harm in unblocking them, given they are 
useful ?

In previous versions of spark we disabled these commands because it was 
difficult to support them with out customization and the associated maintenance 
burden. 

Symantics of reset are to be discussed in a separate 


> [SPARK REPL] unblock some of the useful repl commands.
> --
>
> Key: SPARK-15697
> URL: https://issues.apache.org/jira/browse/SPARK-15697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Prashant Sharma
>Priority: Trivial
>
> "implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
> However, they work fine in most cases. It is clear we don't support them they 
> are part of the scala repl. What is the harm in unblocking them, given they 
> are useful ?
> In previous versions of spark we disabled these commands because it was 
> difficult to support them with out customization and the associated 
> maintenance burden. 
> Symantics of reset are to be discussed in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15697) [SPARK REPL] unblock some of the useful repl commands.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15697:

Description: 
"implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
However, they work fine in all cases I have tried. It is clear we don't support 
them as they are part of the scala/scala repl project. What is the harm in 
unblocking them, given they are useful ?
In previous versions of spark we disabled these commands because it was 
difficult to support them without customization and the associated maintenance. 
Since the code base of scala repl was actually ported and maintained under 
spark source. Now that is not the situation and one can benefit from these 
commands in Spark REPL as much as in scala repl.

Symantics of reset are to be discussed in a separate issue.

  was:
"implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
However, they work fine in most cases. It is clear we don't support them they 
are part of the scala repl. What is the harm in unblocking them, given they are 
useful ?

In previous versions of spark we disabled these commands because it was 
difficult to support them with out customization and the associated maintenance 
burden. 

Symantics of reset are to be discussed in a separate issue.


> [SPARK REPL] unblock some of the useful repl commands.
> --
>
> Key: SPARK-15697
> URL: https://issues.apache.org/jira/browse/SPARK-15697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Prashant Sharma
>Priority: Trivial
>
> "implicits", "javap", "power", "type", "kind" commands in repl are blocked. 
> However, they work fine in all cases I have tried. It is clear we don't 
> support them as they are part of the scala/scala repl project. What is the 
> harm in unblocking them, given they are useful ?
> In previous versions of spark we disabled these commands because it was 
> difficult to support them without customization and the associated 
> maintenance. Since the code base of scala repl was actually ported and 
> maintained under spark source. Now that is not the situation and one can 
> benefit from these commands in Spark REPL as much as in scala repl.
> Symantics of reset are to be discussed in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15818) Upgrade to Hadoop 2.7.2

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15818:
--
Assignee: Adam Roberts

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: SPARK-15818
> URL: https://issues.apache.org/jira/browse/SPARK-15818
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Minor
> Fix For: 2.0.0
>
>
> I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
> Hadoop 2.7.0 is not ready for production use
> https://hadoop.apache.org/docs/r2.7.0/ states
> "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.6.0.
> This release is not yet ready for production use. Production users should use 
> 2.7.1 release and beyond."
> Hadoop 2.7.1 release notes:
> "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
> upon the previous release 2.7.0. This is the next stable release after Apache 
> Hadoop 2.6.x."
> And then Hadoop 2.7.2 release notes:
> "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.7.1."
> I've tested that this is OK with Intel hardware and IBM Java 8, so let's test it 
> with OpenJDK; ideally this will be pushed to branch-2.0 and master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15818) Upgrade to Hadoop 2.7.2

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15818.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13556
[https://github.com/apache/spark/pull/13556]

> Upgrade to Hadoop 2.7.2
> ---
>
> Key: SPARK-15818
> URL: https://issues.apache.org/jira/browse/SPARK-15818
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Adam Roberts
>Priority: Minor
> Fix For: 2.0.0
>
>
> I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
> Hadoop 2.7.0 is not ready for production use
> https://hadoop.apache.org/docs/r2.7.0/ states
> "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.6.0.
> This release is not yet ready for production use. Production users should use 
> 2.7.1 release and beyond."
> Hadoop 2.7.1 release notes:
> "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
> upon the previous release 2.7.0. This is the next stable release after Apache 
> Hadoop 2.6.x."
> And then Hadoop 2.7.2 release notes:
> "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
> upon the previous stable release 2.7.1."
> I've tested that this is OK with Intel hardware and IBM Java 8, so let's test it 
> with OpenJDK; ideally this will be pushed to branch-2.0 and master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15781) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15781:
--
Summary: Reduce spark.memory.fraction default to avoid overrunning old gen 
in JVM default config  (was: Misleading deprecated property in standalone 
cluster configuration documentation)

PS [~JonathanTaws] do you have some output from -verbose:gc that might confirm 
that the time being spent is in GCing the young generations? Just to make sure 
we are solving the right problem.

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented :
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecated warning : 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is this regarded as normal practice, to have deprecated fields documented in 
> the documentation? 
> I would have preferred to learn about the --num-executors property directly 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15802.
---
Resolution: Not A Problem

It looks like you show the answer in your question; I'm not sure what you're 
looking for.

> SparkSQL connection fail using shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:1/default""
> --
>
> Key: SPARK-15802
> URL: https://issues.apache.org/jira/browse/SPARK-15802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> Reproduction steps:
> 1. execute shell "sbin/start-thriftserver.sh --master yarn";
> 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default"";
> Actual result:
> The SparkSQL connection failed and the log shows the following:
> 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=}
> 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=}
> Note:
> The SparkSQL connection succeeded when using the shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"".
> Two parameters (transportMode & httpPath) have been added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15716.
---
Resolution: Not A Problem

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the 
> driver kept going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, and for most batches below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> with the "jcmd pid GC.run" command.
> I also describe more experiments in the second and third comments I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15801.
---
Resolution: Not A Problem

This much seems to be not a problem.

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option enables specifying the number of executors (per worker 
> ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15823:
-
Summary: Add @property for 'accuracy' in MulticlassMetrics  (was: Add 
@property for 'property' in MulticlassMetrics)

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc
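
For illustration only, a minimal sketch of the property-style access being 
requested; the class below is a stand-in, not the real 
pyspark.mllib.evaluation wrapper, and its internals are assumptions:

{code:python}
class MulticlassMetricsSketch(object):
    """Stand-in used only to illustrate exposing 'accuracy' as a property,
    consistent with weightedPrecision / weightedRecall."""

    def __init__(self, correct, total):
        # hypothetical pre-computed counts; the real class delegates to the JVM
        self._correct = correct
        self._total = total

    @property
    def accuracy(self):
        # read as metrics.accuracy, not metrics.accuracy()
        return float(self._correct) / self._total
{code}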



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15831) Kryo 2.21 TreeMap serialization bug causes random job failures with RDDs of HBase puts

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15831:
--
Affects Version/s: 1.5.2
   1.6.1
 Target Version/s:   (was: 1.5.0)

> Kryo 2.21 TreeMap serialization bug causes random job failures with RDDs of 
> HBase puts
> --
>
> Key: SPARK-15831
> URL: https://issues.apache.org/jira/browse/SPARK-15831
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Charles Gariépy-Ikeson
>
> This was found on Spark 1.5, but it seems that all Spark 1.x releases bring 
> in the problematic dependency in question.
> Kryo 2.21 has a bug when serializing TreeMap that causes intermittent 
> failures in Spark. This problem can be seen especially when sinking data to 
> HBase using an RDD of HBase Puts (which internally contain a TreeMap).
> Kryo fixed the issue in 2.21.1. The current workaround involves setting 
> "spark.kryo.referenceTracking" to false.
> For reference see:
> Kryo commit: 
> https://github.com/EsotericSoftware/kryo/commit/00ffc7ed443e022a8438d1e4c4f5b86fe4f9912b
> TreeMap Kryo Issue: https://github.com/EsotericSoftware/kryo/issues/112
> HBase Put Kryo Issue: https://github.com/EsotericSoftware/kryo/issues/428
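
As a hedged illustration, one way to apply the workaround above from PySpark; 
the application name is a placeholder and enabling Kryo explicitly is assumed:

{code:python}
from pyspark import SparkConf, SparkContext

# Workaround sketch: turn off Kryo reference tracking so the buggy
# TreeMap serialization path in Kryo 2.21 is not exercised.
conf = (SparkConf()
        .setAppName("hbase-put-sink")  # placeholder name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.referenceTracking", "false"))
sc = SparkContext(conf=conf)
{code}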



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322308#comment-15322308
 ] 

zhengruifeng commented on SPARK-15823:
--

{MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure.
Others seem ok.

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322309#comment-15322309
 ] 

zhengruifeng commented on SPARK-15823:
--

{MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure.
Others seem ok.

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322308#comment-15322308
 ] 

zhengruifeng edited comment on SPARK-15823 at 6/9/16 10:20 AM:
---

{{MulticlassMetrics.confusionMatrix}} may need {{@property}} too, but I am not 
sure.
Others seem ok.


was (Author: podongfeng):
{MulticlassMetrics.confusionMatrix} may need {@property} too, but I am not sure.
Others seem ok.

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15823) Add @property for 'accuracy' in MulticlassMetrics

2016-06-09 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-15823:
-
Comment: was deleted

(was: {MulticlassMetrics.confusionMatrix} may need {@property} too, but I am 
not sure.
Others seem ok.)

> Add @property for 'accuracy' in MulticlassMetrics
> -
>
> Key: SPARK-15823
> URL: https://issues.apache.org/jira/browse/SPARK-15823
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> 'accuracy' should be decorated with `@property` to keep step with other 
> methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, 
> `weightedRecall`, etc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322322#comment-15322322
 ] 

Jonathan Taws commented on SPARK-15801:
---

I don't think it is a problem, but it might be interesting to get a warning if 
the --num-executors option is used in standalone mode, to notify users that 
it's basically not doing anything, and to recommend using --executor-cores 
instead. 

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option enables specifying the number of executors (per worker 
> ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15781) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322324#comment-15322324
 ] 

Jonathan Taws commented on SPARK-15781:
---

By launching a session with {{SPARK_WORKER_INSTANCES}} set or just a regular 
one ? 

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented :
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecated warning : 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it considered normal practice to have deprecated fields documented in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 
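
For reference, a sketch of the non-deprecated route the warning above points 
to; the value 4 is only a placeholder:

{code:python}
from pyspark import SparkConf, SparkContext

# Configure executor instances via spark.executor.instances instead of the
# deprecated SPARK_WORKER_INSTANCES environment variable.
conf = SparkConf().set("spark.executor.instances", "4")  # placeholder value
sc = SparkContext(conf=conf)
{code}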



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15472) Add partitioned `csv`, `json`, `text` format support for FileStreamSink

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322325#comment-15322325
 ] 

Apache Spark commented on SPARK-15472:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13575

> Add partitioned `csv`, `json`, `text` format support for FileStreamSink
> ---
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, `text` 
> formats.
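
A rough sketch of how the requested sinks might be used from PySpark, assuming 
they mirror the existing partitioned `parquet` path; `df` is a streaming 
DataFrame and the paths and column name are placeholders:

{code:python}
# Hypothetical once csv is supported by FileStreamSink; today only parquet
# works this way in Structured Streaming.
query = (df.writeStream
         .format("csv")
         .option("path", "/tmp/stream-out")
         .option("checkpointLocation", "/tmp/stream-checkpoint")
         .partitionBy("date")
         .start())
{code}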



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15472) Add support for writing in `csv`, `json`, `text` formats in Structured Streaming

2016-06-09 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-15472:
--
Summary: Add support for writing in `csv`, `json`, `text` formats in 
Structured Streaming  (was: Add partitioned `csv`, `json`, `text` format 
support for FileStreamSink)

> Add support for writing in `csv`, `json`, `text` formats in Structured 
> Streaming
> 
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, `text` 
> formats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1882) Support dynamic memory sharing in Mesos

2016-06-09 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322364#comment-15322364
 ] 

Stavros Kontopoulos commented on SPARK-1882:


Does dynamic allocation help with the fragmentation problem in heterogeneous 
machines in some way? 


> Support dynamic memory sharing in Mesos
> ---
>
> Key: SPARK-1882
> URL: https://issues.apache.org/jira/browse/SPARK-1882
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> Fine grained mode Mesos currently supports sharing CPUs very well, but 
> requires that memory be pre-partitioned according to the executor memory 
> parameter.  Mesos supports dynamic memory allocation in addition to dynamic 
> CPU allocation, so we should utilize this feature in Spark.
> See below where when the Mesos backend accepts a resource offer it only 
> checks that there's enough memory to cover sc.executorMemory, and doesn't 
> ever take a fraction of the memory available.  The memory offer is accepted 
> all or nothing from a pre-defined parameter.
> Coarse mode:
> https://github.com/apache/spark/blob/3ce526b168050c572a1feee8e0121e1426f7d9ee/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L208
> Fine mode:
> https://github.com/apache/spark/blob/a5150d199ca97ab2992bc2bb221a3ebf3d3450ba/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15781) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322425#comment-15322425
 ] 

Sean Owen commented on SPARK-15781:
---

Oh I'm sorry, I put this comment on entirely the wrong JIRA -- too many tabs at 
once. This doesn't make any sense. I'll revert my change.

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented :
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecated warning : 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it considered normal practice to have deprecated fields documented in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15781:
--
Comment: was deleted

(was: PS [~JonathanTaws] do you have some output from -verbose:gc that might 
confirm that the time being spent is in GCing the young generations? just to 
make sure we are solving the right problem.)

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented :
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecated warning : 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it considered normal practice to have deprecated fields documented in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15781:
--
Summary: Misleading deprecated property in standalone cluster configuration 
documentation  (was: Reduce spark.memory.fraction default to avoid overrunning 
old gen in JVM default config)

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented :
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecated warning : 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it considered normal practice to have deprecated fields documented in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15796:
--
Summary: Reduce spark.memory.fraction default to avoid overrunning old gen 
in JVM default config  (was: Spark 1.6 default memory settings can cause heavy 
GC when caching)

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.
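
One way to apply the mitigations listed above from application code rather 
than the command line; a sketch only, and any single one of these settings was 
enough in the reporter's test:

{code:python}
from pyspark import SparkConf, SparkContext

# Sketch: lower the unified-memory fraction so cached data can fit in the
# JVM old generation under the default NewRatio=2 (per the analysis above).
conf = (SparkConf()
        .setAppName("cache-demo")  # placeholder name
        .set("spark.memory.fraction", "0.6"))
# Alternatives mentioned in the report:
#   .set("spark.memory.useLegacyMode", "true")
#   or pass -XX:NewRatio=3 via spark.driver.extraJavaOptions
sc = SparkContext(conf=conf)
{code}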



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322426#comment-15322426
 ] 

Sean Owen commented on SPARK-15796:
---

PS [~gfeher] do you have some output from -verbose:gc that might confirm that 
the time being spent is in GCing the young generations? just to make sure we 
are solving the right problem.

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15842) Add support for socket stream.

2016-06-09 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-15842:
---

 Summary: Add support for socket stream.
 Key: SPARK-15842
 URL: https://issues.apache.org/jira/browse/SPARK-15842
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Prashant Sharma


Streaming so far has an offset based source with all the available sources like 
file source and memory source that do not need additional capabilities to 
implement offset for any given range.

Socket stream at OS level has a very tiny buffer. Many message queues have the 
ability to keep the message lingering until it is read by the receiver end. 
ZeroMQ is one such example. However in the case of socket stream, this is not 
supported. 

The challenge here would be to implement a way to  buffer for a configurable 
amount of time and discuss strategies for overflow and underflow.

This JIRA will form the basis for implementing sources which do not have native 
support for lingering a message for any amount of time until it is read. It 
deals with design doc if necessary and supporting code to implement such 
sources.
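
Not part of the proposal itself, but a minimal sketch of the kind of 
time-bounded buffer described above, using drop-oldest as one possible 
overflow strategy; all names and limits are assumptions:

{code:python}
import time
from collections import deque


class LingerBuffer(object):
    """Keep records for up to linger_secs; on overflow, evict the oldest
    record (blocking or dropping new data would be alternative strategies)."""

    def __init__(self, linger_secs=60.0, max_records=10000):
        self.linger_secs = linger_secs
        self.buf = deque(maxlen=max_records)  # maxlen gives drop-oldest overflow

    def append(self, record):
        self.buf.append((time.time(), record))

    def drain(self):
        """Return records still inside the linger window and clear the buffer."""
        cutoff = time.time() - self.linger_secs
        out = [r for (t, r) in self.buf if t >= cutoff]
        self.buf.clear()
        return out
{code}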



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15842) Add support for socket stream.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15842:

Description: 
Streaming so far has an offset based sources with all the available sources 
like file-source and memory-source that do not need additional capabilities to 
implement offset for any given range.

Socket stream at OS level has a very tiny buffer. Many message queues have the 
ability to keep the message lingering until it is read by the receiver end. 
ZeroMQ is one such example. However in the case of socket stream, this is not 
supported. 

The challenge here would be to implement a way to  buffer for a configurable 
amount of time and discuss strategies for overflow and underflow.

This JIRA will form the basis for implementing sources which do not have native 
support for lingering a message for any amount of time until it is read. It 
deals with design doc if necessary and supporting code to implement such 
sources.

  was:
Streaming so far has an offset based source with all the available sources like 
file source and memory source that do not need additional capabilities to 
implement offset for any given range.

Socket stream at OS level has a very tiny buffer. Many message queues have the 
ability to keep the message lingering until it is read by the receiver end. 
ZeroMQ is one such example. However in the case of socket stream, this is not 
supported. 

The challenge here would be to implement a way to  buffer for a configurable 
amount of time and discuss strategies for overflow and underflow.

This JIRA will form the basis for implementing sources which do not have native 
support for lingering a message for any amount of time until it is read. It 
deals with design doc if necessary and supporting code to implement such 
sources.


> Add support for socket stream.
> --
>
> Key: SPARK-15842
> URL: https://issues.apache.org/jira/browse/SPARK-15842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>
> Streaming so far has an offset based sources with all the available sources 
> like file-source and memory-source that do not need additional capabilities 
> to implement offset for any given range.
> Socket stream at OS level has a very tiny buffer. Many message queues have 
> the ability to keep the message lingering until it is read by the receiver 
> end. ZeroMQ is one such example. However in the case of socket stream, this 
> is not supported. 
> The challenge here would be to implement a way to  buffer for a configurable 
> amount of time and discuss strategies for overflow and underflow.
> This JIRA will form the basis for implementing sources which do not have 
> native support for lingering a message for any amount of time until it is 
> read. It deals with design doc if necessary and supporting code to implement 
> such sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15842) Add support for socket stream.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-15842:
---

Assignee: Prashant Sharma

> Add support for socket stream.
> --
>
> Key: SPARK-15842
> URL: https://issues.apache.org/jira/browse/SPARK-15842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Streaming so far has an offset based sources with all the available sources 
> like file-source and memory-source that do not need additional capabilities 
> to implement offset for any given range.
> Socket stream at OS level has a very tiny buffer. Many message queues have 
> the ability to keep the message lingering until it is read by the receiver 
> end. ZeroMQ is one such example. However in the case of socket stream, this 
> is not supported. 
> The challenge here would be to implement a way to  buffer for a configurable 
> amount of time and discuss strategies for overflow and underflow.
> This JIRA will form the basis for implementing sources which do not have 
> native support for lingering a message for any amount of time until it is 
> read. It deals with design doc if necessary and supporting code to implement 
> such sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322440#comment-15322440
 ] 

Apache Spark commented on SPARK-15840:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/13576

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat are ignored, it seems.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> {noformat}
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+
> {noformat}
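
For comparison, schema inference currently has to be requested explicitly; 
whether that matches the documented behaviour is exactly what this ticket 
questions. Using the same file as above:

{code:python}
# With inferSchema enabled the column should come back typed (e.g. int)
# rather than the all-string default shown above.
a = spark.read.csv("/home/ernst/test.csv", inferSchema=True)
a.printSchema()
print(a.dtypes)
{code}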



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15840:


Assignee: (was: Apache Spark)

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat are ignored, it seems.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> {noformat}
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15840:


Assignee: Apache Spark

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>Assignee: Apache Spark
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that things like dateFormat are ignored, it seems.
> Here's a quick test in pyspark (using Python3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> {noformat}
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-09 Thread Willy Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322442#comment-15322442
 ] 

Willy Lee commented on SPARK-11765:
---

As of what version? I'll want to have our team upgrade.

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Spark UI port starts on 4040, and the UI port is incremented by 1 for every 
> conflict.
> In our use case, we have some drivers running at the same time, which makes 
> the UI port get assigned to 4045, which is treated as an unsafe port by 
> Chrome and Mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning the UI to these ports, or at least avoid 
> assigning the UI port to 4045, which is too close to the default port.
> If we'd like to accept this idea, I'm happy to work on it.
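
A sketch of the proposed behaviour, skipping browser-blocked ports while 
retrying; the unsafe-port set below is only a partial, assumed subset of 
Chrome's list, and Spark's actual retry logic lives in the JVM, not in Python:

{code:python}
# Partial, assumed subset of browser-blocked ports (4045 = lockd, 6000 = X11,
# 6665-6669 = IRC); the full list is in the Chromium source linked above.
UNSAFE_PORTS = {4045, 6000, 6665, 6666, 6667, 6668, 6669}


def next_candidate_port(port):
    """Return the next UI port to try, skipping browser-unsafe ports."""
    port += 1
    while port in UNSAFE_PORTS:
        port += 1
    return port
{code}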



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322446#comment-15322446
 ] 

Sean Owen commented on SPARK-11765:
---

For as long as I can remember it has iterated through several next available 
ports. This is not what this JIRA is about.

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Spark UI port starts on 4040, and the UI port is incremented by 1 for every 
> conflict.
> In our use case, we have some drivers running at the same time, which makes 
> the UI port get assigned to 4045, which is treated as an unsafe port by 
> Chrome and Mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning the UI to these ports, or at least avoid 
> assigning the UI port to 4045, which is too close to the default port.
> If we'd like to accept this idea, I'm happy to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15842) Add support for socket stream.

2016-06-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-15842:

Description: 
Streaming so far has offset based sources with all the available sources like 
file-source and memory-source that do not need additional capabilities to 
implement offset for any given range.

Socket stream at OS level has a very tiny buffer. Many message queues have the 
ability to keep the message lingering until it is read by the receiver end. 
ZeroMQ is one such example. However in the case of socket stream, this is not 
supported. 

The challenge here would be to implement a way to  buffer for a configurable 
amount of time and discuss strategies for overflow and underflow.

This JIRA will form the basis for implementing sources which do not have native 
support for lingering a message for any amount of time until it is read. It 
deals with design doc if necessary and supporting code to implement such 
sources.

  was:
Streaming so far has an offset based sources with all the available sources 
like file-source and memory-source that do not need additional capabilities to 
implement offset for any given range.

Socket stream at OS level has a very tiny buffer. Many message queues have the 
ability to keep the message lingering until it is read by the receiver end. 
ZeroMQ is one such example. However in the case of socket stream, this is not 
supported. 

The challenge here would be to implement a way to  buffer for a configurable 
amount of time and discuss strategies for overflow and underflow.

This JIRA will form the basis for implementing sources which do not have native 
support for lingering a message for any amount of time until it is read. It 
deals with design doc if necessary and supporting code to implement such 
sources.


> Add support for socket stream.
> --
>
> Key: SPARK-15842
> URL: https://issues.apache.org/jira/browse/SPARK-15842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Streaming so far has offset based sources with all the available sources like 
> file-source and memory-source that do not need additional capabilities to 
> implement offset for any given range.
> Socket stream at OS level has a very tiny buffer. Many message queues have 
> the ability to keep the message lingering until it is read by the receiver 
> end. ZeroMQ is one such example. However in the case of socket stream, this 
> is not supported. 
> The challenge here would be to implement a way to  buffer for a configurable 
> amount of time and discuss strategies for overflow and underflow.
> This JIRA will form the basis for implementing sources which do not have 
> native support for lingering a message for any amount of time until it is 
> read. It deals with design doc if necessary and supporting code to implement 
> such sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-09 Thread Willy Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322453#comment-15322453
 ] 

Willy Lee commented on SPARK-11765:
---

I'm sorry, I must not have been clear. I think it's broken behavior that the 
default creates an unreachable UI on the fifth iteration. There are defaults we 
could choose that would avoid this behavior for much longer. I understand there 
is a workaround, but that's not the same as having a fix.

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Spark UI port starts on 4040, and the UI port is incremented by 1 for every 
> conflict.
> In our use case, we have some drivers running at the same time, which makes 
> the UI port get assigned to 4045, which is treated as an unsafe port by 
> Chrome and Mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning the UI to these ports, or at least avoid 
> assigning the UI port to 4045, which is too close to the default port.
> If we'd like to accept this idea, I'm happy to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15843) Spark RAM issue

2016-06-09 Thread Sreetej Lakkam (JIRA)
Sreetej Lakkam created SPARK-15843:
--

 Summary: Spark RAM issue
 Key: SPARK-15843
 URL: https://issues.apache.org/jira/browse/SPARK-15843
 Project: Spark
  Issue Type: Question
  Components: Spark Shell
Affects Versions: 1.6.1
 Environment: 
RASPBIAN JESSIE
Full desktop image based on Debian Jessie
Reporter: Sreetej Lakkam


Trying to run Spark 1.6.1 on Hadoop 2.7.1 on a Raspberry Pi 3. Submitting 
spark-shell with 
sudo /opt/spark-1.6.1-bin-hadoop2.6/bin/spark-shell

produces an error:
Java HotSpot(TM) Client VM warning: INFO: os::commit_memory(0x4954, 
715915264, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 715915264 bytes for committing 
reserved memory.
# An error report file with more information is saved as:
# /home/pi/hs_err_pid2179.log


The log file hs_err_pid2179.log contains:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 715915264 bytes for committing 
reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2627), pid=2179, tid=1983399008
#
# JRE version:  (8.0_65-b17) (build )
# Java VM: Java HotSpot(TM) Client VM (25.65-b01 mixed mode, sharing linux-arm )
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#

---  T H R E A D  ---

Current thread (0x76207400):  JavaThread "Unknown thread" [_thread_in_vm, 
id=2200, stack(0x76335000,0x76385000)]

Stack: [0x76335000,0x76385000]

---  P R O C E S S  ---

Java Threads: ( => current thread )

Other Threads:

=>0x76207400 (exited) JavaThread "Unknown thread" [_thread_in_vm, id=2200, 
stack(0x76335000,0x76385000)]

VM state:not at safepoint (not fully initialized)

VM Mutex/Monitor currently owned by a thread: None

GC Heap History (0 events):
No events

Deoptimization events (0 events):
No events

Internal exceptions (0 events):
No events

Events (0 events):
No events


Dynamic libraries:
8000-9000 r-xp  b3:02 22725  
/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java
0001-00011000 rw-p  b3:02 22725  
/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java
003e5000-00406000 rw-p  00:00 0  [heap]
33dff000-33eaa000 rw-p  00:00 0 
33eaa000-33fff000 ---p  00:00 0 
33fff000-4954 rw-p  00:00 0 
740c3000-740c4000 rw-p  00:00 0 
740c4000-74143000 ---p  00:00 0 
74143000-7416b000 rwxp  00:00 0 
7416b000-76143000 ---p  00:00 0 
76143000-7615a000 r-xp  b3:02 22812  
/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/lib/arm/libzip.so
7615a000-76161000 ---p 00017000 b3:02 22812  
/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/lib/arm/libzip.so
76161000-76162000 rw-p 00016000 b3:02 22812  
/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/lib/arm/libzip.so
76162000-7616d000 r-xp  b3:02 3196   
/lib/arm-linux-gnueabihf/libnss_files-2.19.so
7616d000-7617c000 ---p b000 b3:02 3196   
/lib/arm-linux-gnueabihf/libnss_files-2.19.so
7617c000-7617d000 r--p a000 b3:02 3196   
/lib/arm-linux-gnueabihf/libnss_files-2.19.so
7617d000-7617e000 rw-p b000 b3:02 3196   
/lib/arm-linux-gnueabihf/libnss_files-2.19.so
7617e000-76187000 r-xp  b3:02 3204   
/lib/arm-linux-gnueabihf/libnss_nis-2.19.so
76187000-76196000 ---p 9000 b3:02 3204   
/lib/arm-linux-gnueabihf/libnss_nis-2.19.so
76196000-76197000 r--p 8000 b3:02 3204   
/lib/arm-linux-gnueabihf/libnss_nis-2.19.so
76197000-76198000 rw-p 9000 b3:02 3204   
/lib/arm-linux-gnueabihf/libnss_nis-2.19.so
76198000-761a9000 r-xp  b3:02 3193   
/lib/arm-linux-gnueabihf/libnsl-2.19.so
761a9000-761b8000 ---p 00011000 b3:02 3193   
/lib/arm-linux-gnueabihf/libnsl-2.19.so
761b8000-761b9000 r--p 0001 b3:02 3193   
/lib/arm-linux-gnueabihf/libnsl-2.19.so
761b9000-761ba000 rw-p 00011000 b3:02 3193   
/lib/arm-linux-gnueabihf/libnsl-2.19.so
761ba000-761bc000 rw-p  00:00 0 
761bc000-761c3000 r-xp  b3:02 3194   
/lib/arm-linux-gnueabihf/libnss_compat-2.19.so
761c3000-761d2000 ---p 7000 b3:02 3194   
/lib/arm-linux

[jira] [Comment Edited] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322322#comment-15322322
 ] 

Jonathan Taws edited comment on SPARK-15801 at 6/9/16 1:08 PM:
---

I don't think it is a problem, but it might be interesting to get a warning if 
the --num-executors option is used in stand alone mode to notify users that 
it's basically not doing anything, and recommend to use --executor-cores 
instead. 
I am available to do the fix if this seems necessary. 


was (Author: jonathantaws):
I don't think it is a problem, but it might be interesting to get a warning if 
the --num-executors option is used in stand alone mode to notify users that 
it's basically not doing anything, and recommend to use --executor-cores 
instead. 
I am available to the fix if this seems necessary. 

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option enables to specify the number of executors (per worker 
> ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322322#comment-15322322
 ] 

Jonathan Taws edited comment on SPARK-15801 at 6/9/16 1:08 PM:
---

I don't think it is a problem, but it might be interesting to get a warning if 
the --num-executors option is used in stand alone mode to notify users that 
it's basically not doing anything, and recommend to use --executor-cores 
instead. 
I am available to the fix if this seems necessary. 


was (Author: jonathantaws):
I don't think it is a problem, but it might be interesting to get a warning if 
the --num-executors option is used in stand alone mode to notify users that 
it's basically not doing anything, and recommend to use --executor-cores 
instead. 

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option enables to specify the number of executors (per worker 
> ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15843) Spark RAM issue

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15843.
---
  Resolution: Invalid
Target Version/s:   (was: 1.6.1)

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and ask 
questions at user@. 

This means your JVM ran out of OS memory, probably because you asked for more 
heap than the OS can allocate and perhaps swap is off. It's not a Spark issue.

> Spark RAM issue
> ---
>
> Key: SPARK-15843
> URL: https://issues.apache.org/jira/browse/SPARK-15843
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: RASPBIAN JESSIE
> Full desktop image based on Debian Jessie
>Reporter: Sreetej Lakkam
>  Labels: beginner, newbie
>
> Trying to run Spark 1.6.1 on Hadoop 2.7.1 over Raspberry Pi - 3. On 
> submitting spark-shell 
> sudo /opt/spark-1.6.1-bin-hadoop2.6/bin/spark-shell
> produces an error
> Java HotSpot(TM) Client VM warning: INFO: os::commit_memory(0x4954, 
> 715915264, 0) failed; error='Cannot allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 715915264 bytes for 
> committing reserved memory.
> # An error report file with more information is saved as:
> # /home/pi/hs_err_pid2179.log
> Find the log file hs_err_pid2179.log
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 715915264 bytes for 
> committing reserved memory.
> # Possible reasons:
> #   The system is out of physical RAM or swap space
> #   In 32 bit mode, the process size limit was hit
> # Possible solutions:
> #   Reduce memory load on the system
> #   Increase physical memory or swap space
> #   Check if swap backing store is full
> #   Use 64 bit Java on a 64 bit OS
> #   Decrease Java heap size (-Xmx/-Xms)
> #   Decrease number of Java threads
> #   Decrease Java thread stack sizes (-Xss)
> #   Set larger code cache with -XX:ReservedCodeCacheSize=
> # This output file may be truncated or incomplete.
> #
> #  Out of Memory Error (os_linux.cpp:2627), pid=2179, tid=1983399008
> #
> # JRE version:  (8.0_65-b17) (build )
> # Java VM: Java HotSpot(TM) Client VM (25.65-b01 mixed mode, sharing 
> linux-arm )
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> ---  T H R E A D  ---
> Current thread (0x76207400):  JavaThread "Unknown thread" [_thread_in_vm, 
> id=2200, stack(0x76335000,0x76385000)]
> Stack: [0x76335000,0x76385000]
> ---  P R O C E S S  ---
> Java Threads: ( => current thread )
> Other Threads:
> =>0x76207400 (exited) JavaThread "Unknown thread" [_thread_in_vm, id=2200, 
> stack(0x76335000,0x76385000)]
> VM state:not at safepoint (not fully initialized)
> VM Mutex/Monitor currently owned by a thread: None
> GC Heap History (0 events):
> No events
> Deoptimization events (0 events):
> No events
> Internal exceptions (0 events):
> No events
> Events (0 events):
> No events
> Dynamic libraries:
> 8000-9000 r-xp  b3:02 22725  
> /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java
> 0001-00011000 rw-p  b3:02 22725  
> /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java
> 003e5000-00406000 rw-p  00:00 0  [heap]
> 33dff000-33eaa000 rw-p  00:00 0 
> 33eaa000-33fff000 ---p  00:00 0 
> 33fff000-4954 rw-p  00:00 0 
> 740c3000-740c4000 rw-p  00:00 0 
> 740c4000-74143000 ---p  00:00 0 
> 74143000-7416b000 rwxp  00:00 0 
> 7416b000-76143000 ---p  00:00 0 
> 76143000-7615a000 r-xp  b3:02 22812  
> /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/lib/arm/libzip.so
> 7615a000-76161000 ---p 00017000 b3:02 22812  
> /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/lib/arm/libzip.so
> 76161000-76162000 rw-p 00016000 b3:02 22812  
> /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/lib/arm/libzip.so
> 76162000-7616d000 r-xp  b3:02 3196   
> /lib/arm-linux-gnueabihf/libnss_files-2.19.so
> 7616d000-7617c000 ---p b000 b3:02 3196   
> /lib/arm-linux-gnueabihf/libnss_files-2.19.so
> 7617c000-7617d000 r--p a000 b3:02 3196   
> /lib/arm-linux-gnueabihf/libnss_files-2.19.so
> 7617d000-7617e000 rw-p b000 b3:02 3196   
> /lib/arm-linux-gnueabihf/libnss_files-2.19.so
> 7617e000-76187000 r-xp  b3:02 3204   
> /lib/arm-linux-gnueabihf/libnss_nis-2.19.so
> 76187000-76196000 ---p 9000 b3:02 3204   
> /lib/arm-linux-gnueabihf/libnss_nis-2.19.so
> 76196000-76197000 r--p 8000 b3:02 3204   
> /lib/arm-linux-gnueabihf/li

[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322494#comment-15322494
 ] 

Sean Owen commented on SPARK-11765:
---

You can do two things -- pick another starting port, or limit the number of 
retries (spark.port.maxRetries). Maybe it should default to 5, which would 
tidily happen to address this problem for the default of 4040 (which we 
probably shouldn't change at this point). Or, passing 0 should cause it to bind 
to any available port.
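
For reference, a minimal sketch (not from this ticket; the port values are arbitrary) of how a driver can steer around the blocked range today with the existing settings:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Start the UI search above the browser-blocked 4045 and cap how far Spark
// will walk up from that starting port on conflicts.
val conf = new SparkConf()
  .setAppName("ui-port-example")
  .setMaster("local[*]")
  .set("spark.ui.port", "4050")       // starting port for the web UI
  .set("spark.port.maxRetries", "8")  // how many +1 increments to try

// Setting spark.ui.port to "0" should instead bind the UI to an arbitrary free port.
val sc = new SparkContext(conf)
{code}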

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Spark UI port starts on 4040, and UI port is incremented by 1 for every 
> confliction.
> In our use case, we have some drivers running at the same time, which makes 
> UI port to be assigned to 4045, which is treated to unsafe port for chrome 
> and mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning UI to these ports, or just avoid assigning 
> UI port to 4045 which is too close to default port.
> If we'd like to accept this idea, I'm happy to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11765) Avoid assign UI port between browser unsafe ports (or just 4045: lockd)

2016-06-09 Thread Willy Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322503#comment-15322503
 ] 

Willy Lee commented on SPARK-11765:
---

The problem as I see it is that 4045 is an unused port on most machines; it's 
just that many browsers refuse to connect to it, expecting a malicious site 
trying to evade firewall rules by using the port commonly used for lockd.

Chrome, Safari and Firefox all refused to connect to a perfectly fine port 4045 
on our Spark infrastructure.

> Avoid assign UI port between browser unsafe ports (or just 4045: lockd)
> ---
>
> Key: SPARK-11765
> URL: https://issues.apache.org/jira/browse/SPARK-11765
> Project: Spark
>  Issue Type: Improvement
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Spark UI port starts on 4040, and UI port is incremented by 1 for every 
> confliction.
> In our use case, we have some drivers running at the same time, which makes 
> UI port to be assigned to 4045, which is treated to unsafe port for chrome 
> and mozilla.
> http://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup
> http://www-archive.mozilla.org/projects/netlib/PortBanning.html#portlist
> We would like to avoid assigning UI to these ports, or just avoid assigning 
> UI port to 4045 which is too close to default port.
> If we'd like to accept this idea, I'm happy to work on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-06-09 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322670#comment-15322670
 ] 

Takeshi Yamamuro commented on SPARK-15585:
--

I'm afraid the `sep` option for `csv` overrides the `delimiter` option.
On the other hand, the original description describes the `quote` option and it 
seems the `quote` one is not related to the `sep` one.
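
To make the relationship concrete, a small sketch against the 2.0 DataFrameReader (the path and option values are hypothetical):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-options").master("local[*]").getOrCreate()

// "sep" wins over "delimiter" when both are given; "quote" is an independent option.
val df = spark.read
  .option("sep", ";")          // same underlying setting as .option("delimiter", ";")
  .option("quote", "\"")
  .csv("/path/to/file.csv")    // hypothetical path
{code}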


> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then in 
> CSVOptions.getChar (and JSONOptions, etc) to actually return null if the 
> value is null, rather  than setting it to default value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15772) Improve Scala API docs

2016-06-09 Thread nirav patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322709#comment-15322709
 ] 

nirav patel commented on SPARK-15772:
-

I can't point you to every individual function that needs docs, but the following 
are some classes for which you could add explanations of parameters, plus 
examples where the API is complex (see the sketch after this list for the kind 
of Scaladoc I mean):
PairRDDFunctions
OrderedRDDFunctions
AsyncRDDActions
RangePartitioner


  

> Improve Scala API docs 
> ---
>
> Key: SPARK-15772
> URL: https://issues.apache.org/jira/browse/SPARK-15772
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Reporter: nirav patel
>
> Hi, I just found out that spark python APIs are much more elaborate then 
> scala counterpart. e.g. 
> https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce
> https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=treereduce#pyspark.RDD
> There are clear explanations of parameters; there are examples as well . I 
> think this would be great improvement on Scala API as well. It will make API 
> more friendly in first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15837) PySpark ML Word2Vec should support maxSentenceLength

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322720#comment-15322720
 ] 

Apache Spark commented on SPARK-15837:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/13578
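
For context, a minimal Scala sketch of the parameter (added by SPARK-15793) that the PySpark API is expected to mirror; the column names and values here are arbitrary:

{code}
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMaxSentenceLength(500)  // sentences longer than this (in words) are split into chunks
{code}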

> PySpark ML Word2Vec should support maxSentenceLength
> 
>
> Key: SPARK-15837
> URL: https://issues.apache.org/jira/browse/SPARK-15837
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-15793 adds maxSentenceLength for ML Word2Vec in Scala, we should also 
> add it in Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15837) PySpark ML Word2Vec should support maxSentenceLength

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15837:


Assignee: Apache Spark

> PySpark ML Word2Vec should support maxSentenceLength
> 
>
> Key: SPARK-15837
> URL: https://issues.apache.org/jira/browse/SPARK-15837
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-15793 adds maxSentenceLength for ML Word2Vec in Scala, we should also 
> add it in Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15837) PySpark ML Word2Vec should support maxSentenceLength

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15837:


Assignee: (was: Apache Spark)

> PySpark ML Word2Vec should support maxSentenceLength
> 
>
> Key: SPARK-15837
> URL: https://issues.apache.org/jira/browse/SPARK-15837
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-15793 adds maxSentenceLength for ML Word2Vec in Scala, we should also 
> add it in Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-06-09 Thread Sandeep (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322739#comment-15322739
 ] 

Sandeep commented on SPARK-2984:


I tried with spark.speculation=false as well and it still gives the same 
failures

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
> Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
>   at 
> org.apache.hadoop.hdfs.ser

[jira] [Created] (SPARK-15844) HistoryServer doesn't come up if spark.authenticate = true

2016-06-09 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-15844:
--

 Summary: HistoryServer doesn't come up if spark.authenticate = true
 Key: SPARK-15844
 URL: https://issues.apache.org/jira/browse/SPARK-15844
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
 Environment: cluster with spark.authenticate  = true
Reporter: Steve Loughran
Priority: Minor


If the configuration used to start the history server has 
{{spark.authenticate}} set, then the server doesn't come up: there's no secret 
for the {{SecurityManager}}, even though that secret is used for the secure 
shuffle, which the history server doesn't go anywhere near.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15844) HistoryServer doesn't come up if spark.authenticate = true

2016-06-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322774#comment-15322774
 ] 

Steve Loughran commented on SPARK-15844:


Stack.
{code}
16/05/31 22:46:25 INFO SecurityManager: Changing view acls to: spark
16/05/31 22:46:25 INFO SecurityManager: Changing modify acls to: spark
Exception in thread "main" java.lang.IllegalArgumentException: Error: a secret 
key must be specified via the spark.authenticate.secret config
at 
org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:397)
at org.apache.spark.SecurityManager.(SecurityManager.scala:219)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:250)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}
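
A sketch of a possible workaround (an assumption, not a fix): give the history server's own configuration an explicit secret, since the history server never actually uses it:

{code}
import org.apache.spark.SparkConf

// Supplying any secret stops SecurityManager.generateSecretKey from throwing
// when spark.authenticate is on and we're not running under YARN.
val historyConf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "history-server-placeholder")  // hypothetical value
{code}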

> HistoryServer doesn't come up if spark.authenticate = true
> --
>
> Key: SPARK-15844
> URL: https://issues.apache.org/jira/browse/SPARK-15844
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: cluster with spark.authenticate  = true
>Reporter: Steve Loughran
>Priority: Minor
>
> If the configuration used to start the history server has 
> {{spark.authenticate}} set, then the server doesn't come up: there's no 
> secret for the {{SecurityManager}}. —even though that secret is used for the 
> secure shuffle, which the history server doesn't go anywhere near



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322788#comment-15322788
 ] 

Saisai Shao commented on SPARK-15828:
-

I think this issue is not related to dynamic allocation: any Spark application 
that uses the external shuffle service will hit this issue when the NM is down.

From my understanding this behavior is expected: once a task fails with a fetch 
failure, Spark will rerun the failed task, and if the same task fails more than 
4 times (the default), the job fails.
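
For reference, a minimal sketch of the retry knob in question (the value is arbitrary); raising it only helps with isolated fetch failures, not a systematic decommission of many nodes:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")  // default is 4
{code}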

> YARN is not aware of Spark's External Shuffle Service
> -
>
> Key: SPARK-15828
> URL: https://issues.apache.org/jira/browse/SPARK-15828
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: EMR
>Reporter: Miles Crawford
>
> When using Spark with dynamic allocation, it is common for all containers on a
> particular YARN node to be released.  This is generally okay because of the
> external shuffle service.
> The problem arises when YARN is attempting to downsize the cluster - once all
> containers on the node are gone, YARN will decommission the node, regardless 
> of
> whether the external shuffle service is still required!
> The once the node is shut down, jobs begin failing with messages such as:
> {code}
> 2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
> while beginning fetch of 13 outstanding blocks
> java.io.IOException: Failed to connect to 
> ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> ~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpo

[jira] [Assigned] (SPARK-15844) HistoryServer doesn't come up if spark.authenticate = true

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15844:


Assignee: Apache Spark

> HistoryServer doesn't come up if spark.authenticate = true
> --
>
> Key: SPARK-15844
> URL: https://issues.apache.org/jira/browse/SPARK-15844
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: cluster with spark.authenticate  = true
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> If the configuration used to start the history server has 
> {{spark.authenticate}} set, then the server doesn't come up: there's no 
> secret for the {{SecurityManager}}. —even though that secret is used for the 
> secure shuffle, which the history server doesn't go anywhere near



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15844) HistoryServer doesn't come up if spark.authenticate = true

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322793#comment-15322793
 ] 

Apache Spark commented on SPARK-15844:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/13579

> HistoryServer doesn't come up if spark.authenticate = true
> --
>
> Key: SPARK-15844
> URL: https://issues.apache.org/jira/browse/SPARK-15844
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: cluster with spark.authenticate  = true
>Reporter: Steve Loughran
>Priority: Minor
>
> If the configuration used to start the history server has 
> {{spark.authenticate}} set, then the server doesn't come up: there's no 
> secret for the {{SecurityManager}}. —even though that secret is used for the 
> secure shuffle, which the history server doesn't go anywhere near



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15844) HistoryServer doesn't come up if spark.authenticate = true

2016-06-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15844:


Assignee: (was: Apache Spark)

> HistoryServer doesn't come up if spark.authenticate = true
> --
>
> Key: SPARK-15844
> URL: https://issues.apache.org/jira/browse/SPARK-15844
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: cluster with spark.authenticate  = true
>Reporter: Steve Loughran
>Priority: Minor
>
> If the configuration used to start the history server has 
> {{spark.authenticate}} set, then the server doesn't come up: there's no 
> secret for the {{SecurityManager}}. —even though that secret is used for the 
> secure shuffle, which the history server doesn't go anywhere near



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322800#comment-15322800
 ] 

Saisai Shao commented on SPARK-15801:
-

It has already been mentioned in {{spark-submit --help}}:

{noformat}
YARN-only:
  --driver-cores NUM  Number of cores used by the driver, only in 
cluster mode
  (Default: 1).
  --queue QUEUE_NAME  The YARN queue to submit to (Default: "default").
  --num-executors NUM Number of executors to launch (Default: 2).
{noformat}

It is a YARN-only configuration, so it's not necessary to add a new warning log.
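
As a side note, a hedged sketch (assuming a 1.6-era standalone cluster; host and numbers made up) of the standalone settings that actually control how many executors a worker spawns:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // hypothetical standalone master URL
  .set("spark.executor.cores", "2")       // lets a multi-core worker run several executors
  .set("spark.cores.max", "8")            // caps total cores across the application
{code}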


> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option enables to specify the number of executors (per worker 
> ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15800) Accessing kerberised hdfs from Spark running with Resource Manager

2016-06-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322804#comment-15322804
 ] 

Saisai Shao commented on SPARK-15800:
-

{quote}
Spark is currently running using the Resource Manager, not on YARN.
{quote}

What do you mean by "Resource Manager"? Are you referring to YARN's Resource 
Manager or to Kubernetes?

From my understanding, only Spark running on YARN can support accessing a 
Kerberized Hadoop environment.

> Accessing kerberised hdfs from Spark running with Resource Manager
> --
>
> Key: SPARK-15800
> URL: https://issues.apache.org/jira/browse/SPARK-15800
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Morgan Jones
>Priority: Minor
>
> Hi,
> I've been runing a Spark cluster in kubernetes and i'm trying to access data 
> in HDFS in a kerberised instance of hadoop. It seems that the Spark cluster 
> is unable to pass my tickets to the hadoop instance, is this by design?
> If so how much work would be involved to allow the cluster to communicate 
> with kerberised hdfs and is this currently in the roadmap?
> Edit: Spark is currently running using the Resource Manager, not on YARN.
> Cheers,
> Morgan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-09 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322827#comment-15322827
 ] 

Miles Crawford commented on SPARK-15828:


This could possibly happen without dynamic allocation, but if the containers stay 
around for the full application runtime, the NM is unlikely to be 
decommissioned, so I think dynamic allocation greatly increases the surface of 
the bug.

The issue here is that, during times when very few containers are allocated, 
YARN can remove a very large percentage of the nodes - the application has to 
retry virtually everything, and even with an increased retry count our 
application cannot survive this.  The retry might be effective for isolated 
host removals or failures, but not systematic ones, as when YARN is waiting to 
decommission hosts for a scale-down.

In short, YARN thinks an application is "done" with a host when it has no 
containers.  For Spark with dynamic allocation, this is not true.

> YARN is not aware of Spark's External Shuffle Service
> -
>
> Key: SPARK-15828
> URL: https://issues.apache.org/jira/browse/SPARK-15828
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: EMR
>Reporter: Miles Crawford
>
> When using Spark with dynamic allocation, it is common for all containers on a
> particular YARN node to be released.  This is generally okay because of the
> external shuffle service.
> The problem arises when YARN is attempting to downsize the cluster - once all
> containers on the node are gone, YARN will decommission the node, regardless 
> of
> whether the external shuffle service is still required!
> The once the node is shut down, jobs begin failing with messages such as:
> {code}
> 2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
> while beginning fetch of 13 outstanding blocks
> java.io.IOException: Failed to connect to 
> ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> ~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.

[jira] [Updated] (SPARK-15845) Expose metrics for sub-stage transformations and action

2016-06-09 Thread nirav patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-15845:

Summary: Expose metrics for sub-stage transformations and action   (was: 
Expose metrics for sub-task transformations and action )

> Expose metrics for sub-stage transformations and action 
> 
>
> Key: SPARK-15845
> URL: https://issues.apache.org/jira/browse/SPARK-15845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: nirav patel
>
> Spark optimizes DAG processing by efficiently selecting stage boundaries.  
> This makes spark stage a sequence of multiple transformation and one or zero 
> action. As Aa result stage that spark is currently running can be internally 
> series of (map -> shuffle -> map -> map -> collect) Notice here that it goes 
> pass shuffle dependency and includes the next transformations and actions 
> into same stage. So any task of this stage is essentially doing all those 
> transformation/actions as a Unit and there is no further visibility inside 
> it. Basically network read, populating partitions, compute, shuffle write, 
> shuffle read, compute, writing final partitions to disk ALL happens within 
> one stage! Means all tasks of that stage is basically doing all those 
> operations on single partition as a unit. This takes away huge visibility 
> into users transformation and actions in terms of which one is taking longer 
> or which one is resource bottleneck and which one is failing.
> spark UI just shows its currently running some action stage. If job fails at 
> that point spark UI just says Action failed but in fact it could be any stage 
> in that lazy chain of evaluation. Looking at executor logs gives some 
> insights but that's not always straightforward. 
> I think we need more visibility into what's happening underneath a task 
> (series of spark transformations/actions that comprise a stage) so we can 
> easily troubleshoot as well as find bottlenecks and optimize our DAG.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15845) Expose metrics for sub-task steps

2016-06-09 Thread nirav patel (JIRA)
nirav patel created SPARK-15845:
---

 Summary: Expose metrics for sub-task steps 
 Key: SPARK-15845
 URL: https://issues.apache.org/jira/browse/SPARK-15845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: nirav patel


Spark optimizes DAG processing by efficiently selecting stage boundaries. This 
makes a Spark stage a sequence of multiple transformations and one or zero actions. 
As a result, the stage Spark is currently running can internally be a series such as 
(map -> shuffle -> map -> map -> collect). Notice that it goes past the shuffle 
dependency and includes the next transformations and actions in the same stage. 
So any task of this stage is essentially doing all those transformations/actions 
as a unit, and there is no further visibility inside it. Basically network read, 
populating partitions, compute, shuffle write, shuffle read, compute, and writing 
final partitions to disk ALL happen within one stage! That means every task of that 
stage is doing all those operations on a single partition as a unit. 
This takes away a lot of visibility into the user's transformations and actions in terms 
of which one is taking longer, which one is the resource bottleneck, and which one 
is failing.

The Spark UI just shows that it is currently running some action stage. If the job fails at 
that point, the Spark UI just says the action failed, but in fact it could be any stage 
in that lazy chain of evaluation. Looking at executor logs gives some insight 
but that's not always straightforward. 

I think we need more visibility into what's happening underneath a task (the series 
of Spark transformations/actions that comprise a stage) so we can easily 
troubleshoot as well as find bottlenecks and optimize our DAG.  
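
As an illustration (a hypothetical spark-shell job using the implicit {{sc}}; the input path and thresholds are made up), the kind of chain where several user steps collapse into just two stages:

{code}
// flatMap/map before the shuffle run as one stage; the reduce side plus
// mapValues, filter and collect run as another, so per-step timings and
// failures are not visible separately in the UI.
val counts = sc.textFile("hdfs:///logs")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // shuffle dependency => stage boundary
  .mapValues(_ * 2)
  .filter(_._2 > 10)
counts.collect()
{code}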



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15845) Expose metrics for sub-task transformations and action

2016-06-09 Thread nirav patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-15845:

Summary: Expose metrics for sub-task transformations and action   (was: 
Expose metrics for sub-task steps )

> Expose metrics for sub-task transformations and action 
> ---
>
> Key: SPARK-15845
> URL: https://issues.apache.org/jira/browse/SPARK-15845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: nirav patel
>
> Spark optimizes DAG processing by efficiently selecting stage boundaries.  
> This makes spark stage a sequence of multiple transformation and one or zero 
> action. As Aa result stage that spark is currently running can be internally 
> series of (map -> shuffle -> map -> map -> collect) Notice here that it goes 
> pass shuffle dependency and includes the next transformations and actions 
> into same stage. So any task of this stage is essentially doing all those 
> transformation/actions as a Unit and there is no further visibility inside 
> it. Basically network read, populating partitions, compute, shuffle write, 
> shuffle read, compute, writing final partitions to disk ALL happens within 
> one stage! Means all tasks of that stage is basically doing all those 
> operations on single partition as a unit. This takes away huge visibility 
> into users transformation and actions in terms of which one is taking longer 
> or which one is resource bottleneck and which one is failing.
> spark UI just shows its currently running some action stage. If job fails at 
> that point spark UI just says Action failed but in fact it could be any stage 
> in that lazy chain of evaluation. Looking at executor logs gives some 
> insights but that's not always straightforward. 
> I think we need more visibility into what's happening underneath a task 
> (series of spark transformations/actions that comprise a stage) so we can 
> easily troubleshoot as well as find bottlenecks and optimize our DAG.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15845) Expose metrics for sub-stage transformations and action

2016-06-09 Thread nirav patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-15845:

Description: 
Spark optimizes DAG processing by efficiently selecting stage boundaries. This 
makes a Spark stage a sequence of multiple transformations and one or zero actions. 
As a result, the stage Spark is currently running can internally be a series such as 
(map -> shuffle -> map -> map -> collect). Notice that it goes past the shuffle 
dependency and includes the next transformations and actions in the same stage. 
So any task of this stage is essentially doing all those transformations/actions 
as a unit, and there is no further visibility inside it. Basically network read, 
populating partitions, compute, shuffle write, shuffle read, compute, and writing 
final partitions to disk ALL happen within one stage! That means every task of that 
stage is doing all those operations on a single partition as a unit. 
This takes away a lot of visibility into the user's transformations and actions in terms 
of which one is taking longer, which one is the resource bottleneck, and which one 
is failing.

The Spark UI just shows that it is currently running some action stage. If the job fails at 
that point, the Spark UI just says the action failed, but in fact it could be any stage 
in that lazy chain of evaluation. Looking at executor logs gives some insight 
but that's not always straightforward. 

I think we need more visibility into what's happening underneath a task (the series 
of Spark transformations/actions that comprise a stage) so we can easily 
troubleshoot as well as find bottlenecks and optimize our DAG.  

PS - Had positive feedback about this from a Databricks dev team member at 
Spark Summit. 

  was:
Spark optimizes DAG processing by efficiently selecting stage boundaries. This 
makes a Spark stage a sequence of multiple transformations and one or zero actions. 
As a result, the stage Spark is currently running can internally be a series such as 
(map -> shuffle -> map -> map -> collect). Notice that it goes past the shuffle 
dependency and includes the next transformations and actions in the same stage. 
So any task of this stage is essentially doing all those transformations/actions 
as a unit, and there is no further visibility inside it. Basically network read, 
populating partitions, compute, shuffle write, shuffle read, compute, and writing 
final partitions to disk ALL happen within one stage! That means every task of that 
stage is doing all those operations on a single partition as a unit. 
This takes away a lot of visibility into the user's transformations and actions in terms 
of which one is taking longer, which one is the resource bottleneck, and which one 
is failing.

The Spark UI just shows that it is currently running some action stage. If the job fails at 
that point, the Spark UI just says the action failed, but in fact it could be any stage 
in that lazy chain of evaluation. Looking at executor logs gives some insight 
but that's not always straightforward. 

I think we need more visibility into what's happening underneath a task (the series 
of Spark transformations/actions that comprise a stage) so we can easily 
troubleshoot as well as find bottlenecks and optimize our DAG.  


> Expose metrics for sub-stage transformations and action 
> 
>
> Key: SPARK-15845
> URL: https://issues.apache.org/jira/browse/SPARK-15845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: nirav patel
>
> Spark optimizes DAG processing by efficiently selecting stage boundaries. This 
> makes a Spark stage a sequence of multiple transformations and zero or one 
> action. As a result, the stage Spark is currently running can internally be a 
> series of (map -> shuffle -> map -> map -> collect). Notice that it goes past 
> the shuffle dependency and includes the next transformations and actions in 
> the same stage. So any task of this stage is essentially doing all of those 
> transformations/actions as a unit, and there is no further visibility inside 
> it. Basically, network read, populating partitions, compute, shuffle write, 
> shuffle read, compute, and writing final partitions to disk ALL happen within 
> one stage! That means every task of that stage is doing all of those 
> operations on a single partition as a unit. This takes away significant 
> visibility into the user's transformations and actions in terms of which one 
> is taking longer, which one is the resource bottleneck, and which one is 
> failing.
> The Spark UI just shows that it is currently running some action stage. If the 
> job fails at that point, the UI just says the action failed, but in fact it 
> could be any step in that lazy chain of evaluation. Looking at executor logs 
> gives some insight, but that's not always straightforward. 
> I think we need more visibility into what's happening undern

[jira] [Resolved] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15804.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13555
[https://github.com/apache/spark/pull/13555]

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
> Fix For: 2.0.0
>
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema contains 
> the metadata before saving, but not after saving and loading the dataframe. 
> This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15804:

Assignee: kevin yu

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>Assignee: kevin yu
> Fix For: 2.0.0
>
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema contains 
> the metadata before saving, but not after saving and loading the dataframe. 
> This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15788) PySpark IDFModel missing "idf" property

2016-06-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15788.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13540
[https://github.com/apache/spark/pull/13540]

> PySpark IDFModel missing "idf" property
> ---
>
> Key: SPARK-15788
> URL: https://issues.apache.org/jira/browse/SPARK-15788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Scala {{IDFModel}} has a method {{def idf: Vector = idfModel.idf.asML}} - 
> this should be exposed on the Python side as a property



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables

2016-06-09 Thread Peter Halliday (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322857#comment-15322857
 ] 

Peter Halliday commented on SPARK-14560:


Is this going to be in 1.6.2 too?  Is there a timeline for when it might make 
it into 1.6.2?

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10432 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}} s used by the old RDD api still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However, that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of spark.
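
For reference, a minimal sketch of how the workaround above could be applied 
(the class name, jar name, and threshold value are placeholders; the threshold 
must be tuned per workload):

{code:none}
# Placeholder class/jar names; the threshold value is arbitrary and workload-specific.
spark-submit \
  --class com.example.MyJob \
  --conf spark.shuffle.spill.numElementsForceSpillThreshold=5000000 \
  my-job.jar
{code}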



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To u

[jira] [Commented] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322865#comment-15322865
 ] 

Saisai Shao commented on SPARK-15828:
-

I see, but I don't fully understand your scenario: are the NMs decommissioned 
automatically when there are no containers running on them? My understanding is 
that decommissioning is triggered manually and only occasionally.

This may not be a Spark-only problem; I think some changes are needed on the 
YARN side as well. Basically, your usage of YARN may conflict with dynamic 
allocation, and addressing it requires cooperation between the two sides.
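
For context, the setup under discussion is roughly the standard 
dynamic-allocation-plus-external-shuffle-service configuration on YARN; a 
sketch, with illustrative values:

{code:none}
# Spark side (spark-defaults.conf) -- illustrative
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true

# YARN side (yarn-site.xml): register Spark's shuffle service as an
# auxiliary service on every NodeManager, e.g.
#   yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
#   yarn.nodemanager.aux-services.spark_shuffle.class =
#       org.apache.spark.network.yarn.YarnShuffleService
{code}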

> YARN is not aware of Spark's External Shuffle Service
> -
>
> Key: SPARK-15828
> URL: https://issues.apache.org/jira/browse/SPARK-15828
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: EMR
>Reporter: Miles Crawford
>
> When using Spark with dynamic allocation, it is common for all containers on a
> particular YARN node to be released.  This is generally okay because of the
> external shuffle service.
> The problem arises when YARN is attempting to downsize the cluster - once all
> containers on the node are gone, YARN will decommission the node, regardless 
> of
> whether the external shuffle service is still required!
> Once the node is shut down, jobs begin failing with messages such as:
> {code}
> 2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
> while beginning fetch of 13 outstanding blocks
> java.io.IOException: Failed to connect to 
> ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> ~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.

[jira] [Commented] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-09 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322876#comment-15322876
 ] 

Miles Crawford commented on SPARK-15828:


This is Hadoop's standard means of resizing a cluster.  In this case, we are 
resizing it automatically to fit our workloads. Say two jobs are running and one 
finishes, so we ask to remove hosts; removing those hosts then kills the job 
that is still running.

> YARN is not aware of Spark's External Shuffle Service
> -
>
> Key: SPARK-15828
> URL: https://issues.apache.org/jira/browse/SPARK-15828
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: EMR
>Reporter: Miles Crawford
>
> When using Spark with dynamic allocation, it is common for all containers on a
> particular YARN node to be released.  This is generally okay because of the
> external shuffle service.
> The problem arises when YARN is attempting to downsize the cluster - once all
> containers on the node are gone, YARN will decommission the node, regardless 
> of
> whether the external shuffle service is still required!
> Once the node is shut down, jobs begin failing with messages such as:
> {code}
> 2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
> while beginning fetch of 13 outstanding blocks
> java.io.IOException: Failed to connect to 
> ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> ~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.

[jira] [Commented] (SPARK-15828) YARN is not aware of Spark's External Shuffle Service

2016-06-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322892#comment-15322892
 ] 

Saisai Shao commented on SPARK-15828:
-

OK, I guess you're running on AWS or a similar cloud environment, is that 
right? I guess there's no such requirement when running in an on-premises 
cluster. Anyway, as I mentioned, this requires cooperation between Spark and 
YARN. 

> YARN is not aware of Spark's External Shuffle Service
> -
>
> Key: SPARK-15828
> URL: https://issues.apache.org/jira/browse/SPARK-15828
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
> Environment: EMR
>Reporter: Miles Crawford
>
> When using Spark with dynamic allocation, it is common for all containers on a
> particular YARN node to be released.  This is generally okay because of the
> external shuffle service.
> The problem arises when YARN is attempting to downsize the cluster - once all
> containers on the node are gone, YARN will decommission the node, regardless 
> of
> whether the external shuffle service is still required!
> Once the node is shut down, jobs begin failing with messages such as:
> {code}
> 2016-06-07 18:56:40,016 ERROR o.a.s.n.shuffle.RetryingBlockFetcher: Exception 
> while beginning fetch of 13 outstanding blocks
> java.io.IOException: Failed to connect to 
> ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>  
> ~[d58092b50d2880a1c259cb51c6ed83955f97e34a4b75cedaa8ab00f89a09df50-spark-network-common_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> ~[2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> [2d5c6a1b64d0070faea2e852616885c0110121f4f5c3206cbde88946abce11c3-spark-network-shuffle_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:316)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:263)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:43)
>  
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
> [d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1]
> at org.ap

[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322893#comment-15322893
 ] 

Kay Ousterhout commented on SPARK-14485:


I don't think (a) is especially rare: that's the case anytime data is saved to 
HDFS, or a result is returned -- e.g., from SQL queries  that aggregate a 
result at the driver, rather than creating a result table.

My point for (c) was that it seems to only benefit a small fraction of cases: 
when both (1) the scheduler learned about the lost executor before learning 
about the successful task and (2) there weren't other previous tasks in the 
same stage that ran on the failed executor (in which case the other, previously 
completed tasks won't get re-run until there's a fetch failure in the next 
stage).

I'm also hesitant to add this special logic where we sometimes just ignore 
task-completed messages, because I'm worried about corner cases where this 
could lead to a job hanging because somehow the task never gets completed 
successfully.

Given all of the above, I'd advocate reverting this, and I have submitted a PR 
to do so: https://github.com/apache/spark/pull/13580

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Now, when executor is removed by driver with heartbeats timeout, driver will 
> re-queue the task on this executor and send a kill command to cluster to kill 
> this executor.
> But, in a situation, the running task of this executor is finished and return 
> result to driver before this executor killed by kill command sent by driver. 
> At this situation, driver will accept the task finished event and ignore  
> speculative task and re-queued task. But, as we know, this executor has 
> removed by driver, the result of this finished task can not save in driver 
> because the *BlockManagerId* has also removed from *BlockManagerMaster* by 
> driver. So, the result data of this stage is not complete, and then, it will 
> cause fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> Driver will check the stage's result is not complete, and submit missing 
> task, but this time, the next stage has run because previous stage has finish 
> for its task is all finished although its result is not compl

[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322894#comment-15322894
 ] 

Apache Spark commented on SPARK-14485:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/13580

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Now, when executor is removed by driver with heartbeats timeout, driver will 
> re-queue the task on this executor and send a kill command to cluster to kill 
> this executor.
> But, in a situation, the running task of this executor is finished and return 
> result to driver before this executor killed by kill command sent by driver. 
> At this situation, driver will accept the task finished event and ignore  
> speculative task and re-queued task. But, as we know, this executor has 
> removed by driver, the result of this finished task can not save in driver 
> because the *BlockManagerId* has also removed from *BlockManagerMaster* by 
> driver. So, the result data of this stage is not complete, and then, it will 
> cause fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> Driver will check the stage's result is not complete, and submit missing 
> task, but this time, the next stage has run because previous stage has finish 
> for its task is all finished although its result is not complete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun

[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322912#comment-15322912
 ] 

Yan Chen commented on SPARK-15716:
--

The original problem was seen on the Hortonworks distribution. We also tried 
the community version, and the behavior is the same, but we have already asked 
Hortonworks to investigate the issue. We also found that community version 1.6 
running against YARN 2.7.1 from Hortonworks does not have the memory issue, so 
we are currently using that combination in production. Version 1.4.1 still has 
the issue, though. By the way, I don't know why this issue is closed. 

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, and for most batches it is below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more

[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322920#comment-15322920
 ] 

Yan Chen commented on SPARK-15716:
--

[~srowen] Could I know why this issue is closed?

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, and for most batches it is below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.
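
As a side note, here is a sketch of the kind of diagnostics referred to above, 
using standard JDK tools (the driver pid is a placeholder):

{code:none}
# Trigger a full GC manually (the "red triangles" in the chart above were produced this way)
jcmd <driver-pid> GC.run

# Sample heap occupancy and GC activity every second
jstat -gcutil <driver-pid> 1000

# Dump a live-object histogram to see which classes keep growing
jmap -histo:live <driver-pid>
{code}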



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322912#comment-15322912
 ] 

Yan Chen edited comment on SPARK-15716 at 6/9/16 5:34 PM:
--

The original problem was seen on the Hortonworks distribution. We also tried 
the community version, and the behavior is the same, but we have already asked 
Hortonworks to investigate the issue. We also found that community version 1.6 
running against YARN 2.7.1 from Hortonworks does not have the memory issue, so 
we are currently using that combination in production. Version 1.4.1 still has 
the issue, though. By the way, I don't know why this issue is marked resolved. 


was (Author: yani.chen):
The original problem was seen on the Hortonworks distribution. We also tried 
the community version, and the behavior is the same, but we have already asked 
Hortonworks to investigate the issue. We also found that community version 1.6 
running against YARN 2.7.1 from Hortonworks does not have the memory issue, so 
we are currently using that combination in production. Version 1.4.1 still has 
the issue, though. By the way, I don't know why this issue is closed. 

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+Unl

[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322920#comment-15322920
 ] 

Yan Chen edited comment on SPARK-15716 at 6/9/16 5:34 PM:
--

[~srowen] Could I know why this issue is marked resolved?


was (Author: yani.chen):
[~srowen] Could I know why this issue is closed?

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, and for most batches it is below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h..

[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322930#comment-15322930
 ] 

Yan Chen commented on SPARK-15716:
--

Why is it "not a problem" even if it crashes the streaming process?

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
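
A side note on the foreachRDD call above: the "return null;" matches the older Java overload foreachRDD(Function<R, Void>). Since 2.0.0 is also listed in the affected versions, here is a minimal sketch of the same output step against the Spark 2.x VoidFunction overload; it is illustrative only, not part of the reporter's code, and assumes the usbk stream defined above.

{code:java}
// Sketch for the Spark 2.x Java API, where foreachRDD takes a VoidFunction
// and therefore needs no return value.
usbk.foreachRDD(rdd -> {
  rdd.count();  // force evaluation of the stateful RDD
  System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
});
{code}
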
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> with the "jcmd pid GC.run" command.
> I also did more experiments, described in the second and third comments I posted.
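
For anyone trying to quantify the growth while reproducing this, a small diagnostic sketch (not part of the reporter's code; the class name DriverHeapLogger is hypothetical): the function passed to foreachRDD runs on the driver, so logging the driver's heap usage there once per batch lines up directly with the GC logs enabled by the submit command above.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: call from inside the foreachRDD body to log the driver
// JVM's current heap usage, tagged with the lineage depth of the stateful RDD.
public final class DriverHeapLogger {
  private DriverHeapLogger() {}

  public static void log(int lineageDepth) {
    MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    System.out.println("driver heap used (MB): " + heap.getUsed() / (1024L * 1024L)
        + ", lineage depth: " + lineageDepth);
  }
}
{code}

Inside the existing foreachRDD, after rdd.count(), this could be invoked as DriverHeapLogger.log(rdd.toDebugString().split("\n").length);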



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15716:
--
Comment: was deleted

(was: [~yani.chen] the problem is you've described several different problems, 
and it's not clear what the problem is or whether it is due to Spark. You've 
described an increasing heap size (but no out of memory error), a stuck 
process, a crashed process, and some checkpoint problem. JIRAs aren't for 
open-ended discussion, but as much as possible about a specific isolated 
problem and its proposed resolution.)

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> with the "jcmd pid GC.run" command.
> I also did more experiments, described in the second and third comments I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
