date:20160508

[jira] [Commented] (SPARK-15200) Add documentaion and examples for GaussianMixture

2016-05-08 Thread yuhao yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275500#comment-15275500
 ] 

yuhao yang commented on SPARK-15200:


Hi [~BenFradet], just a reminder, this may be duplicated with 
https://issues.apache.org/jira/browse/SPARK-14434. 
Please close the jira once you confirmed.

> Add documentaion and examples for GaussianMixture
> -
>
> Key: SPARK-15200
> URL: https://issues.apache.org/jira/browse/SPARK-15200
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2016-05-08 Thread Shivaram Venkataraman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12479.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12976
[https://github.com/apache/spark/pull/12976]

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
> Fix For: 2.0.0
>
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to an specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when ran in sparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and relaised the the problem happens because of the  
> bit wise shift below returning NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2016-05-08 Thread Shivaram Venkataraman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12479:
--
Assignee: Sun Rui

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to an specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when ran in sparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and relaised the the problem happens because of the  
> bit wise shift below returning NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15206:


Assignee: Apache Spark

> Add testcases for Distinct Aggregation in Having clause
> ---
>
> Key: SPARK-15206
> URL: https://issues.apache.org/jira/browse/SPARK-15206
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>
> This is the followup jira for https://github.com/apache/spark/pull/12974. We 
> will add test cases for including distinct aggregate function in having 
> clause. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275505#comment-15275505
 ] 

Apache Spark commented on SPARK-15206:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/12984

> Add testcases for Distinct Aggregation in Having clause
> ---
>
> Key: SPARK-15206
> URL: https://issues.apache.org/jira/browse/SPARK-15206
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This is the followup jira for https://github.com/apache/spark/pull/12974. We 
> will add test cases for including distinct aggregate function in having 
> clause. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15206) Add testcases for Distinct Aggregation in Having clause

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15206:


Assignee: (was: Apache Spark)

> Add testcases for Distinct Aggregation in Having clause
> ---
>
> Key: SPARK-15206
> URL: https://issues.apache.org/jira/browse/SPARK-15206
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This is the followup jira for https://github.com/apache/spark/pull/12974. We 
> will add test cases for including distinct aggregate function in having 
> clause. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-15200) Add documentaion and examples for GaussianMixture

2016-05-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15200.
---
Resolution: Duplicate

> Add documentaion and examples for GaussianMixture
> -
>
> Key: SPARK-15200
> URL: https://issues.apache.org/jira/browse/SPARK-15200
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14261) Memory leak in Spark Thrift Server

2016-05-08 Thread Xiaochun Liang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275514#comment-15275514
 ] 

Xiaochun Liang commented on SPARK-14261:


Thanks for taking care of the issue. I will apply the patch and verify the 
changes.

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> MemorySnapshot.PNG
>
>
> I am running Spark Thrift server on Windows Server 2012. The Spark Thrift 
> server is launched as Yarn client mode. Its memory usage is increased 
> gradually with the queries in.  I am wondering there is memory leak in Spark 
> Thrift server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15200) Add documentaion and examples for GaussianMixture

2016-05-08 Thread Benjamin Fradet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275524#comment-15275524
 ] 

Benjamin Fradet commented on SPARK-15200:
-

woops, didnt see it linked to 15101

> Add documentaion and examples for GaussianMixture
> -
>
> Key: SPARK-15200
> URL: https://issues.apache.org/jira/browse/SPARK-15200
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15205) Codegen can compile the same source code more than twice

2016-05-08 Thread Kousuke Saruta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-15205:
---
Summary: Codegen can compile the same source code more than twice  (was: 
Codegen can compile more than twice for the same source code)

> Codegen can compile the same source code more than twice
> 
>
> Key: SPARK-15205
> URL: https://issues.apache.org/jira/browse/SPARK-15205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> Sometimes, we have generated codes they are equal except for comments.
> One example is here.
> {code}
> val df = sc.parallelize(1 to 10).toDF
> df.selectExpr("value + 1").show // query1
> df.selectExpr("value + 2").show // query2
> {code}
> The following code is one of generated code when query1 above is executed.
> {code}
> /* 001 */ 
> /* 002 */ public java.lang.Object generate(Object[] references) {
> /* 003 */   return new SpecificSafeProjection(references);
> /* 004 */ }
> /* 005 */ 
> /* 006 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 007 */   
> /* 008 */   private Object[] references;
> /* 009 */   private MutableRow mutableRow;
> /* 010 */   private Object[] values;
> /* 011 */   private org.apache.spark.sql.types.StructType schema;
> /* 012 */   
> /* 013 */   
> /* 014 */   public SpecificSafeProjection(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 017 */ 
> /* 018 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 019 */   }
> /* 020 */   
> /* 021 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 022 */ InternalRow i = (InternalRow) _i;
> /* 023 */ /* createexternalrow(if (isnull(input[0, int])) null else 
> input[0, int], StructField((value + 1),IntegerType,false)) */
> /* 024 */ values = new Object[1];
> /* 025 */ /* if (isnull(input[0, int])) null else input[0, int] */
> /* 026 */ /* isnull(input[0, int]) */
> /* 027 */ /* input[0, int] */
> /* 028 */ int value3 = i.getInt(0);
> /* 029 */ boolean isNull1 = false;
> /* 030 */ int value1 = -1;
> /* 031 */ if (!false && false) {
> /* 032 */   /* null */
> /* 033 */   final int value4 = -1;
> /* 034 */   isNull1 = true;
> /* 035 */   value1 = value4;
> /* 036 */ } else {
> /* 037 */   /* input[0, int] */
> /* 038 */   int value5 = i.getInt(0);
> /* 039 */   isNull1 = false;
> /* 040 */   value1 = value5;
> /* 041 */ }
> /* 042 */ if (isNull1) {
> /* 043 */   values[0] = null;
> /* 044 */ } else {
> /* 045 */   values[0] = value1;
> /* 046 */ }
> /* 047 */ 
> /* 048 */ final org.apache.spark.sql.Row value = new 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, 
> this.schema);
> /* 049 */ if (false) {
> /* 050 */   mutableRow.setNullAt(0);
> /* 051 */ } else {
> /* 052 */   
> /* 053 */   mutableRow.update(0, value);
> /* 054 */ }
> /* 055 */ 
> /* 056 */ return mutableRow;
> /* 057 */   }
> /* 058 */ }
> /* 059 */ 
> {code}
> On the other hand, the following code is for query2.
> {code}
> /* 001 */ 
> /* 002 */ public java.lang.Object generate(Object[] references) {
> /* 003 */   return new SpecificSafeProjection(references);
> /* 004 */ }
> /* 005 */ 
> /* 006 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 007 */   
> /* 008 */   private Object[] references;
> /* 009 */   private MutableRow mutableRow;
> /* 010 */   private Object[] values;
> /* 011 */   private org.apache.spark.sql.types.StructType schema;
> /* 012 */   
> /* 013 */   
> /* 014 */   public SpecificSafeProjection(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 017 */ 
> /* 018 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 019 */   }
> /* 020 */   
> /* 021 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 022 */ InternalRow i = (InternalRow) _i;
> /* 023 */ /* createexternalrow(if (isnull(input[0, int])) null else 
> input[0, int], StructField((value + 2),IntegerType,false)) */
> /* 024 */ values = new Object[1];
> /* 025 */ /* if (isnull(input[0, int])) null else input[0, int] */
> /* 026 */ /* isnull(input[0, int]) */
> /* 027 */ /* input[0, int] */
> /* 028 */ int value3 = i.getInt(0);
> /* 029 */ boolean isNull1 = false;
> /* 030 */ int value1 = -1;
> /* 031 */ if (!false && false) {
> /* 032 */   /* null */
> /* 0

[jira] [Assigned] (SPARK-15067) YARN executors are launched with fixed perm gen size

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15067:


Assignee: (was: Apache Spark)

> YARN executors are launched with fixed perm gen size
> 
>
> Key: SPARK-15067
> URL: https://issues.apache.org/jira/browse/SPARK-15067
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Renato Falchi Brandão
>Priority: Minor
>
> It is impossible to change the executors max perm gen size using the property 
> "spark.executor.extraJavaOptions" when you are running on YARN.
> When the JVM option "-XX:MaxPermSize" is set through the property 
> "spark.executor.extraJavaOptions", Spark put it properly in the shell command 
> that will start the JVM container but, in the ending of command, it sets 
> again this option using a fixed value of 256m, as you can see in the log I've 
> extracted:
> 2016-04-30 17:20:12 INFO  ExecutorRunnable:58 -
> ===
> YARN executor launch context:
>   env:
> CLASSPATH -> 
> {{PWD}}{{PWD}}/__spark__.jar$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client/*/usr/hdp/current/hadoop-client/lib/*/usr/hdp/current/hadoop-hdfs-client/*/usr/hdp/current/hadoop-hdfs-client/lib/*/usr/hdp/current/hadoop-yarn-client/*/usr/hdp/current/hadoop-yarn-client/lib/*/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/*:/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.jar:/etc/hadoop/conf/secure
> SPARK_LOG_URL_STDERR -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stderr?start=-4096
> SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1456962126505_329993
> SPARK_YARN_CACHE_FILES_FILE_SIZES -> 191719054,166
> SPARK_USER -> h_loadbd
> SPARK_YARN_CACHE_FILES_VISIBILITIES -> PUBLIC,PUBLIC
> SPARK_YARN_MODE -> true
> SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1459806496093,1459808508343
> SPARK_LOG_URL_STDOUT -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stdout?start=-4096
> SPARK_YARN_CACHE_FILES -> 
> hdfs://x/user/datalab/hdp/spark/lib/spark-assembly-1.6.0.2.3.4.1-10-hadoop2.7.1.2.3.4.1-10.jar#__spark__.jar,hdfs://tlvcluster/user/datalab/hdp/spark/conf/hive-site.xml#hive-site.xml
>   command:
> {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m 
> -Xmx6144m '-XX:+PrintGCDetails' '-XX:MaxPermSize=1024M' 
> '-XX:+PrintGCTimeStamps' -Djava.io.tmpdir={{PWD}}/tmp 
> '-Dspark.akka.timeout=30' '-Dspark.driver.port=62875' 
> '-Dspark.rpc.askTimeout=30' '-Dspark.rpc.lookupTimeout=30' 
> -Dspark.yarn.app.container.log.dir= -XX:MaxPermSize=256m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@10.125.81.42:62875 --executor-id 1 --hostname 
> x0668sl.x.br --cores 1 --app-id application_1456962126505_329993 
> --user-class-path file:$PWD/__app__.jar 1> /stdout 2> 
> /stderr
> Analyzing the code is possible to see that all the options set in the 
> property "spark.executor.extraJavaOptions" are enclosed, one by one, in 
> single quotes (ExecutorRunnable.scala:151) before the launcher take the 
> decision if a default value has to be provided or not for the option 
> "-XX:MaxPermSize" (ExecutorRunnable.scala:202).
> This decision is taken examining all the options set and looking for a string 
> starting with the value "-XX:MaxPermSize" (CommandBuilderUtils.java:328). If 
> that value is not found, the default value is set.
> A string option starting without single quote will never be found, then, a 
> default value will always be provided.
> A possible solution is change the source code of CommandBuilderUtils.java in 
> the line 328:
> From-> if (arg.startsWith("-XX:MaxPermSize="))
> To-> if (arg.indexOf("-XX:MaxPermSize=") > -1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15067) YARN executors are launched with fixed perm gen size

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15067:


Assignee: Apache Spark

> YARN executors are launched with fixed perm gen size
> 
>
> Key: SPARK-15067
> URL: https://issues.apache.org/jira/browse/SPARK-15067
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Renato Falchi Brandão
>Assignee: Apache Spark
>Priority: Minor
>
> It is impossible to change the executors max perm gen size using the property 
> "spark.executor.extraJavaOptions" when you are running on YARN.
> When the JVM option "-XX:MaxPermSize" is set through the property 
> "spark.executor.extraJavaOptions", Spark put it properly in the shell command 
> that will start the JVM container but, in the ending of command, it sets 
> again this option using a fixed value of 256m, as you can see in the log I've 
> extracted:
> 2016-04-30 17:20:12 INFO  ExecutorRunnable:58 -
> ===
> YARN executor launch context:
>   env:
> CLASSPATH -> 
> {{PWD}}{{PWD}}/__spark__.jar$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client/*/usr/hdp/current/hadoop-client/lib/*/usr/hdp/current/hadoop-hdfs-client/*/usr/hdp/current/hadoop-hdfs-client/lib/*/usr/hdp/current/hadoop-yarn-client/*/usr/hdp/current/hadoop-yarn-client/lib/*/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/*:/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.jar:/etc/hadoop/conf/secure
> SPARK_LOG_URL_STDERR -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stderr?start=-4096
> SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1456962126505_329993
> SPARK_YARN_CACHE_FILES_FILE_SIZES -> 191719054,166
> SPARK_USER -> h_loadbd
> SPARK_YARN_CACHE_FILES_VISIBILITIES -> PUBLIC,PUBLIC
> SPARK_YARN_MODE -> true
> SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1459806496093,1459808508343
> SPARK_LOG_URL_STDOUT -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stdout?start=-4096
> SPARK_YARN_CACHE_FILES -> 
> hdfs://x/user/datalab/hdp/spark/lib/spark-assembly-1.6.0.2.3.4.1-10-hadoop2.7.1.2.3.4.1-10.jar#__spark__.jar,hdfs://tlvcluster/user/datalab/hdp/spark/conf/hive-site.xml#hive-site.xml
>   command:
> {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m 
> -Xmx6144m '-XX:+PrintGCDetails' '-XX:MaxPermSize=1024M' 
> '-XX:+PrintGCTimeStamps' -Djava.io.tmpdir={{PWD}}/tmp 
> '-Dspark.akka.timeout=30' '-Dspark.driver.port=62875' 
> '-Dspark.rpc.askTimeout=30' '-Dspark.rpc.lookupTimeout=30' 
> -Dspark.yarn.app.container.log.dir= -XX:MaxPermSize=256m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@10.125.81.42:62875 --executor-id 1 --hostname 
> x0668sl.x.br --cores 1 --app-id application_1456962126505_329993 
> --user-class-path file:$PWD/__app__.jar 1> /stdout 2> 
> /stderr
> Analyzing the code is possible to see that all the options set in the 
> property "spark.executor.extraJavaOptions" are enclosed, one by one, in 
> single quotes (ExecutorRunnable.scala:151) before the launcher take the 
> decision if a default value has to be provided or not for the option 
> "-XX:MaxPermSize" (ExecutorRunnable.scala:202).
> This decision is taken examining all the options set and looking for a string 
> starting with the value "-XX:MaxPermSize" (CommandBuilderUtils.java:328). If 
> that value is not found, the default value is set.
> A string option starting without single quote will never be found, then, a 
> default value will always be provided.
> A possible solution is change the source code of CommandBuilderUtils.java in 
> the line 328:
> From-> if (arg.startsWith("-XX:MaxPermSize="))
> To-> if (arg.indexOf("-XX:MaxPermSize=") > -1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15067) YARN executors are launched with fixed perm gen size

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275527#comment-15275527
 ] 

Apache Spark commented on SPARK-15067:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12985

> YARN executors are launched with fixed perm gen size
> 
>
> Key: SPARK-15067
> URL: https://issues.apache.org/jira/browse/SPARK-15067
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Renato Falchi Brandão
>Priority: Minor
>
> It is impossible to change the executors max perm gen size using the property 
> "spark.executor.extraJavaOptions" when you are running on YARN.
> When the JVM option "-XX:MaxPermSize" is set through the property 
> "spark.executor.extraJavaOptions", Spark put it properly in the shell command 
> that will start the JVM container but, in the ending of command, it sets 
> again this option using a fixed value of 256m, as you can see in the log I've 
> extracted:
> 2016-04-30 17:20:12 INFO  ExecutorRunnable:58 -
> ===
> YARN executor launch context:
>   env:
> CLASSPATH -> 
> {{PWD}}{{PWD}}/__spark__.jar$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client/*/usr/hdp/current/hadoop-client/lib/*/usr/hdp/current/hadoop-hdfs-client/*/usr/hdp/current/hadoop-hdfs-client/lib/*/usr/hdp/current/hadoop-yarn-client/*/usr/hdp/current/hadoop-yarn-client/lib/*/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/*:/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.jar:/etc/hadoop/conf/secure
> SPARK_LOG_URL_STDERR -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stderr?start=-4096
> SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1456962126505_329993
> SPARK_YARN_CACHE_FILES_FILE_SIZES -> 191719054,166
> SPARK_USER -> h_loadbd
> SPARK_YARN_CACHE_FILES_VISIBILITIES -> PUBLIC,PUBLIC
> SPARK_YARN_MODE -> true
> SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1459806496093,1459808508343
> SPARK_LOG_URL_STDOUT -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stdout?start=-4096
> SPARK_YARN_CACHE_FILES -> 
> hdfs://x/user/datalab/hdp/spark/lib/spark-assembly-1.6.0.2.3.4.1-10-hadoop2.7.1.2.3.4.1-10.jar#__spark__.jar,hdfs://tlvcluster/user/datalab/hdp/spark/conf/hive-site.xml#hive-site.xml
>   command:
> {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m 
> -Xmx6144m '-XX:+PrintGCDetails' '-XX:MaxPermSize=1024M' 
> '-XX:+PrintGCTimeStamps' -Djava.io.tmpdir={{PWD}}/tmp 
> '-Dspark.akka.timeout=30' '-Dspark.driver.port=62875' 
> '-Dspark.rpc.askTimeout=30' '-Dspark.rpc.lookupTimeout=30' 
> -Dspark.yarn.app.container.log.dir= -XX:MaxPermSize=256m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@10.125.81.42:62875 --executor-id 1 --hostname 
> x0668sl.x.br --cores 1 --app-id application_1456962126505_329993 
> --user-class-path file:$PWD/__app__.jar 1> /stdout 2> 
> /stderr
> Analyzing the code is possible to see that all the options set in the 
> property "spark.executor.extraJavaOptions" are enclosed, one by one, in 
> single quotes (ExecutorRunnable.scala:151) before the launcher take the 
> decision if a default value has to be provided or not for the option 
> "-XX:MaxPermSize" (ExecutorRunnable.scala:202).
> This decision is taken examining all the options set and looking for a string 
> starting with the value "-XX:MaxPermSize" (CommandBuilderUtils.java:328). If 
> that value is not found, the default value is set.
> A string option starting without single quote will never be found, then, a 
> default value will always be provided.
> A possible solution is change the source code of CommandBuilderUtils.java in 
> the line 328:
> From-> if (arg.startsWith("-XX:MaxPermSize="))
> To-> if (arg.indexOf("-XX:MaxPermSize=") > -1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15211) Select features column from LibSVMRelation causes failure

2016-05-08 Thread Liang-Chi Hsieh (JIRA)

Liang-Chi Hsieh created SPARK-15211:
---

 Summary: Select features column from LibSVMRelation causes failure
 Key: SPARK-15211
 URL: https://issues.apache.org/jira/browse/SPARK-15211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh



It will cause failure when trying to load data with LibSVMRelation and select 
features column:

{code}
val df2 = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df2.select("features").show
java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of 
class java.lang.Byte)
createexternalrow(if (isnull(input[0, vector])) null else newInstance(class 
org.apache.spark.mllib.linalg.VectorUDT).deserialize, 
StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15122) TPC-DS Qury 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15122:
--
Assignee: Herman van Hovell

> TPC-DS Qury 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Assignee: Herman van Hovell
>Priority: Critical
> Fix For: 2.0.0
>
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
> ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
> ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
> extra large || ((('i_category = Women) && (('i_color = brown) || 
> ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && 
> (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && 
> (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || 
> ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || 
> ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && 
> ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size 
> = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = 
> Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = 
> Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra 
> large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = 
> papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || 
> ('i_size = small) || 'i_category = Men) && (('i_color = orange) || 
> ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && 
> (('i_size = petite) || ('i_size = large || ((('i_category = Men) && 
> (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || 
> ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large
>:   +- 'UnresolvedRelation `item`, None
>+- 'UnresolvedRelation `item`, Some(i1)
> == Analyzed Logical Plan ==
> i_product_name: string
> GlobalLimit 100
> +- LocalLimit 100
>+- Sort [i_product_name#24 ASC], true
>   +- Distinct
>  +- Project [i_product_name#24]
> +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && 
> (i_manufact_id#16L <= cast((738

[jira] [Updated] (SPARK-15051) Aggregator with DataFrame does not allow Alias

2016-05-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15051:
--
Assignee: kevin yu

> Aggregator with DataFrame does not allow Alias
> --
>
> Key: SPARK-15051
> URL: https://issues.apache.org/jira/browse/SPARK-15051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>Assignee: kevin yu
> Fix For: 2.0.0
>
>
> this works:
> {noformat}
> object SimpleSum extends Aggregator[Row, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Row) = b + a.getInt(1)
>   def merge(b1: Int, b2: Int) = b1 + b2
>   def finish(b: Int) = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v")
> df.groupBy("k").agg(SimpleSum.toColumn).show
> {noformat}
> but it breaks when i try to give the new column a name:
> {noformat}
> df.groupBy("k").agg(SimpleSum.toColumn as "b").show
> {noformat}
> the error is:
> {noformat}
>org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [k#192], [k#192,(SimpleSum(unknown),mode=Complete,isDistinct=false) AS b#200];
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54)
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:270)
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:51)
>at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:51)
>at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:54)
>at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
> {noformat}
> the reason it breaks is because Column.as(alias: String) returns a Column not 
> a TypedColumn, and as a result the method TypedColumn.withInputType does not 
> get called.
> P.S. The whole TypedColumn.withInputType seems actually rather fragile to me. 
> I wish Aggregators simply also kept the input encoder and that whole bit can 
> be removed about dynamically trying to insert the Encoder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15211) Select features column from LibSVMRelation causes failure

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275530#comment-15275530
 ] 

Apache Spark commented on SPARK-15211:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/12986

> Select features column from LibSVMRelation causes failure
> -
>
> Key: SPARK-15211
> URL: https://issues.apache.org/jira/browse/SPARK-15211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> It will cause failure when trying to load data with LibSVMRelation and select 
> features column:
> {code}
> val df2 = 
> spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> df2.select("features").show
> java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of 
> class java.lang.Byte)
> createexternalrow(if (isnull(input[0, vector])) null else newInstance(class 
> org.apache.spark.mllib.linalg.VectorUDT).deserialize, 
> StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15211) Select features column from LibSVMRelation causes failure

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15211:


Assignee: Apache Spark

> Select features column from LibSVMRelation causes failure
> -
>
> Key: SPARK-15211
> URL: https://issues.apache.org/jira/browse/SPARK-15211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> It will cause failure when trying to load data with LibSVMRelation and select 
> features column:
> {code}
> val df2 = 
> spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> df2.select("features").show
> java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of 
> class java.lang.Byte)
> createexternalrow(if (isnull(input[0, vector])) null else newInstance(class 
> org.apache.spark.mllib.linalg.VectorUDT).deserialize, 
> StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15211) Select features column from LibSVMRelation causes failure

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15211:


Assignee: (was: Apache Spark)

> Select features column from LibSVMRelation causes failure
> -
>
> Key: SPARK-15211
> URL: https://issues.apache.org/jira/browse/SPARK-15211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> It will cause failure when trying to load data with LibSVMRelation and select 
> features column:
> {code}
> val df2 = 
> spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> df2.select("features").show
> java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of 
> class java.lang.Byte)
> createexternalrow(if (isnull(input[0, vector])) null else newInstance(class 
> org.apache.spark.mllib.linalg.VectorUDT).deserialize, 
> StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15208) Update spark examples with AccumulatorV2

2016-05-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15208:
--
Fix Version/s: (was: 2.0.0)

> Update spark examples with AccumulatorV2 
> -
>
> Key: SPARK-15208
> URL: https://issues.apache.org/jira/browse/SPARK-15208
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples, Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> Let's update the codes & docs in the example module as well as the related 
> doc module, specifically:
> - [docs] streaming-programming-guide.md
> - [examples] RecoverableNetworkWordCount.scala
> - [examples] JavaRecoverableNetworkWordCount.java
> - [examples] recoverable_network_wordcount.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15212) CVS file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Weichen Xu (JIRA)

Weichen Xu created SPARK-15212:
--

 Summary: CVS file reader when read file with first line schema do 
not filter blank in schema column name
 Key: SPARK-15212
 URL: https://issues.apache.org/jira/browse/SPARK-15212
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.1.0
Reporter: Weichen Xu


for example, run the following code in spark-shell,
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
var reader = sqlContext.read
reader.option("header", true)
var df = reader.csv("file:///diskext/tdata/spark/d1.csv")

when the csv data file contains：
--
col1, col2,col3,col4,col5
1997,Ford,E350,"ac, abs, moon",3000.00



the first line contains schema, the col2 has a blank before it,
then the generated DataFrame's schema column name contains the blank.

This may cause potential problem for example

df.select("col2") 
can't find the column, must use 
df.select(" col2") 

and if register the dataframe as a table, then do query, can't select col2.

df.registerTempTable("tab1");
sqlContext.sql("select col2 from tab1"); //will fail

must add a column name validate when load csv file with schema.









--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15212:
---
Summary: CSV file reader when read file with first line schema do not 
filter blank in schema column name  (was: CVS file reader when read file with 
first line schema do not filter blank in schema column name)

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15213) Unify 'range' usages in PySpark

2016-05-08 Thread zhengruifeng (JIRA)

zhengruifeng created SPARK-15213:


 Summary: Unify 'range' usages in PySpark
 Key: SPARK-15213
 URL: https://issues.apache.org/jira/browse/SPARK-15213
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: zhengruifeng


Most python file directly use range ignoring the different implement between 
python 2 and 3.

1, use {{range}} in pyspark
2, for python2, {{range = xrange}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15213) Unify 'range' usages in PySpark

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15213:


Assignee: (was: Apache Spark)

> Unify 'range' usages in PySpark
> ---
>
> Key: SPARK-15213
> URL: https://issues.apache.org/jira/browse/SPARK-15213
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: zhengruifeng
>
> Most python file directly use range ignoring the different implement between 
> python 2 and 3.
> 1, use {{range}} in pyspark
> 2, for python2, {{range = xrange}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15213) Unify 'range' usages in PySpark

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15213:


Assignee: Apache Spark

> Unify 'range' usages in PySpark
> ---
>
> Key: SPARK-15213
> URL: https://issues.apache.org/jira/browse/SPARK-15213
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> Most python file directly use range ignoring the different implement between 
> python 2 and 3.
> 1, use {{range}} in pyspark
> 2, for python2, {{range = xrange}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15213) Unify 'range' usages in PySpark

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275550#comment-15275550
 ] 

Apache Spark commented on SPARK-15213:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12983

> Unify 'range' usages in PySpark
> ---
>
> Key: SPARK-15213
> URL: https://issues.apache.org/jira/browse/SPARK-15213
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: zhengruifeng
>
> Most python file directly use range ignoring the different implement between 
> python 2 and 3.
> 1, use {{range}} in pyspark
> 2, for python2, {{range = xrange}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275563#comment-15275563
 ] 

Apache Spark commented on SPARK-15212:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/12987

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15212:


Assignee: (was: Apache Spark)

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15212:


Assignee: Apache Spark

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15083) History Server would OOM due to unlimited TaskUIData in some stages

2016-05-08 Thread Zheng Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Tan updated SPARK-15083:
--
Summary: History Server would OOM due to unlimited TaskUIData in some 
stages  (was: JobProgressListener should limit memory usage of tasks in stages.)

> History Server would OOM due to unlimited TaskUIData in some stages
> ---
>
> Key: SPARK-15083
> URL: https://issues.apache.org/jira/browse/SPARK-15083
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
>Reporter: Zheng Tan
> Attachments: Screen Shot 2016-05-01 at 3.50.02 PM.png, Screen Shot 
> 2016-05-01 at 3.51.01 PM.png, Screen Shot 2016-05-01 at 3.51.59 PM.png, 
> Screen Shot 2016-05-01 at 3.55.30 PM.png
>
>
> History Server will load all tasks in a stage, which would cause memory leak 
> if tasks occupy too many memory. 
> In the following example, a single application would consume 1.1G memory of 
> History Sever. 
> I think we should limit tasks memory usages by adding spark.ui.retainedTasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275567#comment-15275567
 ] 

Apache Spark commented on SPARK-14773:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12988

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
>
> There are a few test cases in HiveCompatibilitySuite  for subquery, we should 
> enable them to have better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14773:


Assignee: Herman van Hovell  (was: Apache Spark)

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
>
> There are a few test cases in HiveCompatibilitySuite  for subquery, we should 
> enable them to have better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15179) Enable SQL generation for subqueries

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15179:


Assignee: Apache Spark

> Enable SQL generation for subqueries
> 
>
> Key: SPARK-15179
> URL: https://issues.apache.org/jira/browse/SPARK-15179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> SQL Generation for subqueries is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15179) Enable SQL generation for subqueries

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275568#comment-15275568
 ] 

Apache Spark commented on SPARK-15179:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/12988

> Enable SQL generation for subqueries
> 
>
> Key: SPARK-15179
> URL: https://issues.apache.org/jira/browse/SPARK-15179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> SQL Generation for subqueries is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14773:


Assignee: Apache Spark  (was: Herman van Hovell)

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> There are a few test cases in HiveCompatibilitySuite  for subquery, we should 
> enable them to have better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15179) Enable SQL generation for subqueries

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15179:


Assignee: (was: Apache Spark)

> Enable SQL generation for subqueries
> 
>
> Key: SPARK-15179
> URL: https://issues.apache.org/jira/browse/SPARK-15179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> SQL Generation for subqueries is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15202) add dapplyCollect() method for DataFrame in SparkR

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15202:


Assignee: (was: Apache Spark)

> add dapplyCollect() method for DataFrame in SparkR
> --
>
> Key: SPARK-15202
> URL: https://issues.apache.org/jira/browse/SPARK-15202
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> dapplyCollect() applies an R function on each partition of a SparkDataFrame 
> and collects the result back to R as a data.frame.
> The signature of dapplyCollect() is as follows:
> {code}
>   dapplyCollect(df, function(ldf) {...})
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15202) add dapplyCollect() method for DataFrame in SparkR

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15202:


Assignee: Apache Spark

> add dapplyCollect() method for DataFrame in SparkR
> --
>
> Key: SPARK-15202
> URL: https://issues.apache.org/jira/browse/SPARK-15202
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> dapplyCollect() applies an R function on each partition of a SparkDataFrame 
> and collects the result back to R as a data.frame.
> The signature of dapplyCollect() is as follows:
> {code}
>   dapplyCollect(df, function(ldf) {...})
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15202) add dapplyCollect() method for DataFrame in SparkR

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275570#comment-15275570
 ] 

Apache Spark commented on SPARK-15202:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/12989

> add dapplyCollect() method for DataFrame in SparkR
> --
>
> Key: SPARK-15202
> URL: https://issues.apache.org/jira/browse/SPARK-15202
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> dapplyCollect() applies an R function on each partition of a SparkDataFrame 
> and collects the result back to R as a data.frame.
> The signature of dapplyCollect() is as follows:
> {code}
>   dapplyCollect(df, function(ldf) {...})
> {code}
> R function input: local data.frame from the partition on local node
> R function output: local data.frame



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15214) Enable code generation for Generate

2016-05-08 Thread Herman van Hovell (JIRA)

Herman van Hovell created SPARK-15214:
-

 Summary: Enable code generation for Generate
 Key: SPARK-15214
 URL: https://issues.apache.org/jira/browse/SPARK-15214
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell


{{Generate}} currently does not support code generation. Lets add support for 
CG and for it and its most important generators: {{explode}} and {{json_tuple}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15215) Fix Explain Parsing and Output

2016-05-08 Thread Xiao Li (JIRA)

Xiao Li created SPARK-15215:
---

 Summary: Fix Explain Parsing and Output
 Key: SPARK-15215
 URL: https://issues.apache.org/jira/browse/SPARK-15215
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


This PR is to address a few existing issues in Explain:
- The `Explain` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not 
be 0 or more match. It should 0 or one match.
- The option `LOGICAL` is not supported
- The output of `Explain` contains a weird empty line when the output of 
analyzed plan is empty. For example:
{noformat}
  == Parsed Logical Plan ==
  CreateTable 
CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
  
HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
 false
  
  == Analyzed Logical Plan ==
  
  CreateTable 
CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
  
HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
 false
  
  == Optimized Logical Plan ==
  CreateTable 
CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
  
HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
 false
  ...
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15215) Fix Explain Parsing and Output

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15215:


Assignee: Apache Spark

> Fix Explain Parsing and Output
> --
>
> Key: SPARK-15215
> URL: https://issues.apache.org/jira/browse/SPARK-15215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> This PR is to address a few existing issues in Explain:
> - The `Explain` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not 
> be 0 or more match. It should 0 or one match.
> - The option `LOGICAL` is not supported
> - The output of `Explain` contains a weird empty line when the output of 
> analyzed plan is empty. For example:
> {noformat}
>   == Parsed Logical Plan ==
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   
>   == Analyzed Logical Plan ==
>   
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   
>   == Optimized Logical Plan ==
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15215) Fix Explain Parsing and Output

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15215:


Assignee: (was: Apache Spark)

> Fix Explain Parsing and Output
> --
>
> Key: SPARK-15215
> URL: https://issues.apache.org/jira/browse/SPARK-15215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> This PR is to address a few existing issues in Explain:
> - The `Explain` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not 
> be 0 or more match. It should 0 or one match.
> - The option `LOGICAL` is not supported
> - The output of `Explain` contains a weird empty line when the output of 
> analyzed plan is empty. For example:
> {noformat}
>   == Parsed Logical Plan ==
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   
>   == Analyzed Logical Plan ==
>   
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   
>   == Optimized Logical Plan ==
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15215) Fix Explain Parsing and Output

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275623#comment-15275623
 ] 

Apache Spark commented on SPARK-15215:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12991

> Fix Explain Parsing and Output
> --
>
> Key: SPARK-15215
> URL: https://issues.apache.org/jira/browse/SPARK-15215
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> This PR is to address a few existing issues in Explain:
> - The `Explain` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not 
> be 0 or more match. It should 0 or one match.
> - The option `LOGICAL` is not supported
> - The output of `Explain` contains a weird empty line when the output of 
> analyzed plan is empty. For example:
> {noformat}
>   == Parsed Logical Plan ==
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   
>   == Analyzed Logical Plan ==
>   
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   
>   == Optimized Logical Plan ==
>   CreateTable 
> CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.
>   
> HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None),
>  false
>   ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15216) Add a new Dataset API explainCodegen

2016-05-08 Thread Xiao Li (JIRA)

Xiao Li created SPARK-15216:
---

 Summary: Add a new Dataset API explainCodegen
 Key: SPARK-15216
 URL: https://issues.apache.org/jira/browse/SPARK-15216
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


{noformat}
val ds = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDS().groupByKey(_._1).agg(
  expr("avg(_2)").as[Double],
  ComplexResultAgg.toColumn)
ds.explainCodegen()
{noformat}

Reading codegen output is important for developers to debug. So far, outputting 
codegen results is available in the SQL interface by `EXPLAIN CODEGEN`. 
However, in the Dataset/DataFrame APIs, we face the same issue. We can add a 
new API for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15216) Add a new Dataset API explainCodegen

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15216:


Assignee: (was: Apache Spark)

> Add a new Dataset API explainCodegen
> 
>
> Key: SPARK-15216
> URL: https://issues.apache.org/jira/browse/SPARK-15216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> val ds = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDS().groupByKey(_._1).agg(
>   expr("avg(_2)").as[Double],
>   ComplexResultAgg.toColumn)
> ds.explainCodegen()
> {noformat}
> Reading codegen output is important for developers to debug. So far, 
> outputting codegen results is available in the SQL interface by `EXPLAIN 
> CODEGEN`. However, in the Dataset/DataFrame APIs, we face the same issue. We 
> can add a new API for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15216) Add a new Dataset API explainCodegen

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15216:


Assignee: Apache Spark

> Add a new Dataset API explainCodegen
> 
>
> Key: SPARK-15216
> URL: https://issues.apache.org/jira/browse/SPARK-15216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
> val ds = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDS().groupByKey(_._1).agg(
>   expr("avg(_2)").as[Double],
>   ComplexResultAgg.toColumn)
> ds.explainCodegen()
> {noformat}
> Reading codegen output is important for developers to debug. So far, 
> outputting codegen results is available in the SQL interface by `EXPLAIN 
> CODEGEN`. However, in the Dataset/DataFrame APIs, we face the same issue. We 
> can add a new API for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15216) Add a new Dataset API explainCodegen

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275739#comment-15275739
 ] 

Apache Spark commented on SPARK-15216:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12992

> Add a new Dataset API explainCodegen
> 
>
> Key: SPARK-15216
> URL: https://issues.apache.org/jira/browse/SPARK-15216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> {noformat}
> val ds = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDS().groupByKey(_._1).agg(
>   expr("avg(_2)").as[Double],
>   ComplexResultAgg.toColumn)
> ds.explainCodegen()
> {noformat}
> Reading codegen output is important for developers to debug. So far, 
> outputting codegen results is available in the SQL interface by `EXPLAIN 
> CODEGEN`. However, in the Dataset/DataFrame APIs, we face the same issue. We 
> can add a new API for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14209) Application failure during preemption.

2016-05-08 Thread Miles Crawford (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275780#comment-15275780
 ] 

Miles Crawford commented on SPARK-14209:


This issue looks a lot like https://issues.apache.org/jira/browse/SPARK-7703 - 
but that issue is marked as resolved...

> Application failure during preemption.
> --
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: Spark on YARN
>Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>   ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15083) History Server would OOM due to unlimited TaskUIData in some stages

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15083:


Assignee: Apache Spark

> History Server would OOM due to unlimited TaskUIData in some stages
> ---
>
> Key: SPARK-15083
> URL: https://issues.apache.org/jira/browse/SPARK-15083
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
>Reporter: Zheng Tan
>Assignee: Apache Spark
> Attachments: Screen Shot 2016-05-01 at 3.50.02 PM.png, Screen Shot 
> 2016-05-01 at 3.51.01 PM.png, Screen Shot 2016-05-01 at 3.51.59 PM.png, 
> Screen Shot 2016-05-01 at 3.55.30 PM.png
>
>
> History Server will load all tasks in a stage, which would cause memory leak 
> if tasks occupy too many memory. 
> In the following example, a single application would consume 1.1G memory of 
> History Sever. 
> I think we should limit tasks memory usages by adding spark.ui.retainedTasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15083) History Server would OOM due to unlimited TaskUIData in some stages

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15083:


Assignee: (was: Apache Spark)

> History Server would OOM due to unlimited TaskUIData in some stages
> ---
>
> Key: SPARK-15083
> URL: https://issues.apache.org/jira/browse/SPARK-15083
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
>Reporter: Zheng Tan
> Attachments: Screen Shot 2016-05-01 at 3.50.02 PM.png, Screen Shot 
> 2016-05-01 at 3.51.01 PM.png, Screen Shot 2016-05-01 at 3.51.59 PM.png, 
> Screen Shot 2016-05-01 at 3.55.30 PM.png
>
>
> History Server will load all tasks in a stage, which would cause memory leak 
> if tasks occupy too many memory. 
> In the following example, a single application would consume 1.1G memory of 
> History Sever. 
> I think we should limit tasks memory usages by adding spark.ui.retainedTasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-15212:
-
Priority: Minor  (was: Major)

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-15212:
-
Affects Version/s: (was: 1.6.2)
   (was: 1.6.1)

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275801#comment-15275801
 ] 

Hyukjin Kwon commented on SPARK-15212:
--

Since Spark does not have CSV data source as a internal data source for 1.6.x, 
I took out the affected versions

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name

2016-05-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275802#comment-15275802
 ] 

Hyukjin Kwon commented on SPARK-15212:
--

Since there is a workaroud with `withColumnRenamed()` which can be "easily 
worked around" , I lowered the priority according to 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.

> CSV file reader when read file with first line schema do not filter blank in 
> schema column name
> ---
>
> Key: SPARK-15212
> URL: https://issues.apache.org/jira/browse/SPARK-15212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> for example, run the following code in spark-shell,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
> var reader = sqlContext.read
> reader.option("header", true)
> var df = reader.csv("file:///diskext/tdata/spark/d1.csv")
> when the csv data file contains：
> --
> col1, col2,col3,col4,col5
> 1997,Ford,E350,"ac, abs, moon",3000.00
> 
> 
> the first line contains schema, the col2 has a blank before it,
> then the generated DataFrame's schema column name contains the blank.
> This may cause potential problem for example
> df.select("col2") 
> can't find the column, must use 
> df.select(" col2") 
> and if register the dataframe as a table, then do query, can't select col2.
> df.registerTempTable("tab1");
> sqlContext.sql("select col2 from tab1"); //will fail
> must add a column name validate when load csv file with schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13946) PySpark DataFrames allows you to silently use aggregate expressions derived from different table expressions

2016-05-08 Thread Niranjan Molkeri` (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275805#comment-15275805
 ] 

Niranjan Molkeri` commented on SPARK-13946:
---

I ran the following code with the improvement. 
{noformat}
import numpy as np
import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

sc = SparkContext(appName="fooAPP")
sqlContext = SQLContext(sc)



df = pd.DataFrame({'foo': np.random.randn(100),'bar': 
np.random.randn(100)})

sdf = sqlContext.createDataFrame(df)

sdf2 = sdf[sdf.bar > 0]

sdf.agg(F.count(sdf2.foo)).show()
{noformat}

I also got this output

{noformat}
+--+
|count(foo)|
+--+
|   100|
+--+

{noformat}

Can you tell me why the out put should be a exception instead the above output?
May be this me be a issue of some versions. Can you tell me which version of 
spark are you running?

> PySpark DataFrames allows you to silently use aggregate expressions derived 
> from different table expressions
> 
>
> Key: SPARK-13946
> URL: https://issues.apache.org/jira/browse/SPARK-13946
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Wes McKinney
>
> In my opinion, this code should raise an exception rather than silently 
> discarding the predicate:
> {code}
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': np.random.randn(100),
>'bar': np.random.randn(100)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sdf[sdf.bar > 0]
> sdf.agg(F.count(sdf2.foo)).show()
> +--+
> |count(foo)|
> +--+
> |   100|
> +--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-13946) PySpark DataFrames allows you to silently use aggregate expressions derived from different table expressions

2016-05-08 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275807#comment-15275807
 ] 

Wes McKinney commented on SPARK-13946:
--

The expression {{F.count(sdf2.foo)}} derives from a different logical table 
({{sdf2}}) than {{sdf}}. In my opinion this should result in an analysis error. 

> PySpark DataFrames allows you to silently use aggregate expressions derived 
> from different table expressions
> 
>
> Key: SPARK-13946
> URL: https://issues.apache.org/jira/browse/SPARK-13946
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Wes McKinney
>
> In my opinion, this code should raise an exception rather than silently 
> discarding the predicate:
> {code}
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': np.random.randn(100),
>'bar': np.random.randn(100)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sdf[sdf.bar > 0]
> sdf.agg(F.count(sdf2.foo)).show()
> +--+
> |count(foo)|
> +--+
> |   100|
> +--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15217) Always Case Insensitive in HiveSessionState

2016-05-08 Thread Xiao Li (JIRA)

Xiao Li created SPARK-15217:
---

 Summary: Always Case Insensitive in HiveSessionState
 Key: SPARK-15217
 URL: https://issues.apache.org/jira/browse/SPARK-15217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li
Priority: Blocker


In a `HiveSessionState`, which is a given `SparkSession` backed by Hive, the 
analysis should not be case sensitive because the underlying Hive Metastore is 
case insensitive. 

For example, 
{noformat}
CREATE TABLE tab1 (C1 int);
SELECT C1 FROM tab1
{noformat}
In the current implementation, we will get the following error because the 
column name is always stored in lower case. 
{noformat}
cannot resolve '`C1`' given input columns: [c1]; line 1 pos 7
org.apache.spark.sql.AnalysisException: cannot resolve '`C1`' given input 
columns: [c1]; line 1 pos 7
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15217) Always Case Insensitive in HiveSessionState

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15217:


Assignee: Apache Spark

> Always Case Insensitive in HiveSessionState
> ---
>
> Key: SPARK-15217
> URL: https://issues.apache.org/jira/browse/SPARK-15217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Blocker
>
> In a `HiveSessionState`, which is a given `SparkSession` backed by Hive, the 
> analysis should not be case sensitive because the underlying Hive Metastore 
> is case insensitive. 
> For example, 
> {noformat}
> CREATE TABLE tab1 (C1 int);
> SELECT C1 FROM tab1
> {noformat}
> In the current implementation, we will get the following error because the 
> column name is always stored in lower case. 
> {noformat}
> cannot resolve '`C1`' given input columns: [c1]; line 1 pos 7
> org.apache.spark.sql.AnalysisException: cannot resolve '`C1`' given input 
> columns: [c1]; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15217) Always Case Insensitive in HiveSessionState

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15217:


Assignee: (was: Apache Spark)

> Always Case Insensitive in HiveSessionState
> ---
>
> Key: SPARK-15217
> URL: https://issues.apache.org/jira/browse/SPARK-15217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> In a `HiveSessionState`, which is a given `SparkSession` backed by Hive, the 
> analysis should not be case sensitive because the underlying Hive Metastore 
> is case insensitive. 
> For example, 
> {noformat}
> CREATE TABLE tab1 (C1 int);
> SELECT C1 FROM tab1
> {noformat}
> In the current implementation, we will get the following error because the 
> column name is always stored in lower case. 
> {noformat}
> cannot resolve '`C1`' given input columns: [c1]; line 1 pos 7
> org.apache.spark.sql.AnalysisException: cannot resolve '`C1`' given input 
> columns: [c1]; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15217) Always Case Insensitive in HiveSessionState

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275828#comment-15275828
 ] 

Apache Spark commented on SPARK-15217:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12993

> Always Case Insensitive in HiveSessionState
> ---
>
> Key: SPARK-15217
> URL: https://issues.apache.org/jira/browse/SPARK-15217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> In a `HiveSessionState`, which is a given `SparkSession` backed by Hive, the 
> analysis should not be case sensitive because the underlying Hive Metastore 
> is case insensitive. 
> For example, 
> {noformat}
> CREATE TABLE tab1 (C1 int);
> SELECT C1 FROM tab1
> {noformat}
> In the current implementation, we will get the following error because the 
> column name is always stored in lower case. 
> {noformat}
> cannot resolve '`C1`' given input columns: [c1]; line 1 pos 7
> org.apache.spark.sql.AnalysisException: cannot resolve '`C1`' given input 
> columns: [c1]; line 1 pos 7
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14698) CREATE FUNCTION cloud not add function to hive metastore

2016-05-08 Thread poseidon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275854#comment-15275854
 ] 

poseidon commented on SPARK-14698:
--

spark 1.6.1
hive 1.2.1 
mysql 5.6 

just start thrift-server , and create udf as normal in hive . 

you can replicate this issue.

I've fixed this issue already. 
add to udf to metastore is just the fist step , you have to fix lookup udf in 
metastore when parse sql as well .

> CREATE FUNCTION cloud not add function to hive metastore
> 
>
> Key: SPARK-14698
> URL: https://issues.apache.org/jira/browse/SPARK-14698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: spark1.6.1
>Reporter: poseidon
>  Labels: easyfix
>
> build spark 1.6.1 , and run it with 1.2.1 hive version,config mysql as 
> metastore server. 
> Start a thrift server , then in beeline , try to CREATE FUNCTION as HIVE SQL 
> UDF. 
> find out , can not add this FUNCTION to mysql metastore,but the function 
> usage goes well.
> if you try to add it again , thrift server throw a alread Exist Exception.
> [SPARK-10151][SQL] Support invocation of hive macro 
> add a if condition when runSqlHive, which will exec create function in 
> hiveexec. caused this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-14963) YarnShuffleService should use YARN getRecoveryPath() for leveldb location

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14963:


Assignee: (was: Apache Spark)

> YarnShuffleService should use YARN getRecoveryPath() for leveldb location
> -
>
> Key: SPARK-14963
> URL: https://issues.apache.org/jira/browse/SPARK-14963
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>
> The YarnShuffleService, currently just picks a directly in the yarn local 
> dirs to store the leveldb file.  YARN added an interface in hadoop 2.5 
> getRecoverPath() to get the location where it should be storing this.
> We should change to use getRecoveryPath(). This does mean we will have to use 
> reflection or similar to check for its existence though since it doesn't 
> exist before hadoop 2.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14963) YarnShuffleService should use YARN getRecoveryPath() for leveldb location

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275865#comment-15275865
 ] 

Apache Spark commented on SPARK-14963:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/12994

> YarnShuffleService should use YARN getRecoveryPath() for leveldb location
> -
>
> Key: SPARK-14963
> URL: https://issues.apache.org/jira/browse/SPARK-14963
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>
> The YarnShuffleService, currently just picks a directly in the yarn local 
> dirs to store the leveldb file.  YARN added an interface in hadoop 2.5 
> getRecoverPath() to get the location where it should be storing this.
> We should change to use getRecoveryPath(). This does mean we will have to use 
> reflection or similar to check for its existence though since it doesn't 
> exist before hadoop 2.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-14963) YarnShuffleService should use YARN getRecoveryPath() for leveldb location

2016-05-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14963:


Assignee: Apache Spark

> YarnShuffleService should use YARN getRecoveryPath() for leveldb location
> -
>
> Key: SPARK-14963
> URL: https://issues.apache.org/jira/browse/SPARK-14963
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> The YarnShuffleService, currently just picks a directly in the yarn local 
> dirs to store the leveldb file.  YARN added an interface in hadoop 2.5 
> getRecoverPath() to get the location where it should be storing this.
> We should change to use getRecoveryPath(). This does mean we will have to use 
> reflection or similar to check for its existence though since it doesn't 
> exist before hadoop 2.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive

2016-05-08 Thread Sagar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275874#comment-15275874
 ] 

Sagar commented on SPARK-15032:
---

At the time of JDBC creation, we use thrift server executioHive, so what you 
are proposing is to create a new session of executionHive right? 

> When we create a new JDBC session, we may need to create a new session of 
> executionHive
> ---
>
> Key: SPARK-15032
> URL: https://issues.apache.org/jira/browse/SPARK-15032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, we only use executionHive in thriftserver. When we create a new 
> jdbc session, we probably need to create a new session of executionHive. I am 
> not sure what will break if we leave the code as is. But, I feel it will be 
> safer to create a new session of executionHive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15209) Web UI's timeline visualizations fails to render if descriptions contain single quotes

2016-05-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275908#comment-15275908
 ] 

Apache Spark commented on SPARK-15209:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12995

> Web UI's timeline visualizations fails to render if descriptions contain 
> single quotes
> --
>
> Key: SPARK-15209
> URL: https://issues.apache.org/jira/browse/SPARK-15209
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> If a Spark job's job description contains a single quote (') then the driver 
> UI's job event timeline will fail to render due to Javascript errors. To 
> reproduce these symptoms, run
> {code}
> sc.setJobDescription("double quote: \" ")
> sc.parallelize(1 to 10).count()
> sc.setJobDescription("single quote: ' ")
> sc.parallelize(1 to 10).count() 
> {code}
> and browse to the driver UI. This will currently result in an "Uncaught 
> SyntaxError" because the single quote is not escaped and ends up closing a 
> Javascript string literal too early.
> I think that a simple fix may be to change the relevant JS to use double 
> quotes and then to use the existing XML escaping logic to escape the string's 
> contents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-15185) InMemoryCatalog : Silent Removal of an Existent Table/Function/Partitions by Rename

2016-05-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15185.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12960
[https://github.com/apache/spark/pull/12960]

> InMemoryCatalog : Silent Removal of an Existent Table/Function/Partitions by 
> Rename
> ---
>
> Key: SPARK-15185
> URL: https://issues.apache.org/jira/browse/SPARK-15185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> So far, in the implementation of InMemoryCatalog, we do not check if the 
> new/destination table/function/partition exists or not. Thus, we could just 
> silently remove the existent table/function/partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15185) InMemoryCatalog : Silent Removal of an Existent Table/Function/Partitions by Rename

2016-05-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15185:

Assignee: Xiao Li

> InMemoryCatalog : Silent Removal of an Existent Table/Function/Partitions by 
> Rename
> ---
>
> Key: SPARK-15185
> URL: https://issues.apache.org/jira/browse/SPARK-15185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> So far, in the implementation of InMemoryCatalog, we do not check if the 
> new/destination table/function/partition exists or not. Thus, we could just 
> silently remove the existent table/function/partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-15184) Silent Removal of an Existent Temp Table by Table Rename

2016-05-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15184.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12959
[https://github.com/apache/spark/pull/12959]

> Silent Removal of an Existent Temp Table by Table Rename 
> -
>
> Key: SPARK-15184
> URL: https://issues.apache.org/jira/browse/SPARK-15184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> Currently, if we rename a temp table `Tab1` to another existent temp table 
> `Tab2`. `Tab2` will be silently removed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15184) Silent Removal of an Existent Temp Table by Table Rename

2016-05-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15184:

Assignee: Xiao Li

> Silent Removal of an Existent Temp Table by Table Rename 
> -
>
> Key: SPARK-15184
> URL: https://issues.apache.org/jira/browse/SPARK-15184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> Currently, if we rename a temp table `Tab1` to another existent temp table 
> `Tab2`. `Tab2` will be silently removed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-13902) Make DAGScheduler not to create duplicate stage.

2016-05-08 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-13902:
--
Description: 
{{DAGScheduler}} sometimes generate incorrect stage graph.

Suppose you have the following DAG (please see this in monospaced font):

{noformat}
[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
\/
  <-
{noformat}

Note: [] means an RDD, () means a shuffle dependency.

Here, RDD {{B}} has a shuffle dependency on RDD {{A}}, and RDD {{C}} has 
shuffle dependency on both {{B}} and {{A}}. The shuffle dependency IDs are 
numbers in the {{DAGScheduler}}, but to make the example easier to understand, 
let's call the shuffled data from {{A}} shuffle dependency ID {{s_A}} and the 
shuffled data from {{B}} shuffle dependency ID {{s_B}}.
The {{getAncestorShuffleDependencies}} method in {{DAGScheduler}} (incorrectly) 
does not check for duplicates when it's adding ShuffleDependencies to the 
parents data structure, so for this DAG, when 
{{getAncestorShuffleDependencies}} gets called on {{C}} (previous of the final 
RDD), {{getAncestorShuffleDependencies}} will return {{s_A}}, {{s_A}}, {{s_B}} 
({{s_A}} gets added twice: once when the method "visit"s RDD {{C}}, and once 
when the method "visit"s RDD {{B}}). This is problematic because this line of 
code: 
https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289
 then generates a new shuffle stage for each dependency returned by 
{{getAncestorShuffleDependencies}}, resulting in duplicate map stages that 
compute the map output from RDD A.

As a result, {{DAGScheduler}} generates the following stages and their parents 
for each shuffle:

|  | stage | parents |
| s_A | ShuffleMapStage 2 | List() |
| s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
| \- | ResultStage 4 | List(ShuffleMapStage 3) |

The stage for {{s_A}} should be {{ShuffleMapStage 0}}, but the stage for 
{{s_A}} is generated twice as {{ShuffleMapStage 2}} and {{ShuffleMapStage 0}} 
is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMap Stage1}} 
keeps referring the _old_ stage {{ShuffleMapStage 0}}.


  was:
{{DAGScheduler}} sometimes generate incorrect stage graph.
Some stages are generated for the same shuffleId twice or more and they are 
referenced by the child stages because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see this 
in {{monospaced}} font):

{noformat}
  <
/   \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
   \   /
 <
{noformat}

Note: \[\] means an RDD, () means a shuffle dependency.

{{DAGScheduler}} generates the following stages and their parents for each 
shuffle:

|  | stage | parents |
| (1) | ShuffleMapStage 2 | List() |
| (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
| (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \- | ResultStage 6 | List(ShuffleMapStage 5) |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and 
{{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
{{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}.


Summary: Make DAGScheduler not to create duplicate stage.  (was: Make 
DAGScheduler.getAncestorShuffleDependencies() return in topological order to 
ensure building ancestor stages first.)

> Make DAGScheduler not to create duplicate stage.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generate incorrect stage graph.
> Suppose you have the following DAG (please see this in monospaced font):
> {noformat}
> [A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
> \/
>   <-
> {noformat}
> Note: [] means an RDD, () means a shuffle dependency.
> Here, RDD {{B}} has a shuffle dependency on RDD {{A}}, and RDD {{C}} has 
> shuffle dependency on both {{B}} and {{A}}. The shuffle dependency IDs are 
> numbers in the {{DAGScheduler}}, but to make the example easier to 
> understand, let's call the shuffled data from {{A}} shuffle dependency ID 
> {{s_A}} and the shuffled data from {{B}} shu

[jira] [Updated] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2016-05-08 Thread Devaraj K (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated SPARK-15142:
--
Attachment: 
spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-13902) Make DAGScheduler not to create duplicate stage.

2016-05-08 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-13902:
--
Description: 
{{DAGScheduler}} sometimes generate incorrect stage graph.

Suppose you have the following DAG (please see this in monospaced font):

{noformat}
[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
\/
  <-
{noformat}

Note: [] means an RDD, () means a shuffle dependency.

Here, RDD {{B}} has a shuffle dependency on RDD {{A}}, and RDD {{C}} has 
shuffle dependency on both {{B}} and {{A}}. The shuffle dependency IDs are 
numbers in the {{DAGScheduler}}, but to make the example easier to understand, 
let's call the shuffled data from {{A}} shuffle dependency ID {{s_A}} and the 
shuffled data from {{B}} shuffle dependency ID {{s_B}}.
The {{getAncestorShuffleDependencies}} method in {{DAGScheduler}} (incorrectly) 
does not check for duplicates when it's adding ShuffleDependencies to the 
parents data structure, so for this DAG, when 
{{getAncestorShuffleDependencies}} gets called on {{C}} (previous of the final 
RDD), {{getAncestorShuffleDependencies}} will return {{s_A}}, {{s_B}}, {{s_A}} 
({{s_A}} gets added twice: once when the method "visit"s RDD {{C}}, and once 
when the method "visit"s RDD {{B}}). This is problematic because this line of 
code: 
https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289
 then generates a new shuffle stage for each dependency returned by 
{{getAncestorShuffleDependencies}}, resulting in duplicate map stages that 
compute the map output from RDD A.

As a result, {{DAGScheduler}} generates the following stages and their parents 
for each shuffle:

|  | stage | parents |
| s_A | ShuffleMapStage 2 | List() |
| s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
| \- | ResultStage 4 | List(ShuffleMapStage 3) |

The stage for {{s_A}} should be {{ShuffleMapStage 0}}, but the stage for 
{{s_A}} is generated twice as {{ShuffleMapStage 2}} and {{ShuffleMapStage 0}} 
is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMap Stage1}} 
keeps referring the _old_ stage {{ShuffleMapStage 0}}.


  was:
{{DAGScheduler}} sometimes generate incorrect stage graph.

Suppose you have the following DAG (please see this in monospaced font):

{noformat}
[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
\/
  <-
{noformat}

Note: [] means an RDD, () means a shuffle dependency.

Here, RDD {{B}} has a shuffle dependency on RDD {{A}}, and RDD {{C}} has 
shuffle dependency on both {{B}} and {{A}}. The shuffle dependency IDs are 
numbers in the {{DAGScheduler}}, but to make the example easier to understand, 
let's call the shuffled data from {{A}} shuffle dependency ID {{s_A}} and the 
shuffled data from {{B}} shuffle dependency ID {{s_B}}.
The {{getAncestorShuffleDependencies}} method in {{DAGScheduler}} (incorrectly) 
does not check for duplicates when it's adding ShuffleDependencies to the 
parents data structure, so for this DAG, when 
{{getAncestorShuffleDependencies}} gets called on {{C}} (previous of the final 
RDD), {{getAncestorShuffleDependencies}} will return {{s_A}}, {{s_A}}, {{s_B}} 
({{s_A}} gets added twice: once when the method "visit"s RDD {{C}}, and once 
when the method "visit"s RDD {{B}}). This is problematic because this line of 
code: 
https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289
 then generates a new shuffle stage for each dependency returned by 
{{getAncestorShuffleDependencies}}, resulting in duplicate map stages that 
compute the map output from RDD A.

As a result, {{DAGScheduler}} generates the following stages and their parents 
for each shuffle:

|  | stage | parents |
| s_A | ShuffleMapStage 2 | List() |
| s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
| \- | ResultStage 4 | List(ShuffleMapStage 3) |

The stage for {{s_A}} should be {{ShuffleMapStage 0}}, but the stage for 
{{s_A}} is generated twice as {{ShuffleMapStage 2}} and {{ShuffleMapStage 0}} 
is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMap Stage1}} 
keeps referring the _old_ stage {{ShuffleMapStage 0}}.



> Make DAGScheduler not to create duplicate stage.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generate incorrect stage graph.
> Suppose you have the following DAG (please see this in monospaced font):
> {noformat}
> [A] <--(s_A)--

[jira] [Commented] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2016-05-08 Thread Devaraj K (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275978#comment-15275978
 ] 

Devaraj K commented on SPARK-15142:
---

bq. Can you include the dispatcher logs?

I have attached the dispatcher logs, but I don't see anything useful in those.

bq. Does restarting the dispatcher fix the problem?
Yes, It works fine after restarting the dispatcher.

I suspect the dispatcher is loosing the connection with the mesos master after 
the mesos master restart and stops receiving the resource offerings. I think 
the dispatcher may need to re-register with the mesos master for connection 
loss. I will try creating a PR to fix this issue.

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones

2016-05-08 Thread Vijay Parmar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275987#comment-15275987
 ] 

Vijay Parmar commented on SPARK-14057:
--

I have few suggestions to make here after looking into the issue along with 
referring google and other sources:-

1. We can make use of the built-in java.time.package which is available in Java 
8 and higher versions 
(http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/Date.java).
In short, here a new instance would be created to adjust the Time-Zone.
   
   In java.util.date package the class in most of the cases ignores the 
Time-Zone.

  We could try implementing this package.


2.  This code snippet can be handy :-

ZoneId zoneLondon = ZoneId.of("London"); 
ZonedDateTime nowLondon = ZonedDateTime.now ( zoneLondon );

ZoneId zoneSingapore = ZoneId.of("Singapore"); 
ZonedDateTime nowSingapore = nowLondon.withZoneSameInstant( zoneSingapore );
ZonedDateTime nowUTC = nowLondon.withZoneSameInstant( ZoneOffset.UTC );

3. We need to look into the SQL side code also To have an understanding how the 
Time is getting captured and stored once it is received from this end.

I will still keep on looking iinto the issue and will update you. Meanwhile, I 
also wait for your comment(s) on my suggestions.  




> sql time stamps do not respect time zones
> -
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> we have time stamp data. The time stamp data is UTC how ever when we load the 
> data into spark data frames, the system assume the time stamps are in the 
> local time zone. This causes problems for our data scientists. Often they 
> pull data from our data center into their local macs. The data centers run 
> UTC. There computers are typically in PST or EST.
> It is possible to hack around this problem
> This cause a lot of errors in their analysis
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'

2016-05-08 Thread Adam Cecile (JIRA)

Adam Cecile created SPARK-15218:
---

 Summary: Error: Could not find or load main class 
org.apache.spark.launcher.Main when run from a directory containing colon ':'
 Key: SPARK-15218
 URL: https://issues.apache.org/jira/browse/SPARK-15218
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell, Spark Submit
Affects Versions: 1.6.1
Reporter: Adam Cecile


{noformat}
 mkdir /tmp/qwe:rtz
 cd /tmp/qwe:rtz
wget 
http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
tar xvzf spark-1.6.1-bin-without-hadoop.tgz 
cd spark-1.6.1-bin-without-hadoop/
bin/spark-submit
{noformat}

Returns "Error: Could not find or load main class 
org.apache.spark.launcher.Main".

That would not be such an issue if Mesos executor did not have colon in the 
generated paths. It means withtout hacking (define relative SPARK_HOME path by 
myself) there's no way to run a spark-job insode a mesos job container...

Best regards, Adam.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'

2016-05-08 Thread Adam Cecile (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Cecile updated SPARK-15218:

Description: 
{noformat}
mkdir /tmp/qwe:rtz
cd /tmp/qwe:rtz
wget 
http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
tar xvzf spark-1.6.1-bin-without-hadoop.tgz 
cd spark-1.6.1-bin-without-hadoop/
bin/spark-submit
{noformat}

Returns "Error: Could not find or load main class 
org.apache.spark.launcher.Main".

That would not be such an issue if Mesos executor did not have colon in the 
generated paths. It means withtout hacking (define relative SPARK_HOME path by 
myself) there's no way to run a spark-job insode a mesos job container...

Best regards, Adam.


  was:
{noformat}
 mkdir /tmp/qwe:rtz
 cd /tmp/qwe:rtz
wget 
http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
tar xvzf spark-1.6.1-bin-without-hadoop.tgz 
cd spark-1.6.1-bin-without-hadoop/
bin/spark-submit
{noformat}

Returns "Error: Could not find or load main class 
org.apache.spark.launcher.Main".

That would not be such an issue if Mesos executor did not have colon in the 
generated paths. It means withtout hacking (define relative SPARK_HOME path by 
myself) there's no way to run a spark-job insode a mesos job container...

Best regards, Adam.



> Error: Could not find or load main class org.apache.spark.launcher.Main when 
> run from a directory containing colon ':'
> --
>
> Key: SPARK-15218
> URL: https://issues.apache.org/jira/browse/SPARK-15218
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 1.6.1
>Reporter: Adam Cecile
>  Labels: mesos
>
> {noformat}
> mkdir /tmp/qwe:rtz
> cd /tmp/qwe:rtz
> wget 
> http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
> tar xvzf spark-1.6.1-bin-without-hadoop.tgz 
> cd spark-1.6.1-bin-without-hadoop/
> bin/spark-submit
> {noformat}
> Returns "Error: Could not find or load main class 
> org.apache.spark.launcher.Main".
> That would not be such an issue if Mesos executor did not have colon in the 
> generated paths. It means withtout hacking (define relative SPARK_HOME path 
> by myself) there's no way to run a spark-job insode a mesos job container...
> Best regards, Adam.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization

2016-05-08 Thread Yi Zhou (JIRA)

Yi Zhou created SPARK-15219:
---

 Summary: [Spark SQL] it don't support to detect runtime temporary 
table for enabling broadcast hash join optimization
 Key: SPARK-15219
 URL: https://issues.apache.org/jira/browse/SPARK-15219
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yi Zhou


We observed an interesting thing about broadcast Hash join( similar to Map Join 
in Hive) when comparing the implementation by Hive on MR engine. The blew query 
is a multi-way join operation based on 3 tables including product_reviews, 2 
run-time temporary result tables(fsr and fwr) from ‘select’ query operation and 
also there is a two-way join(1 table and 1 run-time temporary table) in both 
'fsr' and 'fwr'. We investigated the difference between Spark SQL and Hive on 
MR engine and found that there are total of 5 map join tasks with tuned map 
join parameters in Hive on MR but there are only 2 broadcast hash join tasks in 
Spark SQL even if we set a larger threshold(e.g.,1GB) for broadcast hash join. 
From our investigation, it seems that if there is run-time temporary table in 
join operation in Spark SQL engine it will not detect such table for enabling 
broadcast hash join optimization. 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) wr_dateFilter
WHERE wr.wr_returned_date_sk = d_date_sk
GROUP BY wr_item_sk  --across all store and web channels
HAVING wr_item_qty > 0
  ) fwr
  WHERE fsr.sr_item_sk = fwr.wr_item_sk
  AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items
  -- equivalent across all store and web channels (within a tolerance of +/- 
10%)
  AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1
)extractedSentiments
WHERE sentiment= 'NEG' --if there are any major negative reviews.
ORDER BY item_sk,review_sentence,sentiment,sentiment_word
;
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization

2016-05-08 Thread Yi Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-15219:

Description: 
We observed an interesting thing about broadcast Hash join( similar to Map Join 
in Hive) when comparing the implementation by Hive on MR engine. The blew query 
is a multi-way join operation based on 3 tables including product_reviews, 2 
run-time temporary result tables(fsr and fwr) from ‘select’ query operation and 
also there is a two-way join(1 table and 1 run-time temporary table) in both 
'fsr' and 'fwr',which cause slower performance than Hive on MR. We investigated 
the difference between Spark SQL and Hive on MR engine and found that there are 
total of 5 map join tasks with tuned map join parameters in Hive on MR but 
there are only 2 broadcast hash join tasks in Spark SQL even if we set a larger 
threshold(e.g.,1GB) for broadcast hash join. From our investigation, it seems 
that if there is run-time temporary table in join operation in Spark SQL engine 
it will not detect such table for enabling broadcast hash join optimization. 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) wr_dateFilter
WHERE wr.wr_returned_date_sk = d_date_sk
GROUP BY wr_item_sk  --across all store and web channels
HAVING wr_item_qty > 0
  ) fwr
  WHERE fsr.sr_item_sk = fwr.wr_item_sk
  AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items
  -- equivalent across all store and web channels (within a tolerance of +/- 
10%)
  AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1
)extractedSentiments
WHERE sentiment= 'NEG' --if there are any major negative reviews.
ORDER BY item_sk,review_sentence,sentiment,sentiment_word
;
{code}

  was:
We observed an interesting thing about broadcast Hash join( similar to Map Join 
in Hive) when comparing the implementation by Hive on MR engine. The blew query 
is a multi-way join operation based on 3 tables including product_reviews, 2 
run-time temporary result tables(fsr and fwr) from ‘select’ query operation and 
also there is a two-way join(1 table and 1 run-time temporary table) in both 
'fsr' and 'fwr'. We investigated the difference between Spark SQL and Hive on 
MR engine and found that there are total of 5 map join tasks with tuned map 
join parameters in Hive on MR but there are only 2 broadcast hash join tasks in 
Spark SQL even if we set a larger threshold(e.g.,1GB) for broadcast hash join. 
From our investigation, it seems that if there is run-time temporary table in 
join operation in Spark SQL engine it will not detect such table for enabling 
broadcast hash join optimization. 

Core SQL snippet:
{code}
INSERT INTO TABLE q19_spark_sql_power_test_0_result
SELECT *
FROM
( --wrap in additional FROM(), because Sorting/distribute by with UDTF in 
select clause is not allowed
  SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS
  (
item_sk,
review_sentence,
sentiment,
sentiment_word
  )
  FROM product_reviews pr,
  (
--store returns in week ending given date
SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty
FROM store_returns sr,
(
  -- within the week ending a given date
  SELECT d1.d_date_sk
  FROM date_dim d1, date_dim d2
  WHERE d1.d_week_seq = d2.d_week_seq
  AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' )
) sr_dateFilter
WHERE sr.sr_returned_date_sk = d_date_sk
GROUP BY sr_item_sk --across all store and web channels
HAVING sr_item_qty > 0
  ) fsr,
  (
--web returns in week ending given date
SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty
FROM web_returns wr,
(
  -- within the week

80 matches

Mail list logo