[jira] [Updated] (SPARK-20152) Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ"

2017-03-29 Thread Navya Krishnappa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navya Krishnappa updated SPARK-20152:
-
Description: 
When reading the time value mentioned below with "timestampFormat" set to 
"MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored.

Source File: 
TimeColumn
03-21-2017T03:30:02Z

Source code1:
Dataset dataset = getSqlContext().read()
.option(DAWBConstant.PARSER_LIB, "commons")
.option(INFER_SCHEMA, "true")
.option(DAWBConstant.DELIMITER, ",")
.option(DAWBConstant.QUOTE, "\"")
.option(DAWBConstant.ESCAPE, "\\")
.option("timestampFormat" , "MM-dd-'T'HH:mm:ss.SSSZZ")
.option(DAWBConstant.MODE, Mode.PERMISSIVE)
.csv(sourceFile);

Result: TimeColumn is of StringType and the value is "03-21-2017T03:30:02Z", but 
the expected result is that TimeColumn should be of TimestampType and the time 
zone should be taken into account.

Source code2:
Dataset dataset = getSqlContext().read()
.option(DAWBConstant.PARSER_LIB, "commons")
.option(INFER_SCHEMA, "true")
.option(DAWBConstant.DELIMITER, ",")
.option(DAWBConstant.QUOTE, "\"")
.option(DAWBConstant.ESCAPE, "\\")
.option("timestampFormat" , "MM-dd-'T'HH:mm:ss")
.option(DAWBConstant.MODE, Mode.PERMISSIVE)
.csv(sourceFile);

Result: TimeColumn is of TimestampType and the value is "2017-04-22 03:30:02.0", but 
the expected result is that TimeColumn should take the time zone into account.
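
For a quick cross-check outside the DAWB wrapper above, a minimal sketch of the 
same read using only standard DataFrameReader options (the file path and the 
expected output are assumptions, not taken from this ticket):

{code}
// Minimal sketch: read the sample value with an explicit zoned timestamp format
// and inspect the inferred type. Assumes /tmp/time.csv contains the two lines
// shown under "Source File" above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("TsFormatCheck").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "MM-dd-yyyy'T'HH:mm:ss.SSSZZ")
  .csv("/tmp/time.csv")

df.printSchema()       // expectation: TimeColumn: timestamp
df.show(false)         // expectation: the trailing 'Z' (UTC) offset is applied
{code}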

  was:
When reading the time value mentioned below with "timestampFormat" set to 
"MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored.

Sample data: 
TimeColumn
03-21-2017T03:30:02Z


Result: TimeColumn is of StringType and the value is "03-21-2017T03:30:02Z"

Expected Result: TimeColumn should be of TimestampType


> Time zone is not respected while parsing csv for timeStampFormat 
> "MM-dd-'T'HH:mm:ss.SSSZZ"
> --
>
> Key: SPARK-20152
> URL: https://issues.apache.org/jira/browse/SPARK-20152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the time value mentioned below with "timestampFormat" set to 
> "MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored.
> Source File: 
> TimeColumn
> 03-21-2017T03:30:02Z
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(DAWBConstant.PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DAWBConstant.DELIMITER, ",")
> .option(DAWBConstant.QUOTE, "\"")
> .option(DAWBConstant.ESCAPE, "\\")
> .option("timestampFormat" , "MM-dd-'T'HH:mm:ss.SSSZZ")
> .option(DAWBConstant.MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> Result: TimeColumn is of StringType and the value is "03-21-2017T03:30:02Z", but 
> the expected result is that TimeColumn should be of TimestampType and the 
> time zone should be taken into account.
> Source code2:
> Dataset dataset = getSqlContext().read()
> .option(DAWBConstant.PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DAWBConstant.DELIMITER, ",")
> .option(DAWBConstant.QUOTE, "\"")
> .option(DAWBConstant.ESCAPE, "\\")
> .option("timestampFormat" , "MM-dd-'T'HH:mm:ss")
> .option(DAWBConstant.MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> Result: TimeColumn is of TimestampType and the value is "2017-04-22 03:30:02.0", but 
> the expected result is that TimeColumn should take the time zone into account.






[jira] [Created] (SPARK-20152) Time zone is not respected while parsing csv for timeStampFormat "MM-dd-yyyy'T'HH:mm:ss.SSSZZ"

2017-03-29 Thread Navya Krishnappa (JIRA)
Navya Krishnappa created SPARK-20152:


 Summary: Time zone is not respected while parsing csv for 
timeStampFormat "MM-dd-'T'HH:mm:ss.SSSZZ"
 Key: SPARK-20152
 URL: https://issues.apache.org/jira/browse/SPARK-20152
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Navya Krishnappa


When reading the time value mentioned below with "timestampFormat" set to 
"MM-dd-yyyy'T'HH:mm:ss.SSSZZ", the time zone is ignored.

Sample data: 
TimeColumn
03-21-2017T03:30:02Z


Result: TimeColumn is of StringType and the value is "03-21-2017T03:30:02Z"

Expected Result: TimeColumn should be of TimestampType






[jira] [Assigned] (SPARK-20151) Account for partition pruning in scan metadataTime metrics

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20151:


Assignee: Reynold Xin  (was: Apache Spark)

> Account for partition pruning in scan metadataTime metrics
> --
>
> Key: SPARK-20151
> URL: https://issues.apache.org/jira/browse/SPARK-20151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After SPARK-20136, we report metadata timing metrics in the scan operator. 
> However, that timing metric doesn't include one of the most important parts 
> of metadata work, which is partition pruning. This patch adds that time 
> measurement to the scan metrics.






[jira] [Commented] (SPARK-20151) Account for partition pruning in scan metadataTime metrics

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948422#comment-15948422
 ] 

Apache Spark commented on SPARK-20151:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17476

> Account for partition pruning in scan metadataTime metrics
> --
>
> Key: SPARK-20151
> URL: https://issues.apache.org/jira/browse/SPARK-20151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After SPARK-20136, we report metadata timing metrics in the scan operator. 
> However, that timing metric doesn't include one of the most important parts 
> of metadata work, which is partition pruning. This patch adds that time 
> measurement to the scan metrics.






[jira] [Assigned] (SPARK-20151) Account for partition pruning in scan metadataTime metrics

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20151:


Assignee: Apache Spark  (was: Reynold Xin)

> Account for partition pruning in scan metadataTime metrics
> --
>
> Key: SPARK-20151
> URL: https://issues.apache.org/jira/browse/SPARK-20151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> After SPARK-20136, we report metadata timing metrics in the scan operator. 
> However, that timing metric doesn't include one of the most important parts 
> of metadata work, which is partition pruning. This patch adds that time 
> measurement to the scan metrics.






[jira] [Updated] (SPARK-20151) Account for partition pruning in scan metadataTime metrics

2017-03-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20151:

Summary: Account for partition pruning in scan metadataTime metrics  (was: 
Take partition pruning timing into account in scan metadataTime metrics)

> Account for partition pruning in scan metadataTime metrics
> --
>
> Key: SPARK-20151
> URL: https://issues.apache.org/jira/browse/SPARK-20151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After SPARK-20136, we report metadata timing metrics in the scan operator. 
> However, that timing metric doesn't include one of the most important parts 
> of metadata work, which is partition pruning.






[jira] [Created] (SPARK-20151) Take partition pruning timing into account in scan metadataTime metrics

2017-03-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20151:
---

 Summary: Take partition pruning timing into account in scan 
metadataTime metrics
 Key: SPARK-20151
 URL: https://issues.apache.org/jira/browse/SPARK-20151
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


After SPARK-20136, we report metadata timing metrics in the scan operator. However, 
that timing metric doesn't include one of the most important parts of metadata 
work, which is partition pruning.







[jira] [Updated] (SPARK-20151) Account for partition pruning in scan metadataTime metrics

2017-03-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20151:

Description: 
After SPARK-20136, we report metadata timing metrics in the scan operator. However, 
that timing metric doesn't include one of the most important parts of metadata 
work, which is partition pruning. This patch adds that time measurement to the 
scan metrics.


  was:
After SPARK-20136, we report metadata timing metrics in the scan operator. However, 
that timing metric doesn't include one of the most important parts of metadata 
work, which is partition pruning.



> Account for partition pruning in scan metadataTime metrics
> --
>
> Key: SPARK-20151
> URL: https://issues.apache.org/jira/browse/SPARK-20151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> After SPARK-20136, we report metadata timing metrics in the scan operator. 
> However, that timing metric doesn't include one of the most important parts 
> of metadata work, which is partition pruning. This patch adds that time 
> measurement to the scan metrics.






[jira] [Resolved] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages

2017-03-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20148.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.2.0

> Extend the file commit interface to allow subscribing to task commit messages
> -
>
> Key: SPARK-20148
> URL: https://issues.apache.org/jira/browse/SPARK-20148
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.2.0
>
>
> The internal FileCommitProtocol interface returns all task commit messages in 
> bulk to the implementation when a job finishes. However, it is sometimes 
> useful to access those messages before the job completes, so that the driver 
> gets incremental progress updates before the job finishes.






[jira] [Updated] (SPARK-20150) Add permsize statistics for worker memory which may be very useful for the memory usage assessment

2017-03-29 Thread Jinhua Fu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinhua Fu updated SPARK-20150:
--
Summary: Add permsize statistics for worker memory which may be very useful 
for the memory usage assessment  (was: Can the spark add a mechanism for 
permsize statistics which may be very useful for the memory usage assessment)

> Add permsize statistics for worker memory which may be very useful for the 
> memory usage assessment
> --
>
> Key: SPARK-20150
> URL: https://issues.apache.org/jira/browse/SPARK-20150
> Project: Spark
>  Issue Type: Wish
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Jinhua Fu
>
> It seems worker memory is only assigned to the executor heap, which is usually 
> not enough for estimating the whole cluster's memory usage, especially when 
> memory becomes a bottleneck of the cluster. In many cases we found an 
> executor's real memory usage was much larger than its heap size, which forces 
> us to check every application's real memory expenditure.
> This can be improved by adding a mechanism for non-heap (permsize) statistics, 
> shown only as extra memory usage, which has no effect on the current worker 
> memory allocation and statistics. The permsize can be obtained easily from the 
> executor Java options.






[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not backwards compatible with earlier version

2017-03-29 Thread Chico Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948335#comment-15948335
 ] 

Chico Qi commented on SPARK-14492:
--

I had the same issue when I upgraded to Spark 2.1.0; my Hive version is 
1.1.0-cdh5.7.0.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/03/30 16:07:30 WARN spark.SparkContext: Support for Java 7 is deprecated as 
of Spark 2.0.0
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveExternalCatalog':
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveExternalCatalog':
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
  at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
  at 
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at 
org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
  at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
  at 
org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
  ... 63 more
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
  ... 71 more
Caused by: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
  at 
org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:194)
  at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:269)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
  ... 76 more
<console>:14: error: not found: value spark
   import spark.implicits._
  ^
<console>:14: error: not found: value spark
   import spark.sql
  ^
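
For anyone else hitting this, a minimal configuration sketch for pointing Spark 
at an older metastore via the standard spark.sql.hive.metastore.* settings (the 
version string and jar source below are assumptions for a 1.1.0 metastore; adjust 
them to your environment):

{code}
// Sketch: build a session against an older Hive metastore client instead of the
// built-in 1.2.1 one. Values are illustrative, not taken from this thread.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OldMetastore")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.1.0")  // match the running metastore
  .config("spark.sql.hive.metastore.jars", "maven")     // or a classpath with Hive 1.1.0 jars
  .getOrCreate()

spark.sql("show databases").show()
{code}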


> Spark SQL 1.6.0 does not work with Hive version lower than 1.2.0; its not 
> backwards compatible with earlier version
> ---
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this 

[jira] [Created] (SPARK-20150) Can the spark add a mechanism for permsize statistics which may be very useful for the memory usage assessment

2017-03-29 Thread Jinhua Fu (JIRA)
Jinhua Fu created SPARK-20150:
-

 Summary: Can the spark add a mechanism for permsize statistics 
which may be very useful for the memory usage assessment
 Key: SPARK-20150
 URL: https://issues.apache.org/jira/browse/SPARK-20150
 Project: Spark
  Issue Type: Wish
  Components: Web UI
Affects Versions: 2.0.2
Reporter: Jinhua Fu


It seems worker memory is only assigned to the executor heap, which is usually not 
enough for estimating the whole cluster's memory usage, especially when memory 
becomes a bottleneck of the cluster. In many cases we found an executor's real 
memory usage was much larger than its heap size, which forces us to check 
every application's real memory expenditure.

This can be improved by adding a mechanism for non-heap (permsize) 
statistics, shown only as extra memory usage, which has no effect on the current 
worker memory allocation and statistics. The permsize can be obtained easily 
from the executor Java options.
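
As a rough illustration of that last point, a sketch of pulling the permsize out 
of an executor Java options string (the option values and the regex are 
assumptions; real deployments may set the flag differently):

{code}
// Sketch: extract -XX:MaxPermSize from an executor's extra Java options.
object PermSizeFromOptions {
  private val MaxPermSize = """-XX:MaxPermSize=(\S+)""".r.unanchored

  /** Returns the raw MaxPermSize setting (e.g. "256m"), if present. */
  def maxPermSize(javaOptions: String): Option[String] = javaOptions match {
    case MaxPermSize(size) => Some(size)
    case _                 => None
  }

  def main(args: Array[String]): Unit = {
    val opts = "-Xmx4g -XX:MaxPermSize=256m -XX:+UseG1GC"  // assumed example value
    println(maxPermSize(opts))                             // Some(256m)
  }
}
{code}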






[jira] [Resolved] (SPARK-20136) Add num files and metadata operation timing to scan metrics

2017-03-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20136.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add num files and metadata operation timing to scan metrics
> ---
>
> Key: SPARK-20136
> URL: https://issues.apache.org/jira/browse/SPARK-20136
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> We currently do not explicitly include metadata operation timing and the number 
> of files in data source metrics. Those would be useful to include for 
> performance profiling.






[jira] [Resolved] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema

2017-03-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20146.
-
   Resolution: Fixed
 Assignee: Bo Meng
Fix Version/s: 2.2.0

> Column comment information is missing for Thrift Server's TableSchema
> -
>
> Key: SPARK-20146
> URL: https://issues.apache.org/jira/browse/SPARK-20146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bo Meng
>Assignee: Bo Meng
>Priority: Minor
> Fix For: 2.2.0
>
>
> I found this issue while doing some tests against the Thrift Server.
> The column comment information was missing when querying the TableSchema. 
> Currently, all the comments are ignored.
> I will post a fix shortly.






[jira] [Updated] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-29 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-20104:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-16026

> Don't estimate IsNull or IsNotNull predicates for non-leaf node
> ---
>
> Key: SPARK-20104
> URL: https://issues.apache.org/jira/browse/SPARK-20104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> At the current stage, we don't have advanced statistics such as sketches or 
> histograms. As a result, some operators can't estimate `nullCount` accurately. 
> E.g. left outer join estimation does not accurately update `nullCount` 
> currently. So for IsNull and IsNotNull predicates, we only estimate them when 
> the child is a leaf node, whose `nullCount` is accurate. 






[jira] [Commented] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder

2017-03-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948248#comment-15948248
 ] 

Hyukjin Kwon commented on SPARK-18692:
--

Thank you for asking this. Let me give it a shot after testing and double-checking.

> Test Java 8 unidoc build on Jenkins master builder
> --
>
> Key: SPARK-18692
> URL: https://issues.apache.org/jira/browse/SPARK-18692
> Project: Spark
>  Issue Type: Test
>  Components: Build, Documentation
>Reporter: Joseph K. Bradley
>  Labels: jenkins
>
> [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break.  It 
> would be great to add this build to the Spark master builder on Jenkins to 
> make it easier to identify PRs which break doc builds.






[jira] [Resolved] (SPARK-15427) Spark SQL doesn't support field case sensitive when load data use Phoenix

2017-03-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15427.
--
Resolution: Not A Problem

{{SELECT * FROM $table WHERE 1=0}} now seems changeable via a dialect, thanks to 
SPARK-17614, so I am resolving this. Please reopen it if I misunderstood.

I am also resolving it because the related code path seems to me to have changed 
radically since this was reported.
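
For reference, a minimal sketch of the dialect route mentioned above (SPARK-17614 
exposes getSchemaQuery on JdbcDialect); the Phoenix URL prefix and the quoting 
rule below are assumptions, not a tested fix:

{code}
// Sketch: register a JDBC dialect that customizes the schema-probing query and
// preserves case-sensitive, quoted column names. Assumes Spark 2.1+.
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object PhoenixDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:phoenix")

  // Spark issues this to discover the schema; override it if the default
  // "SELECT * FROM $table WHERE 1=0" misbehaves for the source.
  override def getSchemaQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 0"

  // Quote identifiers with double quotes so case is preserved (assumption
  // about Phoenix's quoting rules).
  override def quoteIdentifier(colName: String): String = "\"" + colName + "\""
}

JdbcDialects.registerDialect(PhoenixDialect)
{code}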

> Spark SQL doesn't support field case sensitive when load data use Phoenix
> -
>
> Key: SPARK-15427
> URL: https://issues.apache.org/jira/browse/SPARK-15427
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: deng
>  Labels: easyfix, features, newbie
>
> I use Spark SQL to load data from Apache Phoenix.
> SQLContext sqlContext = new SQLContext(sc);
>  Map<String, String> options = new HashMap<String, String>();
>  options.put("driver", driver);
>  options.put("url", PhoenixUtil.p.getProperty("phoenixURL"));
>   options.put("dbtable", "(select "value","name" from "user")");
>   DataFrame jdbcDF = sqlContext.load("jdbc", options);
> It always throws an exception, like "can't find field VALUE". 
> I tracked the code and found Spark will use:
>   val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 
> 1=0").executeQuery()
> to get the fields. But the field name has already been uppercased, like "value" 
> to VALUE, so it always throws "can't find field VALUE".
> It doesn't handle the case where data is loaded from a source in which fields 
> are case sensitive. 






[jira] [Created] (SPARK-20149) Audit PySpark code base for 2.6 specific work arounds

2017-03-29 Thread holdenk (JIRA)
holdenk created SPARK-20149:
---

 Summary: Audit PySpark code base for 2.6 specific work arounds
 Key: SPARK-20149
 URL: https://issues.apache.org/jira/browse/SPARK-20149
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.2.0
Reporter: holdenk


We should determine which areas in PySpark have Python 2.6-specific workarounds 
and create issues for them. The audit can be started during 2.2.0, but cleaning 
up all of the 2.6-specific code is likely too much to get in, so the actual 
fixing should probably be considered for 2.2.1 or 2.3 (unless 2.2.0 is delayed).






[jira] [Commented] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2017-03-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948142#comment-15948142
 ] 

Joseph K. Bradley commented on SPARK-14657:
---

I'm going to remove the target version, but please retarget if we can 
reactivate this.

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features compared with R's glm when fitting 
> without an intercept on string/category features. In the following example, 
> SparkR outputs three features compared with four features for native R.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length         0.67468   0.0093013   72.536  0
> Species_versicolor  -1.2349    0.07269    -16.989  0
> Species_virginica   -1.4708    0.077397   -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate Std. Error t value Pr(>|t|)
> Sepal.Length         0.3499     0.0463   7.557 4.19e-12 ***
> Speciessetosa        1.6765     0.2354   7.123 4.46e-11 ***
> Speciesversicolor    0.6931     0.2779   2.494   0.0137 *
> Speciesvirginica     0.6690     0.3078   2.174   0.0313 *
> {quote}
> The encoding of string/category features is different. R did not drop any 
> category, but SparkR dropped the last one.
> I searched online and tested some other cases, and found that when we fit an R 
> glm model (or other models powered by R formula) without an intercept on a 
> dataset including string/category features, one of the categories in the first 
> categorical feature is used as the reference category, and no category is 
> dropped for that feature.
> I think we should keep consistent semantics between Spark RFormula and R 
> formula.
> cc [~mengxr] 
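
For comparison on the Spark side, a small runnable sketch of what RFormula 
currently produces for a no-intercept formula (toy data standing in for iris; 
values are made up):

{code}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("RFormulaCheck").getOrCreate()
import spark.implicits._

val df = Seq(
  (3.5, 5.1, "setosa"),
  (3.2, 6.4, "versicolor"),
  (3.3, 6.3, "virginica"),
  (3.1, 4.9, "setosa")
).toDF("Sepal_Width", "Sepal_Length", "Species")

val formula = new RFormula()
  .setFormula("Sepal_Width ~ Sepal_Length + Species - 1")

val encoded = formula.fit(df).transform(df)
encoded.select("features", "label").show(truncate = false)
// With the behavior described above, Species contributes only two dummy columns
// (last category dropped) even though the formula has no intercept, whereas R
// keeps all three in this case.
{code}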






[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14657:
--
Target Version/s:   (was: 2.2.0)

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features compared with R's glm when fitting 
> without an intercept on string/category features. In the following example, 
> SparkR outputs three features compared with four features for native R.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length         0.67468   0.0093013   72.536  0
> Species_versicolor  -1.2349    0.07269    -16.989  0
> Species_virginica   -1.4708    0.077397   -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate Std. Error t value Pr(>|t|)
> Sepal.Length         0.3499     0.0463   7.557 4.19e-12 ***
> Speciessetosa        1.6765     0.2354   7.123 4.46e-11 ***
> Speciesversicolor    0.6931     0.2779   2.494   0.0137 *
> Speciesvirginica     0.6690     0.3078   2.174   0.0313 *
> {quote}
> The encoding of string/category features is different. R did not drop any 
> category, but SparkR dropped the last one.
> I searched online and tested some other cases, and found that when we fit an R 
> glm model (or other models powered by R formula) without an intercept on a 
> dataset including string/category features, one of the categories in the first 
> categorical feature is used as the reference category, and no category is 
> dropped for that feature.
> I think we should keep consistent semantics between Spark RFormula and R 
> formula.
> cc [~mengxr] 






[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14657:
--
Shepherd:   (was: Xiangrui Meng)

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features compared with R's glm when fitting 
> without an intercept on string/category features. In the following example, 
> SparkR outputs three features compared with four features for native R.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
>                     Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length         0.67468   0.0093013   72.536  0
> Species_versicolor  -1.2349    0.07269    -16.989  0
> Species_virginica   -1.4708    0.077397   -19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>                    Estimate Std. Error t value Pr(>|t|)
> Sepal.Length         0.3499     0.0463   7.557 4.19e-12 ***
> Speciessetosa        1.6765     0.2354   7.123 4.46e-11 ***
> Speciesversicolor    0.6931     0.2779   2.494   0.0137 *
> Speciesvirginica     0.6690     0.3078   2.174   0.0313 *
> {quote}
> The encoding of string/category features is different. R did not drop any 
> category, but SparkR dropped the last one.
> I searched online and tested some other cases, and found that when we fit an R 
> glm model (or other models powered by R formula) without an intercept on a 
> dataset including string/category features, one of the categories in the first 
> categorical feature is used as the reference category, and no category is 
> dropped for that feature.
> I think we should keep consistent semantics between Spark RFormula and R 
> formula.
> cc [~mengxr] 






[jira] [Commented] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948085#comment-15948085
 ] 

Apache Spark commented on SPARK-20148:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/17475

> Extend the file commit interface to allow subscribing to task commit messages
> -
>
> Key: SPARK-20148
> URL: https://issues.apache.org/jira/browse/SPARK-20148
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>Priority: Minor
>
> The internal FileCommitProtocol interface returns all task commit messages in 
> bulk to the implementation when a job finishes. However, it is sometimes 
> useful to access those messages before the job completes, so that the driver 
> gets incremental progress updates before the job finishes.






[jira] [Assigned] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20148:


Assignee: (was: Apache Spark)

> Extend the file commit interface to allow subscribing to task commit messages
> -
>
> Key: SPARK-20148
> URL: https://issues.apache.org/jira/browse/SPARK-20148
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>Priority: Minor
>
> The internal FileCommitProtocol interface returns all task commit messages in 
> bulk to the implementation when a job finishes. However, it is sometimes 
> useful to access those messages before the job completes, so that the driver 
> gets incremental progress updates before the job finishes.






[jira] [Assigned] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20148:


Assignee: Apache Spark

> Extend the file commit interface to allow subscribing to task commit messages
> -
>
> Key: SPARK-20148
> URL: https://issues.apache.org/jira/browse/SPARK-20148
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> The internal FileCommitProtocol interface returns all task commit messages in 
> bulk to the implementation when a job finishes. However, it is sometimes 
> useful to access those messages before the job completes, so that the driver 
> gets incremental progress updates before the job finishes.






[jira] [Created] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages

2017-03-29 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20148:
--

 Summary: Extend the file commit interface to allow subscribing to 
task commit messages
 Key: SPARK-20148
 URL: https://issues.apache.org/jira/browse/SPARK-20148
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Eric Liang
Priority: Minor


The internal FileCommitProtocol interface returns all task commit messages in 
bulk to the implementation when a job finishes. However, it is sometimes useful 
to access those messages before the job completes, so that the driver gets 
incremental progress updates before the job finishes.






[jira] [Updated] (SPARK-18958) SparkR should support toJSON on DataFrame

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18958:
--
Fix Version/s: 2.2.0

> SparkR should support toJSON on DataFrame
> -
>
> Key: SPARK-18958
> URL: https://issues.apache.org/jira/browse/SPARK-18958
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.2.0
>
>
> It makes it easier to interoperate with other components (especially since R 
> does not have JSON support built in).






[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3723:
-
Target Version/s:   (was: 2.2.0)

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3723:
-
Component/s: (was: MLlib)
 ML

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3723:
-
Shepherd: Joseph K. Bradley

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Commented] (SPARK-18570) Consider supporting other R formula operators

2017-03-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948056#comment-15948056
 ] 

Joseph K. Bradley commented on SPARK-18570:
---

Is this still targeted for 2.2, or shall we retarget it?

> Consider supporting other R formula operators
> -
>
> Key: SPARK-18570
> URL: https://issues.apache.org/jira/browse/SPARK-18570
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Such as
> {code}
> ∗ 
>  X∗Y include these variables and the interactions between them
> ^
>  (X + Z + W)^3 include these variables and all interactions up to three way
> |
>  X | Z conditioning: include x given z
> {code}
> Others include %in% and ` (backtick).
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html






[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2017-03-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948055#comment-15948055
 ] 

Joseph K. Bradley commented on SPARK-3181:
--

Is this still active, and should it be targeted at 2.2?

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Fan Jiang
>Assignee: Yanbo Liang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors have a normal distribution and 
> can behave badly when the errors are heavy-tailed. In practice we get 
> various types of data. We need to include robust regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression, where "M" stands for 
> "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
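
To make the fitting criterion concrete, a small sketch of the standard Huber loss 
and its clipped gradient (textbook definitions, not code from the proposed files; 
delta is the usual tuning constant):

{code}
// Sketch: Huber loss is quadratic near zero (like least squares) and linear in
// the tails (like absolute error), which is what damps the effect of outliers.
object HuberLoss {
  def loss(residual: Double, delta: Double = 1.345): Double = {
    val a = math.abs(residual)
    if (a <= delta) 0.5 * residual * residual
    else delta * (a - 0.5 * delta)
  }

  // d(loss)/d(residual): the residual clipped to [-delta, delta].
  def gradient(residual: Double, delta: Double = 1.345): Double =
    math.max(-delta, math.min(delta, residual))

  def main(args: Array[String]): Unit = {
    println(loss(0.5))      // 0.125   (quadratic region)
    println(loss(10.0))     // ~12.55  (linear region)
    println(gradient(10.0)) // 1.345   (clipped)
  }
}
{code}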






[jira] [Updated] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14659:
--
Target Version/s:   (was: 2.2.0)

> OneHotEncoder support drop first category alphabetically in the encoded 
> vector 
> ---
>
> Key: SPARK-14659
> URL: https://issues.apache.org/jira/browse/SPARK-14659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> R formula drops the first category alphabetically when encoding a string/category 
> feature. Spark RFormula uses OneHotEncoder to encode string/category features 
> into a vector, but it only supports "dropLast" by string/category frequencies. 
> This will cause SparkR to produce different models compared with native R.
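
For context, a minimal sketch of the dropLast knob that exists on OneHotEncoder 
today (toy data; the alphabetical drop-first behavior requested here is not 
available, only last-index dropping):

{code}
// Sketch: index a string column, then one-hot encode it. setDropLast(true)
// drops the last category index (StringIndexer orders categories by frequency),
// which is the only dropping mode currently supported.
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("DropLastDemo").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c", "a", "b", "a").toDF("category")  // toy data

val indexed = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIndex")
  .fit(df).transform(df)

val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex").setOutputCol("categoryVec")
  .setDropLast(true)   // drops the last index, not the first category alphabetically
  .transform(indexed)

encoded.show(truncate = false)
{code}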






[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector

2017-03-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948049#comment-15948049
 ] 

Joseph K. Bradley commented on SPARK-14659:
---

[~actuaryzhang] I'm sorry I haven't had time to check on this; there have just 
been too many other things.  I'll remove the target version until someone can 
shepherd it.

> OneHotEncoder support drop first category alphabetically in the encoded 
> vector 
> ---
>
> Key: SPARK-14659
> URL: https://issues.apache.org/jira/browse/SPARK-14659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> R formula drops the first category alphabetically when encoding a string/category 
> feature. Spark RFormula uses OneHotEncoder to encode string/category features 
> into a vector, but it only supports "dropLast" by string/category frequencies. 
> This will cause SparkR to produce different models compared with native R.






[jira] [Updated] (SPARK-18822) Support ML Pipeline in SparkR

2017-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18822:
--
Target Version/s:   (was: 2.2.0)

> Support ML Pipeline in SparkR
> -
>
> Key: SPARK-18822
> URL: https://issues.apache.org/jira/browse/SPARK-18822
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> From Joseph Bradley:
> "
> Supporting Pipelines and advanced use cases: There really needs to be more 
> design discussion around SparkR. Felix Cheung would you be interested in 
> leading some discussion? I'm envisioning something similar to what was done a 
> while back for Pipelines in Scala/Java/Python, where we consider several use 
> cases of MLlib: fitting a single model, creating and tuning a complex 
> Pipeline, and working with multiple languages. That should help inform what 
> APIs should look like in Spark R.
> "
> Certain ML models, such as OneVsRest, are harder to represent in a single-call 
> R API. Having an advanced API or a Pipeline API like this could help expose 
> them to our users.






[jira] [Commented] (SPARK-18822) Support ML Pipeline in SparkR

2017-03-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948045#comment-15948045
 ] 

Joseph K. Bradley commented on SPARK-18822:
---

Since 2.2 will be cut soon (I presume), I'm going to untarget this.  Felix, 
please retarget if you like.

> Support ML Pipeline in SparkR
> -
>
> Key: SPARK-18822
> URL: https://issues.apache.org/jira/browse/SPARK-18822
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> From Joseph Bradley:
> "
> Supporting Pipelines and advanced use cases: There really needs to be more 
> design discussion around SparkR. Felix Cheung would you be interested in 
> leading some discussion? I'm envisioning something similar to what was done a 
> while back for Pipelines in Scala/Java/Python, where we consider several use 
> cases of MLlib: fitting a single model, creating and tuning a complex 
> Pipeline, and working with multiple languages. That should help inform what 
> APIs should look like in Spark R.
> "
> Certain ML models, such as OneVsRest, are harder to represent in a single-call 
> R API. Having an advanced API or a Pipeline API like this could help expose 
> them to our users.






[jira] [Updated] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20103:
-
Fix Version/s: 2.2.0

> Spark structured steaming from kafka - last message processed again after 
> resume from checkpoint
> 
>
> Key: SPARK-20103
> URL: https://issues.apache.org/jira/browse/SPARK-20103
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Linux, Spark 2.10 
>Reporter: Rajesh Mutha
>  Labels: spark, streaming
> Fix For: 2.2.0
>
>
> When the application starts after a failure or a graceful shutdown, it 
> consistently reprocesses the last message of the previous batch even though it 
> was already processed correctly without failure.
> We are making sure database writes are idempotent using a Postgres 9.6 feature. 
> Is this the default behavior of Spark? I added a code snippet with 2 
> streaming queries. One of the queries is idempotent; since query2 is not 
> idempotent, we are seeing duplicate entries in the table. 
> {code}
> object StructuredStreaming {
>   def main(args: Array[String]): Unit = {
> val db_url = 
> "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
> val spark = SparkSession
>   .builder
>   .appName("StructuredKafkaReader")
>   .master("local[*]")
>   .getOrCreate()
> spark.conf.set("spark.sql.streaming.checkpointLocation", 
> "/tmp/checkpoint_research/")
> import spark.implicits._
> val server = "10.205.82.113:9092"
> val topic = "checkpoint"
> val subscribeType="subscribe"
> val lines = spark
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", server)
>   .option(subscribeType, topic)
>   .load().selectExpr("CAST(value AS STRING)").as[String]
> lines.printSchema()
> import org.apache.spark.sql.ForeachWriter
> val writer = new ForeachWriter[String] {
>def open(partitionId: Long, version: Long):  Boolean = {
>  println("After db props"); true
>}
>def process(value: String) = {
>  val conn = DriverManager.getConnection(db_url)
>  try{
>conn.createStatement().executeUpdate("INSERT INTO 
> PUBLIC.checkpoint1 VALUES ('"+value+"')")
>  }
>  finally {
>conn.close()
>  }
>   }
>def close(errorOrNull: Throwable) = {}
> }
> import scala.concurrent.duration._
> val query1 = lines.writeStream
>  .outputMode("append")
>  .queryName("checkpoint1")
>  .trigger(ProcessingTime(30.seconds))
>  .foreach(writer)
>  .start()
>  val writer2 = new ForeachWriter[String] {
>   def open(partitionId: Long, version: Long):  Boolean = {
> println("After db props"); true
>   }
>   def process(value: String) = {
> val conn = DriverManager.getConnection(db_url)
> try{
>   conn.createStatement().executeUpdate("INSERT INTO 
> PUBLIC.checkpoint2 VALUES ('"+value+"')")
> }
> finally {
>   conn.close()
> }
>}
>   def close(errorOrNull: Throwable) = {}
> }
> import scala.concurrent.duration._
> val query2 = lines.writeStream
>   .outputMode("append")
>   .queryName("checkpoint2")
>   .trigger(ProcessingTime(30.seconds))
>   .foreach(writer2)
>   .start()
> query2.awaitTermination()
> query1.awaitTermination()
> }}
> {code}
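
Since the fix on the application side hinges on making the sink idempotent, a 
rough sketch of what an idempotent version of query2's writer could look like 
using Postgres 9.6 upserts (the table name, key column, and connection URL are 
assumptions; the target table needs a unique constraint for ON CONFLICT to apply):

{code}
// Sketch: a ForeachWriter whose writes are safe to replay after recovery.
// Assumes a table PUBLIC.checkpoint2(value TEXT PRIMARY KEY) and a reachable dbUrl.
import java.sql.DriverManager
import org.apache.spark.sql.ForeachWriter

class IdempotentWriter(dbUrl: String) extends ForeachWriter[String] {
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: String): Unit = {
    val conn = DriverManager.getConnection(dbUrl)
    try {
      // ON CONFLICT DO NOTHING turns a replayed row into a no-op instead of a duplicate.
      val stmt = conn.prepareStatement(
        "INSERT INTO PUBLIC.checkpoint2 (value) VALUES (?) ON CONFLICT (value) DO NOTHING")
      stmt.setString(1, value)
      stmt.executeUpdate()
    } finally {
      conn.close()
    }
  }

  override def close(errorOrNull: Throwable): Unit = ()
}
{code}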






[jira] [Commented] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948035#comment-15948035
 ] 

Michael Armbrust commented on SPARK-20103:
--

It is fixed in 2.2 but by [SPARK-19876].

> Spark structured steaming from kafka - last message processed again after 
> resume from checkpoint
> 
>
> Key: SPARK-20103
> URL: https://issues.apache.org/jira/browse/SPARK-20103
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Linux, Spark 2.10 
>Reporter: Rajesh Mutha
>  Labels: spark, streaming
> Fix For: 2.2.0
>
>
> When the application starts after a failure or a graceful shutdown, it 
> consistently reprocesses the last message of the previous batch even though it 
> was already processed correctly without failure.
> We are making sure database writes are idempotent using a Postgres 9.6 feature. 
> Is this the default behavior of Spark? I added a code snippet with 2 
> streaming queries. One of the queries is idempotent; since query2 is not 
> idempotent, we are seeing duplicate entries in the table. 
> {code}
> object StructuredStreaming {
>   def main(args: Array[String]): Unit = {
> val db_url = 
> "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
> val spark = SparkSession
>   .builder
>   .appName("StructuredKafkaReader")
>   .master("local[*]")
>   .getOrCreate()
> spark.conf.set("spark.sql.streaming.checkpointLocation", 
> "/tmp/checkpoint_research/")
> import spark.implicits._
> val server = "10.205.82.113:9092"
> val topic = "checkpoint"
> val subscribeType="subscribe"
> val lines = spark
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", server)
>   .option(subscribeType, topic)
>   .load().selectExpr("CAST(value AS STRING)").as[String]
> lines.printSchema()
> import org.apache.spark.sql.ForeachWriter
> val writer = new ForeachWriter[String] {
>def open(partitionId: Long, version: Long):  Boolean = {
>  println("After db props"); true
>}
>def process(value: String) = {
>  val conn = DriverManager.getConnection(db_url)
>  try{
>conn.createStatement().executeUpdate("INSERT INTO 
> PUBLIC.checkpoint1 VALUES ('"+value+"')")
>  }
>  finally {
>conn.close()
>  }
>   }
>def close(errorOrNull: Throwable) = {}
> }
> import scala.concurrent.duration._
> val query1 = lines.writeStream
>  .outputMode("append")
>  .queryName("checkpoint1")
>  .trigger(ProcessingTime(30.seconds))
>  .foreach(writer)
>  .start()
>  val writer2 = new ForeachWriter[String] {
>   def open(partitionId: Long, version: Long):  Boolean = {
> println("After db props"); true
>   }
>   def process(value: String) = {
> val conn = DriverManager.getConnection(db_url)
> try{
>   conn.createStatement().executeUpdate("INSERT INTO 
> PUBLIC.checkpoint2 VALUES ('"+value+"')")
> }
> finally {
>   conn.close()
> }
>}
>   def close(errorOrNull: Throwable) = {}
> }
> import scala.concurrent.duration._
> val query2 = lines.writeStream
>   .outputMode("append")
>   .queryName("checkpoint2")
>   .trigger(ProcessingTime(30.seconds))
>   .foreach(writer2)
>   .start()
> query2.awaitTermination()
> query1.awaitTermination()
> }}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20103:
-
Description: 
When the application starts after a failure or a graceful shutdown, it 
consistently reprocesses the last message of the previous batch even though it 
was already processed correctly without failure.

We make database writes idempotent using a Postgres 9.6 feature. Is this the 
default behavior of Spark? I added a code snippet with two streaming queries. 
One of the queries is idempotent; since query2 is not, we are seeing duplicate 
entries in its table. 

{code}
// Imports added so the snippet compiles as posted.
import java.sql.DriverManager

import scala.concurrent.duration._

import org.apache.spark.sql.{ForeachWriter, SparkSession}
import org.apache.spark.sql.streaming.ProcessingTime

object StructuredStreaming {
  def main(args: Array[String]): Unit = {
    val db_url =
      "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
    val spark = SparkSession
      .builder
      .appName("StructuredKafkaReader")
      .master("local[*]")
      .getOrCreate()
    spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint_research/")
    import spark.implicits._

    val server = "10.205.82.113:9092"
    val topic = "checkpoint"
    val subscribeType = "subscribe"

    // Kafka source: each record value is read back as a plain string.
    val lines = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", server)
      .option(subscribeType, topic)
      .load().selectExpr("CAST(value AS STRING)").as[String]
    lines.printSchema()

    // Sink 1: writes to checkpoint1 (the query treated as idempotent in this report).
    val writer = new ForeachWriter[String] {
      def open(partitionId: Long, version: Long): Boolean = {
        println("After db props"); true
      }
      def process(value: String) = {
        val conn = DriverManager.getConnection(db_url)
        try {
          conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 VALUES ('" + value + "')")
        } finally {
          conn.close()
        }
      }
      def close(errorOrNull: Throwable) = {}
    }

    val query1 = lines.writeStream
      .outputMode("append")
      .queryName("checkpoint1")
      .trigger(ProcessingTime(30.seconds))
      .foreach(writer)
      .start()

    // Sink 2: writes to checkpoint2 (not idempotent, which is where the duplicates show up).
    val writer2 = new ForeachWriter[String] {
      def open(partitionId: Long, version: Long): Boolean = {
        println("After db props"); true
      }
      def process(value: String) = {
        val conn = DriverManager.getConnection(db_url)
        try {
          conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint2 VALUES ('" + value + "')")
        } finally {
          conn.close()
        }
      }
      def close(errorOrNull: Throwable) = {}
    }

    val query2 = lines.writeStream
      .outputMode("append")
      .queryName("checkpoint2")
      .trigger(ProcessingTime(30.seconds))
      .foreach(writer2)
      .start()

    query2.awaitTermination()
    query1.awaitTermination()
  }
}
{code}

  was:
When the application starts after a failure or a graceful shutdown, it 
consistently reprocesses the last message of the previous batch even though it 
was already processed correctly without failure.

We make database writes idempotent using a Postgres 9.6 feature. Is this the 
default behavior of Spark? I added a code snippet with two streaming queries. 
One of the queries is idempotent; since query2 is not, we are seeing duplicate 
entries in its table. 

---
object StructuredStreaming {
  def main(args: Array[String]): Unit = {
val db_url = 
"jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
val spark = SparkSession
  .builder
  .appName("StructuredKafkaReader")
  .master("local[*]")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.checkpointLocation", 
"/tmp/checkpoint_research/")
import spark.implicits._
val server = "10.205.82.113:9092"
val topic = "checkpoint"
val subscribeType="subscribe"
val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", server)
  .option(subscribeType, topic)
  .load().selectExpr("CAST(value AS STRING)").as[String]
lines.printSchema()
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[String] {
   def open(partitionId: Long, version: Long):  Boolean = {
 println("After db props"); true
   }
   def process(value: String) = {
 val conn = DriverManager.getConnection(db_url)
 try{
   conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 
VALUES ('"+value+"')")
 }
 finally {
   conn.close()
 }
  }
   def close(errorOrNull: Throwable) = {}
}
import scala.concurrent.duration._
val query1 = 

[jira] [Updated] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint

2017-03-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20103:
-
Docs Text:   (was: object StructuredStreaming {
  def main(args: Array[String]): Unit = {
val db_url = 
"jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
val spark = SparkSession
  .builder
  .appName("StructuredKafkaReader")
  .master("local[*]")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.checkpointLocation", 
"/tmp/checkpoint_research/")
import spark.implicits._
val server = "10.205.82.113:9092"
val topic = "checkpoint"
val subscribeType="subscribe"
val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", server)
  .option(subscribeType, topic)
  .load().selectExpr("CAST(value AS STRING)").as[String]
lines.printSchema()
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[String] {
   def open(partitionId: Long, version: Long):  Boolean = {
 println("After db props"); true
   }
   def process(value: String) = {
 val conn = DriverManager.getConnection(db_url)
 try{
   conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 
VALUES ('"+value+"')")
 }
 finally {
   conn.close()
 }
  }
   def close(errorOrNull: Throwable) = {}
}
import scala.concurrent.duration._
val query1 = lines.writeStream
 .outputMode("append")
 .queryName("checkpoint1")
 .trigger(ProcessingTime(30.seconds))
 .foreach(writer)
 .start()
 val writer2 = new ForeachWriter[String] {
  def open(partitionId: Long, version: Long):  Boolean = {
println("After db props"); true
  }
  def process(value: String) = {
val conn = DriverManager.getConnection(db_url)
try{
  conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint2 
VALUES ('"+value+"')")
}
finally {
  conn.close()
}
   }
  def close(errorOrNull: Throwable) = {}
}
import scala.concurrent.duration._
val query2 = lines.writeStream
  .outputMode("append")
  .queryName("checkpoint2")
  .trigger(ProcessingTime(30.seconds))
  .foreach(writer2)
  .start()
query2.awaitTermination()
query1.awaitTermination()
}})

> Spark structured streaming from kafka - last message processed again after 
> resume from checkpoint
> 
>
> Key: SPARK-20103
> URL: https://issues.apache.org/jira/browse/SPARK-20103
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Linux, Spark 2.10 
>Reporter: Rajesh Mutha
>  Labels: spark, streaming
>
> When the application starts after a failure or a graceful shutdown, it 
> consistently reprocesses the last message of the previous batch even though it 
> was already processed correctly without failure.
> We make database writes idempotent using a Postgres 9.6 feature. Is this the 
> default behavior of Spark? I added a code snippet with two streaming queries. 
> One of the queries is idempotent; since query2 is not, we are seeing duplicate 
> entries in its table. 
> {code}
> object StructuredStreaming {
>   def main(args: Array[String]): Unit = {
> val db_url = 
> "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
> val spark = SparkSession
>   .builder
>   .appName("StructuredKafkaReader")
>   .master("local[*]")
>   .getOrCreate()
> spark.conf.set("spark.sql.streaming.checkpointLocation", 
> "/tmp/checkpoint_research/")
> import spark.implicits._
> val server = "10.205.82.113:9092"
> val topic = "checkpoint"
> val subscribeType="subscribe"
> val lines = spark
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", server)
>   .option(subscribeType, topic)
>   .load().selectExpr("CAST(value AS STRING)").as[String]
> lines.printSchema()
> import org.apache.spark.sql.ForeachWriter
> val writer = new ForeachWriter[String] {
>def open(partitionId: Long, version: Long):  Boolean = {
>  println("After db props"); true
>}
>def process(value: String) = {
>  val conn = DriverManager.getConnection(db_url)
>  try{
>conn.createStatement().executeUpdate("INSERT INTO 
> PUBLIC.checkpoint1 VALUES ('"+value+"')")
>  }
>  finally {
>

[jira] [Resolved] (SPARK-20120) spark-sql CLI support silent mode

2017-03-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20120.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.2.0

> spark-sql CLI support silent mode
> -
>
> Key: SPARK-20120
> URL: https://issues.apache.org/jira/browse/SPARK-20120
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.2.0
>
>
> It is similar to Hive silent mode, which just shows the query result; see:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20147) Cloning SessionState does not clone streaming query listeners

2017-03-29 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20147:
-
Description: 
Cloning session should clone StreamingQueryListeners registered on the 
StreamingQueryListenerBus.
Similar to SPARK-20048, https://github.com/apache/spark/pull/17379

  was:Cloning session should clone StreamingQueryListeners registered on the 
StreamingQueryListenerBus.


> Cloning SessionState does not clone streaming query listeners
> -
>
> Key: SPARK-20147
> URL: https://issues.apache.org/jira/browse/SPARK-20147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> Cloning session should clone StreamingQueryListeners registered on the 
> StreamingQueryListenerBus.
> Similar to SPARK-20048, https://github.com/apache/spark/pull/17379
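For context, a minimal sketch of what gets registered on the bus today (the listener itself is hypothetical); per this ticket, a cloned SessionState would not carry such a registration over:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.appName("listener-demo").master("local[*]").getOrCreate()

// A trivial listener registered on the session's StreamingQueryListenerBus.
val auditListener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"batch ${event.progress.batchId} of query ${event.progress.name}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}")
}

spark.streams.addListener(auditListener)
{code}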



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20147) Cloning SessionState does not clone streaming query listeners

2017-03-29 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20147:
-
Description: Cloning session should clone StreamingQueryListeners 
registered on the StreamingQueryListenerBus.

> Cloning SessionState does not clone streaming query listeners
> -
>
> Key: SPARK-20147
> URL: https://issues.apache.org/jira/browse/SPARK-20147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> Cloning session should clone StreamingQueryListeners registered on the 
> StreamingQueryListenerBus.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20147) Cloning SessionState does not clone streaming query listeners

2017-03-29 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-20147:


 Summary: Cloning SessionState does not clone streaming query 
listeners
 Key: SPARK-20147
 URL: https://issues.apache.org/jira/browse/SPARK-20147
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Kunal Khamar






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19088) Optimize sequence type deserialization codegen

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947972#comment-15947972
 ] 

Apache Spark commented on SPARK-19088:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17473

> Optimize sequence type deserialization codegen
> --
>
> Key: SPARK-19088
> URL: https://issues.apache.org/jira/browse/SPARK-19088
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michal Šenkýř
>Assignee: Michal Šenkýř
>Priority: Minor
>  Labels: performance
> Fix For: 2.2.0
>
>
> Sequence type deserialization codegen added in [PR 
> #16240|https://github.com/apache/spark/pull/16240] should use a proper 
> builder instead of a conversion (using {{to}}) to avoid an additional pass.
> This will require an additional {{MapObjects}}-like operation that will use 
> the provided builder instead of building an array.
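For readers unfamiliar with the builder-versus-conversion point, a plain-Scala sketch (nothing to do with the Catalyst codegen itself) of why the conversion costs an extra pass:

{code}
import scala.collection.mutable

val input = Array(1, 2, 3, 4)

// Conversion: materialize an intermediate array, then walk it again with `to`.
val viaConversion: List[Int] = input.map(_ * 2).to[List]

// Builder: append straight into a builder for the target type in a single pass.
val builder: mutable.Builder[Int, List[Int]] = List.newBuilder[Int]
builder.sizeHint(input.length)
input.foreach(x => builder += x * 2)
val viaBuilder: List[Int] = builder.result()
{code}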



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2017-03-29 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-20144:
---
Summary: spark.read.parquet no long maintains ordering of the data  (was: 
spark.read.parquet no long maintains the ordering the the data)

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe the parquet 
> file was produced from. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is 
> still an issue in 2.1.
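A common workaround sketch, not a Spark-side fix (the ordering column name is made up): write an explicit ordering column and sort on it after reading, so row order no longer depends on how the reader splits and combines the files:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder.appName("ordering-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Tag each row with an ordering key before writing; any application-level
// sequence works, monotonically_increasing_id is just a convenient stand-in.
val df = Seq("a", "b", "c").toDF("payload")
  .withColumn("row_idx", monotonically_increasing_id())
df.write.mode("overwrite").parquet("/tmp/ordering_demo")

// After reading, the order is re-established explicitly rather than assumed.
val restored = spark.read.parquet("/tmp/ordering_demo")
  .orderBy("row_idx")
  .drop("row_idx")
{code}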



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't

2017-03-29 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947781#comment-15947781
 ] 

sam elamin commented on SPARK-20145:


If no one is picking this up, I'd love to take it.

> "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
> 
>
> Key: SPARK-20145
> URL: https://issues.apache.org/jira/browse/SPARK-20145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> Executed at clean tip of the master branch, with all default settings:
> scala> spark.sql("SELECT * FROM range(1)")
> res1: org.apache.spark.sql.DataFrame = [id: bigint]
> scala> spark.sql("SELECT * FROM RANGE(1)")
> org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a 
> table-valued function; line 1 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
> ...
> I believe it should be case insensitive?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20009.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.
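As a sketch of what the change buys; the StructType form below is the existing API, while the DDL-string form is the one this ticket enables (its exact overload is assumed here):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("ddl-schema-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")

// Today: the schema has to be built programmatically.
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
df.select(from_json($"json", schema).as("parsed")).show()

// With this ticket, a user-friendly DDL string such as "a INT, b STRING"
// is meant to describe the same schema instead of the StructType above.
{code}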



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20009) Use user-friendly DDL formats for defining a schema in functions.from_json

2017-03-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20009:

Summary: Use user-friendly DDL formats for defining a schema  in 
functions.from_json  (was: Use user-friendly DDL formats for defining a schema  
in user-facing APIs)

> Use user-friendly DDL formats for defining a schema  in functions.from_json
> ---
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20048) Cloning SessionState does not clone query execution listeners

2017-03-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20048.
---
   Resolution: Fixed
 Assignee: Kunal Khamar
Fix Version/s: 2.2.0

> Cloning SessionState does not clone query execution listeners
> -
>
> Key: SPARK-20048
> URL: https://issues.apache.org/jira/browse/SPARK-20048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()

2017-03-29 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947761#comment-15947761
 ] 

sam elamin commented on SPARK-1:


Can someone assign this to me? I'm happy to take it over.

> Test failures in Spark Core due to java.nio.Bits.unaligned()
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>  Labels: ppc64le
> Attachments: Core.patch
>
>
> There are multiple test failures seen in Spark Core project with the 
> following error message :
> {code:borderStyle=solid}
> java.lang.IllegalArgumentException: requirement failed: No support for 
> unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
> {code}
> These errors occur due to java.nio.Bits.unaligned(), which does not return 
> true for the ppc64le arch.
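For anyone who wants to confirm what the JVM reports on a given box, a small reflection sketch against the same private method the check consults (works on the JDK 8 builds listed above):

{code}
// java.nio.Bits.unaligned() is package-private, so go through reflection.
val bitsClass = Class.forName("java.nio.Bits")
val unalignedMethod = bitsClass.getDeclaredMethod("unaligned")
unalignedMethod.setAccessible(true)
val unaligned = unalignedMethod.invoke(null).asInstanceOf[Boolean]
println(s"java.nio.Bits.unaligned() = $unaligned")  // reported false on ppc64le here
{code}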



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947755#comment-15947755
 ] 

Apache Spark commented on SPARK-1:
--

User 'samelamin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17472

> Test failures in Spark Core due to java.nio.Bits.unaligned()
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>  Labels: ppc64le
> Attachments: Core.patch
>
>
> There are multiple test failures seen in Spark Core project with the 
> following error message :
> {code:borderStyle=solid}
> java.lang.IllegalArgumentException: requirement failed: No support for 
> unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
> {code}
> These errors occur due to java.nio.Bits.unaligned(), which does not return 
> true for the ppc64le arch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1:


Assignee: Apache Spark

> Test failures in Spark Core due to java.nio.Bits.unaligned()
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>Assignee: Apache Spark
>  Labels: ppc64le
> Attachments: Core.patch
>
>
> There are multiple test failures seen in Spark Core project with the 
> following error message :
> {code:borderStyle=solid}
> java.lang.IllegalArgumentException: requirement failed: No support for 
> unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
> {code}
> These errors occur due to java.nio.Bits.unaligned(), which does not return 
> true for the ppc64le arch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1:


Assignee: (was: Apache Spark)

> Test failures in Spark Core due to java.nio.Bits.unaligned()
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>  Labels: ppc64le
> Attachments: Core.patch
>
>
> There are multiple test failures seen in Spark Core project with the 
> following error message :
> {code:borderStyle=solid}
> java.lang.IllegalArgumentException: requirement failed: No support for 
> unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
> {code}
> These errors occur due to java.nio.Bits.unaligned(), which does not return 
> true for the ppc64le arch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16938) Cannot resolve column name after a join

2017-03-29 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947732#comment-15947732
 ] 

sam elamin edited comment on SPARK-16938 at 3/29/17 7:20 PM:
-

[~cloud_fan] Could you please check my comment on the GitHub PR?

I am happy to pick up this ticket; can someone assign it to me, please?


was (Author: samelamin):
[~cloud_fan] I am happy picking up this ticket, can someone assign it to me 
please

> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> Found a change of behavior on spark-2.0.0, which breaks a query in our code 
> base.
> The following works on previous spark versions, 1.6.1 up to 2.0.0-preview :
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails with spark-2.0.0 with the exception : 
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}
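A workaround sketch, not a fix: give the join keys distinct names before the join so dropDuplicates can address them without the qualified "dfa.id" form:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dedup-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Rename the key columns so the joined result has unambiguous names.
val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").withColumnRenamed("id", "dfa_id")
val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").withColumnRenamed("id", "dfb_id")

dfa.join(dfb, dfa("dfa_id") === dfb("dfb_id"))
  .dropDuplicates(Array("dfa_id", "dfb_id"))
  .show()
{code}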



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join

2017-03-29 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947732#comment-15947732
 ] 

sam elamin commented on SPARK-16938:


[~cloud_fan] I am happy to pick up this ticket; can someone assign it to me, 
please?

> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> Found a change of behavior on spark-2.0.0, which breaks a query in our code 
> base.
> The following works on previous spark versions, 1.6.1 up to 2.0.0-preview :
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails with spark-2.0.0 with the exception : 
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't

2017-03-29 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947726#comment-15947726
 ] 

Bo Meng commented on SPARK-20145:
-

From the current code, I can see that builtinFunctions uses an exact match for 
the lookup (the "range" key is stored in all lowercase).
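Illustrative only, not the actual ResolveTableValuedFunctions code: if the registry keys are stored lowercase, normalizing the looked-up name the same way makes the match case-insensitive.

{code}
// Toy registry keyed the way described above: all-lowercase names.
val builtinFunctions: Map[String, String] = Map("range" -> "RangeTableValuedFunction")

// Lowercasing the requested name before the lookup makes it case-insensitive.
def resolve(name: String): Option[String] = builtinFunctions.get(name.toLowerCase)

assert(resolve("RANGE") == resolve("range"))  // both hit the same entry
{code}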

> "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
> 
>
> Key: SPARK-20145
> URL: https://issues.apache.org/jira/browse/SPARK-20145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> Executed at clean tip of the master branch, with all default settings:
> scala> spark.sql("SELECT * FROM range(1)")
> res1: org.apache.spark.sql.DataFrame = [id: bigint]
> scala> spark.sql("SELECT * FROM RANGE(1)")
> org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a 
> table-valued function; line 1 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
> ...
> I believe it should be case insensitive?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19955) Update run-tests to support conda

2017-03-29 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-19955.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17355
[https://github.com/apache/spark/pull/17355]

> Update run-tests to support conda
> -
>
> Key: SPARK-19955
> URL: https://issues.apache.org/jira/browse/SPARK-19955
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 2.1.1, 2.2.0
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> The current test scripts only look at the system Python. On the Jenkins workers 
> we also have Conda installed; we should support looking for Python versions 
> in Conda and testing with those.
> This could unblock some of the 2.6 deprecation work and more easily enable 
> testing of pip packaging.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Add task metric to report spill time

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947678#comment-15947678
 ] 

Apache Spark commented on SPARK-3577:
-

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/17471

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
> Attachments: spill_size.jpg
>
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19955) Update run-tests to support conda

2017-03-29 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-19955:
---

Assignee: holdenk

> Update run-tests to support conda
> -
>
> Key: SPARK-19955
> URL: https://issues.apache.org/jira/browse/SPARK-19955
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 2.1.1, 2.2.0
>Reporter: holdenk
>Assignee: holdenk
>
> The current test scripts only look at the system Python. On the Jenkins workers 
> we also have Conda installed; we should support looking for Python versions 
> in Conda and testing with those.
> This could unblock some of the 2.6 deprecation work and more easily enable 
> testing of pip packaging.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Add task metric to report spill time

2017-03-29 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947669#comment-15947669
 ] 

Sital Kedia commented on SPARK-3577:


I am making a change to report the correct spill data size on disk.

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
> Attachments: spill_size.jpg
>
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20146:


Assignee: (was: Apache Spark)

> Column comment information is missing for Thrift Server's TableSchema
> -
>
> Key: SPARK-20146
> URL: https://issues.apache.org/jira/browse/SPARK-20146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bo Meng
>Priority: Minor
>
> I found this issue while doing some tests against Thrift Server.
> The column comments information were missing while querying the TableSchema. 
> Currently, all the comments were ignored.
> I will post a fix shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947665#comment-15947665
 ] 

Apache Spark commented on SPARK-20146:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17470

> Column comment information is missing for Thrift Server's TableSchema
> -
>
> Key: SPARK-20146
> URL: https://issues.apache.org/jira/browse/SPARK-20146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bo Meng
>Priority: Minor
>
> I found this issue while doing some tests against Thrift Server.
> The column comments information were missing while querying the TableSchema. 
> Currently, all the comments were ignored.
> I will post a fix shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20146:


Assignee: Apache Spark

> Column comment information is missing for Thrift Server's TableSchema
> -
>
> Key: SPARK-20146
> URL: https://issues.apache.org/jira/browse/SPARK-20146
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bo Meng
>Assignee: Apache Spark
>Priority: Minor
>
> I found this issue while doing some tests against Thrift Server.
> The column comments information were missing while querying the TableSchema. 
> Currently, all the comments were ignored.
> I will post a fix shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20146) Column comment information is missing for Thrift Server's TableSchema

2017-03-29 Thread Bo Meng (JIRA)
Bo Meng created SPARK-20146:
---

 Summary: Column comment information is missing for Thrift Server's 
TableSchema
 Key: SPARK-20146
 URL: https://issues.apache.org/jira/browse/SPARK-20146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Bo Meng
Priority: Minor


I found this issue while doing some tests against the Thrift Server.

The column comment information is missing when querying the TableSchema; 
currently, all comments are ignored.

I will post a fix shortly.
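For context, a sketch of where a column comment comes from on the Spark side (the table name is made up); the report here is that this comment does not make it into the Thrift Server's TableSchema:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("comment-demo").master("local[*]").getOrCreate()

spark.sql("CREATE TABLE demo_comments (id INT COMMENT 'primary key') USING parquet")
spark.sql("DESCRIBE demo_comments").show(truncate = false)  // the comment column is populated here
{code}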



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder

2017-03-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947627#comment-15947627
 ] 

Josh Rosen commented on SPARK-18692:


We can't get the full Jekyll doc build running until we have Jekyll installed 
on all workers, but the extra code to just test unidoc isn't that much:

{code}
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 04035b3..46d6b8a 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -344,6 +344,19 @@ def build_spark_sbt(hadoop_version):
     exec_sbt(profiles_and_goals)
 
 
+def build_spark_unidoc_sbt(hadoop_version):
+    set_title_and_block("Building Unidoc API Documentation", "BLOCK_DOCUMENTATION")
+    # Enable all of the profiles for the build:
+    build_profiles = get_hadoop_profiles(hadoop_version) + modules.root.build_profile_flags
+    sbt_goals = ["unidoc"]
+    profiles_and_goals = build_profiles + sbt_goals
+
+    print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
+          " ".join(profiles_and_goals))
+
+    exec_sbt(profiles_and_goals)
+
+
 def build_spark_assembly_sbt(hadoop_version):
     # Enable all of the profiles for the build:
     build_profiles = get_hadoop_profiles(hadoop_version) + modules.root.build_profile_flags
@@ -576,6 +589,8 @@ def main():
     # Since we did not build assembly/package before running dev/mima, we need to
     # do it here because the tests still rely on it; see SPARK-13294 for details.
     build_spark_assembly_sbt(hadoop_version)
+    # Make sure that Java and Scala API documentation can be generated
+    build_spark_unidoc_sbt(hadoop_version)
 
     # run the test suites
     run_scala_tests(build_tool, hadoop_version, test_modules, excluded_tags)
{code}

On my laptop this added about 1.5 minutes of extra run time. One problem that I 
noticed was that Unidoc appeared to be processing test sources: if we can find 
a way to exclude those from being processed in the first place then that might 
significantly speed things up.

It turns out that it's also possible to disable Java 8's strict doc validation, 
so we could consider that as well.

The master builder and PR builder should both be running Java 8 right now. The 
dedicated doc builder jobs are still using Java 7 (for convoluted legacy 
reasons) but I'll push a conf change to fix that.

Assuming that we want to use the stricter validation: [~hyukjin.kwon], could 
you help to fix the current Javadoc breaks and include the above diff to test 
the unidoc as part of our dev/run-tests process? I'll be happy to help review 
and merge this fix.

> Test Java 8 unidoc build on Jenkins master builder
> --
>
> Key: SPARK-18692
> URL: https://issues.apache.org/jira/browse/SPARK-18692
> Project: Spark
>  Issue Type: Test
>  Components: Build, Documentation
>Reporter: Joseph K. Bradley
>  Labels: jenkins
>
> [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break.  It 
> would be great to add this build to the Spark master builder on Jenkins to 
> make it easier to identify PRs which break doc builds.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20132) Add documentation for column string functions

2017-03-29 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20132:
--
Comment: was deleted

(was: I have a commit with the documentation: 
https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b

I will make a more formal PR tonight.)

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20132) Add documentation for column string functions

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20132:


Assignee: (was: Apache Spark)

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20132) Add documentation for column string functions

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947538#comment-15947538
 ] 

Apache Spark commented on SPARK-20132:
--

User 'map222' has created a pull request for this issue:
https://github.com/apache/spark/pull/17469

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20132) Add documentation for column string functions

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20132:


Assignee: Apache Spark

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Assignee: Apache Spark
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20059) HbaseCredentialProvider uses wrong classloader

2017-03-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20059.

   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.2.0
   2.1.1

> HbaseCredentialProvider uses wrong classloader
> --
>
> Key: SPARK-20059
> URL: https://issues.apache.org/jira/browse/SPARK-20059
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.1.1, 2.2.0
>
>
> {{HBaseCredentialProvider}} uses the system classloader instead of the child 
> classloader, which makes HBase jars specified with {{--jars}} fail to work, 
> so we should use the right classloader here.
> Besides, in yarn cluster mode the jars specified with {{--jars}} are not added 
> to the client's class path, which makes it fail to load HBase jars and issue 
> tokens in our scenario. Also, some customized credential providers cannot be 
> registered with the client.
> So here I will fix these two issues.
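A sketch of the classloader point, illustrative rather than the provider's actual code; the HBase class name is just an example and only resolves if the jar is on the child/context classpath (e.g. shipped via --jars):

{code}
// Prefer the context classloader, which sees --jars, over the system classloader.
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)

val hbaseConfClass = Class.forName("org.apache.hadoop.hbase.HBaseConfiguration", true, loader)
println(hbaseConfClass.getName)
{code}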



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join

2017-03-29 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947485#comment-15947485
 ] 

Dongjoon Hyun commented on SPARK-16938:
---

Sure, go ahead. I'm not working on this.

> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> Found a change of behavior on spark-2.0.0, which breaks a query in our code 
> base.
> The following works on previous spark versions, 1.6.1 up to 2.0.0-preview :
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails with spark-2.0.0 with the exception : 
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}
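
A possible workaround (not a fix for the resolution behavior itself), reusing {{dfa}} and {{dfb}} from the snippet above: rename the join keys before the join so the joined schema has no ambiguous column, then deduplicate on the plain names.

{code}
// Rename the join keys so the joined schema has no ambiguous "id" column,
// then deduplicate on the unqualified names.
val dfa2 = dfa.withColumnRenamed("id", "dfa_id")
val dfb2 = dfb.withColumnRenamed("id", "dfb_id")
dfa2.join(dfb2, dfa2("dfa_id") === dfb2("dfb_id"))
  .dropDuplicates(Array("dfa_id", "dfb_id"))
{code}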



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join

2017-03-29 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947395#comment-15947395
 ] 

sam elamin commented on SPARK-16938:


[~dongjoon] I can pick this up if you don't mind. Are you still not working on 
it? 

> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> Found a change of behavior in spark-2.0.0 which breaks a query in our code 
> base.
> The following works on previous Spark versions, 1.6.1 up to 2.0.0-preview:
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails on spark-2.0.0 with the following exception: 
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't

2017-03-29 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20145:
-

 Summary: "SELECT * FROM range(1)" works, but "SELECT * FROM 
RANGE(1)" doesn't
 Key: SPARK-20145
 URL: https://issues.apache.org/jira/browse/SPARK-20145
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Juliusz Sompolski


Executed at the clean tip of the master branch, with all default settings:

scala> spark.sql("SELECT * FROM range(1)")
res1: org.apache.spark.sql.DataFrame = [id: bigint]

scala> spark.sql("SELECT * FROM RANGE(1)")
org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a 
table-valued function; line 1 pos 14
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
...

I believe the lookup should be case-insensitive.
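
Illustrative only (not the actual {{ResolveTableValuedFunctions}} code): normalizing the function name before the lookup would make resolution case-insensitive, e.g.:

{code}
import java.util.Locale

// Hypothetical registry keyed by lower-cased names; "range", "Range" and
// "RANGE" all resolve to the same entry.
val builtinTVFs: Map[String, String] = Map("range" -> "range(start, end, step)")

def resolveTVF(name: String): Option[String] =
  builtinTVFs.get(name.toLowerCase(Locale.ROOT))

resolveTVF("RANGE")  // Some(...), instead of an AnalysisException
{code}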



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20143) DataType.fromJson should throw an exception with better message

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20143:


Assignee: (was: Apache Spark)

> DataType.fromJson should throw an exception with better message
> ---
>
> Key: SPARK-20143
> URL: https://issues.apache.org/jira/browse/SPARK-20143
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, 
> {code}
> scala> import org.apache.spark.sql.types.DataType
> import org.apache.spark.sql.types.DataType
> scala> DataType.fromJson( abcd)
> java.util.NoSuchElementException: key not found: abcd
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> scala> DataType.fromJson( """{"abcd":"a"}""")
> scala.MatchError: JObject(List((abcd,JString(a (of class 
> org.json4s.JsonAST$JObject)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""")
> scala.MatchError: JObject(List((a,JInt(123 (of class 
> org.json4s.JsonAST$JObject)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> {code}
> {{DataType.fromJson}} throws unreadable error messages for malformed JSON 
> input. We could raise a clearer exception instead of leaking 
> {{scala.MatchError}} or {{java.util.NoSuchElementException}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20143) DataType.fromJson should throw an exception with better message

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20143:


Assignee: Apache Spark

> DataType.fromJson should throw an exception with better message
> ---
>
> Key: SPARK-20143
> URL: https://issues.apache.org/jira/browse/SPARK-20143
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, 
> {code}
> scala> import org.apache.spark.sql.types.DataType
> import org.apache.spark.sql.types.DataType
> scala> DataType.fromJson( abcd)
> java.util.NoSuchElementException: key not found: abcd
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> scala> DataType.fromJson( """{"abcd":"a"}""")
> scala.MatchError: JObject(List((abcd,JString(a (of class 
> org.json4s.JsonAST$JObject)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""")
> scala.MatchError: JObject(List((a,JInt(123 (of class 
> org.json4s.JsonAST$JObject)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> {code}
> {{DataType.fromJson}} throws unreadable error messages for malformed JSON 
> input. We could raise a clearer exception instead of leaking 
> {{scala.MatchError}} or {{java.util.NoSuchElementException}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20143) DataType.fromJson should throw an exception with better message

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947237#comment-15947237
 ] 

Apache Spark commented on SPARK-20143:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17468

> DataType.fromJson should throw an exception with better message
> ---
>
> Key: SPARK-20143
> URL: https://issues.apache.org/jira/browse/SPARK-20143
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, 
> {code}
> scala> import org.apache.spark.sql.types.DataType
> import org.apache.spark.sql.types.DataType
> scala> DataType.fromJson( abcd)
> java.util.NoSuchElementException: key not found: abcd
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> scala> DataType.fromJson( """{"abcd":"a"}""")
> scala.MatchError: JObject(List((abcd,JString(a (of class 
> org.json4s.JsonAST$JObject)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""")
> scala.MatchError: JObject(List((a,JInt(123 (of class 
> org.json4s.JsonAST$JObject)
>   at 
> org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
>   at 
> org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
>   ... 48 elided
> {code}
> {{DataType.fromJson}} throws unreadable error messages for malformed JSON 
> input. We could raise a clearer exception instead of leaking 
> {{scala.MatchError}} or {{java.util.NoSuchElementException}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data

2017-03-29 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-20144:
---
Description: 
Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
dataframe is not the same as the ordering of rows in the dataframe that the 
parquet file was produced from. 

This is because FileSourceStrategy.scala combines the parquet files into fewer 
partitions and also reorders them. This breaks our workflows because they 
assume the ordering of the data. 

Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is also 
an issue in 2.1.

  was:
Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
when we read parquet files in 2.0.2, the ordering of rows in the resulting 
dataframe is not the same as the ordering of rows in the dataframe that the 
parquet file was reproduced with. 

This is because FileSourceStrategy.scala combines the parquet files into fewer 
partitions and also reordered them. This breaks our workout because they assume 
the ordering of the data. 

Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1.


> spark.read.parquet no longer maintains the ordering of the data
> --
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced from. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data

2017-03-29 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-20144:
---
Description: 
Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
when we read parquet files in 2.0.2, the ordering of rows in the resulting 
dataframe is not the same as the ordering of rows in the dataframe that the 
parquet file was reproduced with. 

This is because FileSourceStrategy.scala combines the parquet files into fewer 
partitions and also reordered them. This breaks our workout because they assume 
the ordering of the data. 

Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1.

  was:Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we 
found is when we read parquet files in 2.0.2, the ordering of rows in the 
resulting dataframe is not the same as the ordering of rows in the dataframe 
that the parquet file was reproduced with. This is because 
FileSourceStrategy.scala combines the parquet files into fewer partitions and 
also reordered them. This breaks our workout because they assume the ordering 
of the data. Is this considered a bug? Also FileSourceStrategy and 
FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this 
is an issue with 2.1.


> spark.read.parquet no longer maintains the ordering of the data
> --
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workout because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data

2017-03-29 Thread Li Jin (JIRA)
Li Jin created SPARK-20144:
--

 Summary: spark.read.parquet no longer maintains the ordering of the 
data
 Key: SPARK-20144
 URL: https://issues.apache.org/jira/browse/SPARK-20144
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Li Jin


Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
that when we read parquet files in 2.0.2, the ordering of rows in the resulting 
dataframe is not the same as the ordering of rows in the dataframe that the 
parquet file was produced from. This is because FileSourceStrategy.scala 
combines the parquet files into fewer partitions and also reorders them. This 
breaks our workflows because they assume the ordering of the data. Is this 
considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite 
a bit from 2.0.2 to 2.1, so we are not sure whether this is also an issue in 2.1.
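
One way users can make the ordering explicit instead of relying on file order (a workaround sketch under the assumption that {{df}} and {{spark}} are in scope; not something guaranteed by Spark):

{code}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Persist an explicit ordering column when writing...
val withOrder = df.withColumn("row_order", monotonically_increasing_id())
withOrder.write.parquet("/tmp/data_with_order")

// ...and sort on it after reading, instead of relying on file/partition order.
val restored = spark.read.parquet("/tmp/data_with_order").orderBy("row_order")
{code}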



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20143) DataType.fromJson should throw an exception with better message

2017-03-29 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20143:


 Summary: DataType.fromJson should throw an exception with better 
message
 Key: SPARK-20143
 URL: https://issues.apache.org/jira/browse/SPARK-20143
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, 

{code}
scala> import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.types.DataType

scala> DataType.fromJson( abcd)
java.util.NoSuchElementException: key not found: abcd
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at org.apache.spark.sql.types.DataType$.nameToType(DataType.scala:118)
  at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:132)
  at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
  ... 48 elided

scala> DataType.fromJson( """{"abcd":"a"}""")
scala.MatchError: JObject(List((abcd,JString(a (of class 
org.json4s.JsonAST$JObject)
  at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:130)
  at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
  ... 48 elided

scala> DataType.fromJson( """{"fields": [{"a":123}], "type": "struct"}""")
scala.MatchError: JObject(List((a,JInt(123 (of class 
org.json4s.JsonAST$JObject)
  at 
org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:169)
  at 
org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
  at 
org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:150)
  at scala.collection.immutable.List.map(List.scala:273)
  at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:150)
  at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:104)
  ... 48 elided
{code}

{{DataType.fromJson}} throws unreadable error messages for malformed JSON input. 
We could raise a clearer exception instead of leaking {{scala.MatchError}} or 
{{java.util.NoSuchElementException}}.
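
A hedged sketch of the kind of improvement being proposed, wrapping the existing parser rather than reproducing the actual patch:

{code}
import org.apache.spark.sql.types.DataType
import scala.util.control.NonFatal

// Illustration only: turn MatchError / NoSuchElementException into a readable
// IllegalArgumentException that echoes the offending JSON.
def fromJsonOrExplain(json: String): DataType =
  try DataType.fromJson(json) catch {
    case NonFatal(e) =>
      throw new IllegalArgumentException(
        s"Failed to convert the JSON string '$json' to a data type.", e)
  }
{code}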



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20142) Move RewriteDistinctAggregates later into query execution

2017-03-29 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20142:
-

 Summary: Move RewriteDistinctAggregates later into query execution
 Key: SPARK-20142
 URL: https://issues.apache.org/jira/browse/SPARK-20142
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski
Priority: Minor


The rewrite of distinct aggregates complicates their analysis by later 
optimizer rules.
Move the rewrite to a later phase of query execution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-03-29 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947114#comment-15947114
 ] 

Nick Pentreath commented on SPARK-14174:


The actual fix in the PR is pretty small - essentially just adding an 
{{rdd.sample}} call (similar to the old {{mllib}} gradient descent impl). So if 
we can see some good speed improvements on a relatively large class of input 
datasets, this seems like an easy win. From the performance tests above it 
seems like there's a significant win even for low-dimensional vectors. For 
higher dimensions the improvement may be as large or perhaps larger.

[~podongfeng] it may be best to add a few different cases to the performance 
tests to illustrate the behavior across input shapes (and where a case shows no 
improvement, we should document that):

# small dimension, dense
# high dimension, dense
# small dimension, sparse
# high dimension, sparse

[~rnowling] do you have time to check out the PR here? It seems similar in 
spirit to what you had done and just uses the built-in RDD sampling (which I 
think [~derrickburns] mentioned in SPARK-2308).
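
For illustration, a simplified, self-contained sketch of the mini-batch idea being discussed (sample a fraction per step, then average per cluster); this is not the PR's implementation and uses plain {{Array[Double]}} points rather than MLlib vectors:

{code}
import org.apache.spark.rdd.RDD

// Squared Euclidean distance between two dense points.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the closest current center for a point.
def closest(centers: Array[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy(i => squaredDist(centers(i), p))

// One mini-batch step: sample a fraction of the data, assign the sample to the
// current centers, and move each non-empty center to the mean of its sample.
def miniBatchStep(
    data: RDD[Array[Double]],
    centers: Array[Array[Double]],
    fraction: Double,
    seed: Long): Array[Array[Double]] = {
  val batch = data.sample(withReplacement = false, fraction, seed)
  val sums = batch
    .map(p => (closest(centers, p), (p, 1L)))
    .reduceByKey { case ((s1, n1), (s2, n2)) =>
      (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2)
    }
    .collectAsMap()
  centers.indices.map { i =>
    sums.get(i).map { case (s, n) => s.map(_ / n) }.getOrElse(centers(i))
  }.toArray
}
{code}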

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum number of iterations (10).
> We can see that the WSSSEs are almost the same, while their speeds differ 
> significantly:
> {code}
> KMeans                              2876 sec
> MiniBatch KMeans (fraction=0.1)      263 sec
> MiniBatch KMeans (fraction=0.01)      90 sec
> {code}
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above has 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20141) jdbc query gives ORA-00903

2017-03-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947106#comment-15947106
 ] 

Sean Owen commented on SPARK-20141:
---

That sounds like an Oracle error. There's no detail that suggests there is a 
Spark error here.

> jdbc query gives ORA-00903
> --
>
> Key: SPARK-20141
> URL: https://issues.apache.org/jira/browse/SPARK-20141
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
> Environment: Windows7
>Reporter: sergio
>  Labels: windows
> Attachments: exception.png
>
>
> Error while querying an external Oracle database. 
> It works this way, and then I can work with jdbcDF:
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
>   "user" -> "my_login",
>   "password" -> "my_password",
>   "dbtable" -> "siebel.table1")).load() 
> but when trying to send a query, it fails: 
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
>   "user" -> "my_login",
>   "password" -> "my_password",
>   "dbtable" -> "select * from siebel.table1 where call_id= 
> '1-1TMC4D4U'")).load() 
> This query works fine in SQLDeveloper, or when I registerTempTable, but when 
> I put a direct query instead of schema.table, it gives this error:
> java.sql.SQLSyntaxErrorException: ORA-00903:
> It looks like Spark sends the wrong query.
> I tried everything in "JDBC To Other Databases":
> http://spark.apache.org/docs/latest/sql-programming-guide.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947099#comment-15947099
 ] 

Apache Spark commented on SPARK-20140:
--

User 'yssharma' has created a pull request for this issue:
https://github.com/apache/spark/pull/17467

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, recovery
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for kinesis checkpoint recovery when the kinesis 
> backed rdd is huge.
> The following happens in a typical Kinesis recovery:
> - kinesis throttles large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - kinesis throttles per second, the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20140:


Assignee: Apache Spark

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Apache Spark
>  Labels: kinesis, recovery
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for kinesis checkpoint recovery when the kinesis 
> backed rdd is huge.
> The following happens in a typical Kinesis recovery:
> - kinesis throttles large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - kinesis throttles per second, the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20140:


Assignee: (was: Apache Spark)

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, recovery
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for kinesis checkpoint recovery when the kinesis 
> backed rdd is huge.
> The following happens in a typical Kinesis recovery:
> - kinesis throttles large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - kinesis throttles per second, the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20141) jdbc query gives ORA-00903

2017-03-29 Thread sergio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sergio updated SPARK-20141:
---
Attachment: exception.png

> jdbc query gives ORA-00903
> --
>
> Key: SPARK-20141
> URL: https://issues.apache.org/jira/browse/SPARK-20141
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
> Environment: Windows7
>Reporter: sergio
>  Labels: windows
> Attachments: exception.png
>
>
> Error while querying an external Oracle database. 
> It works this way, and then I can work with jdbcDF:
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
>   "user" -> "my_login",
>   "password" -> "my_password",
>   "dbtable" -> "siebel.table1")).load() 
> but when trying to send a query, it fails: 
> val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
>   "user" -> "my_login",
>   "password" -> "my_password",
>   "dbtable" -> "select * from siebel.table1 where call_id= 
> '1-1TMC4D4U'")).load() 
> This query works fine in SQLDeveloper, or when I registerTempTable, but when 
> I put a direct query instead of schema.table, it gives this error:
> java.sql.SQLSyntaxErrorException: ORA-00903:
> It looks like Spark sends the wrong query.
> I tried everything in "JDBC To Other Databases":
> http://spark.apache.org/docs/latest/sql-programming-guide.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20141) jdbc query gives ORA-00903

2017-03-29 Thread sergio (JIRA)
sergio created SPARK-20141:
--

 Summary: jdbc query gives ORA-00903
 Key: SPARK-20141
 URL: https://issues.apache.org/jira/browse/SPARK-20141
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.2
 Environment: Windows7
Reporter: sergio


Error while querying an external Oracle database. 
It works this way, and then I can work with jdbcDF:

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
  "user" -> "my_login",
  "password" -> "my_password",
  "dbtable" -> "siebel.table1")).load() 

but when trying to send a query, it fails: 

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
  "user" -> "my_login",
  "password" -> "my_password",
  "dbtable" -> "select * from siebel.table1 where call_id= 
'1-1TMC4D4U'")).load() 

This query works fine in SQLDeveloper, or when I registerTempTable, but when I 
put a direct query instead of schema.table, it gives this error:
java.sql.SQLSyntaxErrorException: ORA-00903:

It looks like Spark sends the wrong query.
I tried everything in "JDBC To Other Databases":
http://spark.apache.org/docs/latest/sql-programming-guide.html
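
For what it's worth, the JDBC source treats {{dbtable}} as a table expression rather than a full statement, so a bare SELECT usually produces invalid SQL. Passing the query as a parenthesized derived table generally works (a sketch based on that assumption):

{code}
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:@//crmdbmr.cgs.comp.ru:1521/crmrmir",
      "user" -> "my_login",
      "password" -> "my_password",
      // wrap the query as a derived table so "SELECT ... FROM (...) t" is valid
      "dbtable" -> "(select * from siebel.table1 where call_id = '1-1TMC4D4U') t")).load()
{code}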



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished

2017-03-29 Thread Etti Gur (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Etti Gur updated SPARK-20139:
-
Description: 
Spark UI reports partial success for completed stage while log shows all tasks 
are finished - i.e.:
We have a stage that is presented under completed stages on spark UI,
but the successful tasks are shown like so: (146372/524964) not as you'd expect 
(524964/524964)
Looking at the application master log shows all tasks in that stage are 
successful:
17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 (TID 
522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) 
(524963/524964)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 (TID 
537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) (20234/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 (TID 
537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) (20235/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 (TID 
540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) (20236/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 (TID 
544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) (20237/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 (TID 
544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) (20238/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 (TID 
524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) (524964/524964)

Also in the log we get an error:

17/03/29 08:24:16 ERROR LiveListenerBus: Dropping SparkListenerEvent because no 
remaining room in event queue. This likely means one of the SparkListeners is 
too slow and cannot keep up with the rate at which tasks are being started by 
the scheduler.

It looks like the stage did complete all of its tasks, but the UI shows as if 
not all of them finished.

  was:
Spark UI reports partial success for completed stage while log shows all tasks 
are finished - i.e.:
We have a stage that is presented under completed stages on spark UI,
but the successful tasks are shown like so: (146372/524964) not as you'd expect 
(524964/524964)
Looking at the application master log shows all tasks in that stage are 
successful:
17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 (TID 
522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) 
(524963/524964)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 (TID 
537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) (20234/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 (TID 
537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) (20235/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 (TID 
540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) (20236/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 (TID 
544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) (20237/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 (TID 
544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) (20238/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 (TID 
524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) 
*(524964/524964)*

This looks like the stage is indeed completed with all its tasks but UI shows 
like not all tasks really finished.
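
The {{LiveListenerBus}} error in the updated description means UI-updating events were dropped, which would explain the stale task counters. One common mitigation (a suggestion, not something discussed in this ticket) is to enlarge the listener bus event queue, e.g.:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // enlarge the listener bus queue so UI-updating events are less likely to drop
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
{code}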


> Spark UI reports partial success for completed stage while log shows all 
> tasks are finished
> ---
>
> Key: SPARK-20139
> URL: https://issues.apache.org/jira/browse/SPARK-20139
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Etti Gur
> Attachments: screenshot-1.png
>
>
> Spark UI reports partial success for completed stage while log shows all 
> tasks are finished - i.e.:
> We have a stage that is presented under completed stages on spark UI,
> but the successful tasks are shown like so: (146372/524964) not as you'd 
> expect (524964/524964)
> Looking at the application master log shows all tasks in that stage are 
> successful:
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 
> (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) 
> (524963/524964)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 
> (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) 
> (20234/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 
> (TID 537429) in 

[jira] [Resolved] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

2017-03-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19556.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17295
[https://github.com/apache/spark/pull/17295]

> Broadcast data is not encrypted when I/O encryption is on
> -
>
> Key: SPARK-19556
> URL: https://issues.apache.org/jira/browse/SPARK-19556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.2.0
>
>
> {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to 
> write and read data:
> {code}
>   if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(s"Failed to store $pieceId of $broadcastId 
> in local BlockManager")
>   }
> {code}
> {code}
>   bm.getLocalBytes(pieceId) match {
> case Some(block) =>
>   blocks(pid) = block
>   releaseLock(pieceId)
> case None =>
>   bm.getRemoteBytes(pieceId) match {
> case Some(b) =>
>   if (checksumEnabled) {
> val sum = calcChecksum(b.chunks(0))
> if (sum != checksums(pid)) {
>   throw new SparkException(s"corrupt remote block $pieceId of 
> $broadcastId:" +
> s" $sum != ${checksums(pid)}")
> }
>   }
>   // We found the block from remote executors/driver's 
> BlockManager, so put the block
>   // in this executor's BlockManager.
>   if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(
>   s"Failed to store $pieceId of $broadcastId in local 
> BlockManager")
>   }
>   blocks(pid) = b
> case None =>
>   throw new SparkException(s"Failed to get $pieceId of 
> $broadcastId")
>   }
>   }
> {code}
> The thing these block manager methods have in common is that they bypass the 
> encryption code; so broadcast data is stored unencrypted in the block 
> manager, causing unencrypted data to be written to disk if those blocks need 
> to be evicted from memory.
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to 
> fix the block manager so that:
> - data stored in memory is not encrypted
> - data written to disk is encrypted
> This would simplify the code paths that use BlockManager / SerializerManager 
> APIs (e.g. see SPARK-19520), but requires some tricky changes inside the 
> BlockManager to still be able to use file channels to avoid reading whole 
> blocks back into memory so they can be decrypted.
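
As a generic illustration of the proposed design (keep in-memory bytes unencrypted, encrypt only on the disk path), using plain {{javax.crypto}} rather than Spark's own stream utilities:

{code}
import java.io.{FileOutputStream, OutputStream}
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Wrap only the on-disk stream with a cipher; in-memory buffers stay unencrypted.
def encryptedDiskStream(path: String, key: Array[Byte], iv: Array[Byte]): OutputStream = {
  val cipher = Cipher.getInstance("AES/CTR/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
  new CipherOutputStream(new FileOutputStream(path), cipher)
}
{code}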



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-29 Thread Tomas Pranckevicius (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934218#comment-15934218
 ] 

Tomas Pranckevicius edited comment on SPARK-12261 at 3/29/17 12:28 PM:
---

Thank you Shea for the details. These solutions will not necessarily apply to 
my situation, but one thing is clear - they do not solve my problem. 
There is a new project called IBM SystemML that might solve these issues, 
because the current version of MLlib most probably does not support automatic 
optimization based on data and cluster characteristics to ensure efficiency and 
scalability. So let's see what happens with Apache SystemML. More: 
http://systemml.apache.org


was (Author: tomas pranckevicius):
Thank you Shea for the details. These solutions will not necessarily apply to 
my situation, but this is clear - these solutions does not solve my problem.

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text (over 100 MB) file via textFile in pyspark; when 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

2017-03-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19556:
---

Assignee: Marcelo Vanzin

> Broadcast data is not encrypted when I/O encryption is on
> -
>
> Key: SPARK-19556
> URL: https://issues.apache.org/jira/browse/SPARK-19556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to 
> write and read data:
> {code}
>   if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(s"Failed to store $pieceId of $broadcastId 
> in local BlockManager")
>   }
> {code}
> {code}
>   bm.getLocalBytes(pieceId) match {
> case Some(block) =>
>   blocks(pid) = block
>   releaseLock(pieceId)
> case None =>
>   bm.getRemoteBytes(pieceId) match {
> case Some(b) =>
>   if (checksumEnabled) {
> val sum = calcChecksum(b.chunks(0))
> if (sum != checksums(pid)) {
>   throw new SparkException(s"corrupt remote block $pieceId of 
> $broadcastId:" +
> s" $sum != ${checksums(pid)}")
> }
>   }
>   // We found the block from remote executors/driver's 
> BlockManager, so put the block
>   // in this executor's BlockManager.
>   if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(
>   s"Failed to store $pieceId of $broadcastId in local 
> BlockManager")
>   }
>   blocks(pid) = b
> case None =>
>   throw new SparkException(s"Failed to get $pieceId of 
> $broadcastId")
>   }
>   }
> {code}
> The thing these block manager methods have in common is that they bypass the 
> encryption code; so broadcast data is stored unencrypted in the block 
> manager, causing unencrypted data to be written to disk if those blocks need 
> to be evicted from memory.
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to 
> fix the block manager so that:
> - data stored in memory is not encrypted
> - data written to disk is encrypted
> This would simplify the code paths that use BlockManager / SerializerManager 
> APIs (e.g. see SPARK-19520), but requires some tricky changes inside the 
> BlockManager to still be able to use file channels to avoid reading whole 
> blocks back into memory so they can be decrypted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-03-29 Thread Yash Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947015#comment-15947015
 ] 

Yash Sharma commented on SPARK-20140:
-

Proposing : https://github.com/apache/spark/pull/17467
Please review.

> Remove hardcoded kinesis retry wait and max retries
> ---
>
> Key: SPARK-20140
> URL: https://issues.apache.org/jira/browse/SPARK-20140
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, recovery
>
> The pull request proposes to remove the hardcoded values for Amazon Kinesis 
> - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.
> This change is critical for kinesis checkpoint recovery when the kinesis 
> backed rdd is huge.
> The following happens in a typical Kinesis recovery:
> - kinesis throttles large number of requests while recovering
> - retries in case of throttling are not able to recover due to the small wait 
> period
> - kinesis throttles per second, the wait period should be configurable for 
> recovery
> The patch picks the spark kinesis configs from:
> - spark.streaming.kinesis.retry.wait.time
> - spark.streaming.kinesis.retry.max.attempts



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries

2017-03-29 Thread Yash Sharma (JIRA)
Yash Sharma created SPARK-20140:
---

 Summary: Remove hardcoded kinesis retry wait and max retries
 Key: SPARK-20140
 URL: https://issues.apache.org/jira/browse/SPARK-20140
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.0
Reporter: Yash Sharma


The pull request proposes to remove the hardcoded values for Amazon Kinesis - 
MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES.

This change is critical for kinesis checkpoint recovery when the kinesis backed 
rdd is huge.
The following happens in a typical Kinesis recovery:
- kinesis throttles large number of requests while recovering
- retries in case of throttling are not able to recover due to the small wait 
period
- kinesis throttles per second, the wait period should be configurable for 
recovery

The patch picks the spark kinesis configs from:
- spark.streaming.kinesis.retry.wait.time
- spark.streaming.kinesis.retry.max.attempts
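
A sketch of how the proposed settings would be supplied; the config names come from this ticket, but the exact value formats are my assumption:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kinesis-recovery")
  // names from this ticket; values below are illustrative
  .set("spark.streaming.kinesis.retry.wait.time", "2000ms")
  .set("spark.streaming.kinesis.retry.max.attempts", "10")
{code}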



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished

2017-03-29 Thread Etti Gur (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Etti Gur updated SPARK-20139:
-
Attachment: screenshot-1.png

> Spark UI reports partial success for completed stage while log shows all 
> tasks are finished
> ---
>
> Key: SPARK-20139
> URL: https://issues.apache.org/jira/browse/SPARK-20139
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Etti Gur
> Attachments: screenshot-1.png
>
>
> Spark UI reports partial success for completed stage while log shows all 
> tasks are finished - i.e.:
> We have a stage that is presented under completed stages on spark UI,
> but the successful tasks are shown like so: (146372/524964) not as you'd 
> expect (524964/524964)
> Looking at the application master log shows all tasks in that stage are 
> successful:
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 
> (TID 522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) 
> (524963/524964)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 
> (TID 537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) 
> (20234/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 
> (TID 537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) 
> (20235/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 
> (TID 540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) 
> (20236/20262)
> 17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 
> (TID 544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) 
> (20237/20262)
> 17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 
> (TID 544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) 
> (20238/20262)
> 17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 
> (TID 524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) 
> *(524964/524964)*
> It looks like the stage did complete all of its tasks, but the UI shows as if 
> not all of them finished.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20139) Spark UI reports partial success for completed stage while log shows all tasks are finished

2017-03-29 Thread Etti Gur (JIRA)
Etti Gur created SPARK-20139:


 Summary: Spark UI reports partial success for completed stage 
while log shows all tasks are finished
 Key: SPARK-20139
 URL: https://issues.apache.org/jira/browse/SPARK-20139
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: Etti Gur


Spark UI reports partial success for completed stage while log shows all tasks 
are finished - i.e.:
We have a stage that is presented under completed stages on spark UI,
but the successful tasks are shown like so: (146372/524964) not as you'd expect 
(524964/524964)
Looking at the application master log shows all tasks in that stage are 
successful:
17/03/29 09:45:49 INFO TaskSetManager: Finished task 522973.0 in stage 0.0 (TID 
522973) in 1163910 ms on ip-10-1-15-34.ec2.internal (executor 116) 
(524963/524964)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12508.0 in stage 2.0 (TID 
537472) in 241250 ms on ip-10-1-15-14.ec2.internal (executor 38) (20234/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 12465.0 in stage 2.0 (TID 
537429) in 241994 ms on ip-10-1-15-106.ec2.internal (executor 133) (20235/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 15079.0 in stage 2.0 (TID 
540043) in 202889 ms on ip-10-1-15-173.ec2.internal (executor 295) (20236/20262)
17/03/29 09:45:49 INFO TaskSetManager: Finished task 19828.0 in stage 2.0 (TID 
544792) in 137845 ms on ip-10-1-15-147.ec2.internal (executor 43) (20237/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 19072.0 in stage 2.0 (TID 
544036) in 147363 ms on ip-10-1-15-19.ec2.internal (executor 175) (20238/20262)
17/03/29 09:45:50 INFO TaskSetManager: Finished task 524146.0 in stage 0.0 (TID 
524146) in 889950 ms on ip-10-1-15-72.ec2.internal (executor 74) 
*(524964/524964)*

It looks like the stage did complete all of its tasks, but the UI shows as if 
not all of them finished.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-03-29 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946882#comment-15946882
 ] 

Emlyn Corrin commented on SPARK-18971:
--

Will this fix go into Spark 2.1.1?

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details.
> You should be able to see a stack trace similar to the following in the executor 
> thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-20138) Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc

2017-03-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946818#comment-15946818
 ] 

Sean Owen commented on SPARK-20138:
---

I think the Spark imports were purposely excluded for brevity. However, there 
may be some cases where the imports aren't obvious because they aren't from 
Spark; adding those could be valuable.

> Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc
> ---
>
> Key: SPARK-20138
> URL: https://issues.apache.org/jira/browse/SPARK-20138
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Given [the question on 
> StackOverflow|http://stackoverflow.com/q/43089100/1305344] it seems it'd be 
> helpful to add imports to the snippets to make _some_ people's lives easier.
> {quote}
> When I try to load data using the second method in the link, I get the 
> following error.
> scala> val connectionProperties = new Properties()
> <console>:44: error: not found: type Properties
>        val connectionProperties = new Properties()
> {quote}






[jira] [Created] (SPARK-20138) Add imports to snippets in Spark SQL, DataFrames and Datasets Guide doc

2017-03-29 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20138:
---

 Summary: Add imports to snippets in Spark SQL, DataFrames and 
Datasets Guide doc
 Key: SPARK-20138
 URL: https://issues.apache.org/jira/browse/SPARK-20138
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Trivial


Given [the question on 
StackOverflow|http://stackoverflow.com/q/43089100/1305344] it seems it'd be 
helpful to add imports to the snippets to make _some_ people's lives easier.

{quote}
When I try to load data using the second method in the link, I get the 
following error.

scala> val connectionProperties = new Properties()
<console>:44: error: not found: type Properties
       val connectionProperties = new Properties()
{quote}
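
For illustration only (not part of the report): the error above comes from the 
non-Spark import that the guide's JDBC snippet leaves out. A minimal sketch, with 
placeholder connection values:

{code}
// The missing import; without it the REPL reports "not found: type Properties".
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "myuser")         // placeholder value
connectionProperties.put("password", "mypassword") // placeholder value

// The properties would then be passed to a JDBC read, e.g.:
// val jdbcDF = spark.read.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
{code}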






[jira] [Commented] (SPARK-20135) spark thriftserver2: no job running but containers not release on yarn

2017-03-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946748#comment-15946748
 ] 

Sean Owen commented on SPARK-20135:
---

There isn't enough detail here. It may be normal operation depending on your 
timeouts and settings. It isn't even clear you have enabled dynamic allocation. 
The mailing list is the right place to start with questions. 
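
For context, a sketch (not taken from this report) of the settings that govern 
whether and when idle executors are released under dynamic allocation on YARN; 
the key names are the standard Spark ones, the values are only illustrative:

{code}
// Illustrative SparkConf for dynamic allocation; whether containers are
// released depends on these timeouts.
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")               // external shuffle service is required on YARN
  .set("spark.dynamicAllocation.initialExecutors", "50")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")  // idle executors are released after this
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "300s") // defaults to infinity, so executors
                                                                    // holding cached blocks are never released
{code}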

> spark thriftserver2: no job running but containers not release on yarn
> --
>
> Key: SPARK-20135
> URL: https://issues.apache.org/jira/browse/SPARK-20135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark 2.0.1 with hadoop 2.6.0 
>Reporter: bruce xu
> Attachments: 0329-1.png, 0329-2.png, 0329-3.png
>
>
> I enabled the executor dynamic allocation feature, but it doesn't always work.
> I set the initial executor count to 50; after the job finished, the cores and 
> memory were not released.
> In the Spark web UI the active job/running task/stage count is 0, but the 
> executors page shows 1276 cores and 7288 active tasks.
> In the YARN web UI, the Thrift Server application still holds 639 running 
> containers without releasing them.
> This may be a bug.





