[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-31 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150091#comment-16150091
 ] 

Liang-Chi Hsieh commented on SPARK-21658:
-

Thanks [~hyukjin.kwon] and [~jerryshao].

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Chin Han Yu
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> It looks like {{na.replace}} is missing the default value {{None}} for {{value}}.
> Both docs say they are aliases:
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.
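
A minimal sketch of the kind of change the title describes, assuming a simple delegation from {{DataFrameNaFunctions.replace}} to {{DataFrame.replace}} (hypothetical code, not the actual patch in https://github.com/apache/spark/pull/18895):

{code}
# Hypothetical sketch only: give `value` a default of None so that
# df.na.replace({"Alice": "a"}) behaves the same as df.replace({"Alice": "a"}).
class DataFrameNaFunctions(object):
    def __init__(self, df):
        self.df = df

    def replace(self, to_replace, value=None, subset=None):
        # Delegate to DataFrame.replace, which already defaults `value` to None.
        return self.df.replace(to_replace, value, subset)
{code}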






[jira] [Commented] (SPARK-21859) SparkFiles.get failed on driver in yarn-cluster and yarn-client mode

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150088#comment-16150088
 ] 

Apache Spark commented on SPARK-21859:
--

User 'lgrcyanny' has created a pull request for this issue:
https://github.com/apache/spark/pull/19102

> SparkFiles.get failed on driver in yarn-cluster and yarn-client mode
> 
>
> Key: SPARK-21859
> URL: https://issues.apache.org/jira/browse/SPARK-21859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Cyanny
>
> When using SparkFiles.get to access a file on the driver in yarn-client or 
> yarn-cluster mode, it reports a file-not-found exception.
> This exception only happens on the driver; SparkFiles.get works fine on 
> executors.
> 
> We can reproduce the bug as follows:
> ```scala
> def testOnDriver(fileName: String) = {
>   val file = new File(SparkFiles.get(fileName))
>   if (!file.exists()) {
>     logging.info(s"$file not exist")
>   } else {
>     // print file content on driver
>     val content = Source.fromFile(file).getLines().mkString("\n")
>     logging.info(s"File content: ${content}")
>   }
> }
> // the output will be file not exist
> ```
> 
> ```python
> conf = SparkConf().setAppName("test files")
> sc = SparkContext(appName="spark files test")
> 
> def test_on_driver(filename):
>     file = SparkFiles.get(filename)
>     print("file path: {}".format(file))
>     if os.path.exists(file):
>         with open(file) as f:
>             lines = f.readlines()
>             print(lines)
>     else:
>         print("file doesn't exist")
>         run_command("ls .")
> ```
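
For reference, a minimal self-contained sketch of the driver-side usage being reported (the file name and path are illustrative, not from the JIRA):

{code}
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="spark files test")
# Distribute an illustrative local file to the application...
sc.addFile("/tmp/data.txt")
# ...then resolve its local path on the driver; per this report, the resolved
# path does not exist on the driver in yarn-client/yarn-cluster mode.
print(SparkFiles.get("data.txt"))
{code}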






[jira] [Created] (SPARK-21892) status code is 200 OK when kill application fail via spark master rest api

2017-08-31 Thread Zhuang Xueyin (JIRA)
Zhuang Xueyin created SPARK-21892:
-

 Summary: status code is 200 OK  when kill application fail via 
spark master rest api
 Key: SPARK-21892
 URL: https://issues.apache.org/jira/browse/SPARK-21892
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Zhuang Xueyin
Priority: Minor


Sent a POST request to the Spark master REST API, e.g.:
http://:6066/v1/submissions/kill/driver-xxx

Request body:
{
  "action" : "KillSubmissionRequest",
  "clientSparkVersion" : "2.1.0"
}

Response body:
{
  "action" : "KillSubmissionResponse",
  "message" : "Driver driver-xxx has already finished or does not exist",
  "serverSparkVersion" : "2.1.0",
  "submissionId" : "driver-xxx",
  "success" : false
}

Response headers:
*Status Code: 200 OK*
Content-Length: 203
Content-Type: application/json; charset=UTF-8
Date: Fri, 01 Sep 2017 05:56:04 GMT
Server: Jetty(9.2.z-SNAPSHOT)

Result:
The status code is 200 OK even when killing the application via the Spark master 
REST API fails, while the response body indicates that the operation was not 
successful. This does not follow REST API conventions; suggest improving it.
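
As a minimal illustration of the current behaviour (the host below is a placeholder; the endpoint and body are the ones shown above), a client today has to inspect the JSON "success" field rather than the HTTP status code:

{code}
import json
import urllib.request

body = json.dumps({"action": "KillSubmissionRequest",
                   "clientSparkVersion": "2.1.0"}).encode("utf-8")
req = urllib.request.Request(
    "http://master-host:6066/v1/submissions/kill/driver-xxx",  # placeholder host
    data=body, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read().decode("utf-8"))
    # resp.status is 200 even when result["success"] is False.
    print(resp.status, result["success"])
{code}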






[jira] [Assigned] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-31 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-21658:
---

Assignee: Chin Han Yu

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Chin Han Yu
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> It looks like {{na.replace}} is missing the default value {{None}} for {{value}}.
> Both docs say they are aliases:
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.






[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150059#comment-16150059
 ] 

Saisai Shao commented on SPARK-21658:
-

Done :).

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Chin Han Yu
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> It looks like {{na.replace}} is missing the default value {{None}} for {{value}}.
> Both docs say they are aliases:
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.






[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150022#comment-16150022
 ] 

Hyukjin Kwon commented on SPARK-21658:
--

[~jerryshao], I think you are now able to add a user to the "Contributor" role, if I 
understood it correctly. The user [~byakuinss] fixed this issue via 
https://github.com/apache/spark/pull/18895. Would you mind assigning this 
JIRA to her, please?

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> It looks like {{na.replace}} is missing the default value {{None}} for {{value}}.
> Both docs say they are aliases:
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.






[jira] [Resolved] (SPARK-21789) Remove obsolete codes for parsing abstract schema strings

2017-08-31 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21789.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18647
[https://github.com/apache/spark/pull/18647]

> Remove obsolete codes for parsing abstract schema strings
> -
>
> Key: SPARK-21789
> URL: https://issues.apache.org/jira/browse/SPARK-21789
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like there are private functions that are not used in the main code:
> {{_split_schema_abstract}}, {{_parse_field_abstract}}, 
> {{_parse_schema_abstract}} and {{_infer_schema_type}}, located in 
> {{python/pyspark/sql/types.py}}.
> They appear to have been unused from the start -
> https://github.com/apache/spark/commit/880eabec37c69ce4e9594d7babfac291b0f93f50.
> It looks like we could remove them.






[jira] [Assigned] (SPARK-21789) Remove obsolete codes for parsing abstract schema strings

2017-08-31 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21789:


Assignee: Hyukjin Kwon

> Remove obsolete codes for parsing abstract schema strings
> -
>
> Key: SPARK-21789
> URL: https://issues.apache.org/jira/browse/SPARK-21789
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like there are private functions that are not used in the main code:
> {{_split_schema_abstract}}, {{_parse_field_abstract}}, 
> {{_parse_schema_abstract}} and {{_infer_schema_type}}, located in 
> {{python/pyspark/sql/types.py}}.
> They appear to have been unused from the start -
> https://github.com/apache/spark/commit/880eabec37c69ce4e9594d7babfac291b0f93f50.
> It looks like we could remove them.






[jira] [Assigned] (SPARK-21779) Simpler Dataset.sample API in Python

2017-08-31 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21779:


Assignee: Hyukjin Kwon

> Simpler Dataset.sample API in Python
> 
>
> Key: SPARK-21779
> URL: https://issues.apache.org/jira/browse/SPARK-21779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> See parent ticket.






[jira] [Commented] (SPARK-21780) Simpler Dataset.sample API in R

2017-08-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150007#comment-16150007
 ] 

Hyukjin Kwon commented on SPARK-21780:
--

Let me work on this.

> Simpler Dataset.sample API in R
> ---
>
> Key: SPARK-21780
> URL: https://issues.apache.org/jira/browse/SPARK-21780
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> See parent ticket.






[jira] [Resolved] (SPARK-21779) Simpler Dataset.sample API in Python

2017-08-31 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21779.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18999
[https://github.com/apache/spark/pull/18999]

> Simpler Dataset.sample API in Python
> 
>
> Key: SPARK-21779
> URL: https://issues.apache.org/jira/browse/SPARK-21779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
> Fix For: 2.3.0
>
>
> See parent ticket.






[jira] [Commented] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled

2017-08-31 Thread liupengcheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149997#comment-16149997
 ] 

liupengcheng commented on SPARK-21885:
--

[~viirya] I think it's necessary. Consider this scenario: you have a timer job, 
and your schema may vary with time; you need to read the historical data with 
the old schema, but you do not want to use `INFER_AND_SAVE` to change the 
current schema.

What's more, even if you use `INFER_AND_SAVE`, it seems it will still 
infer the schema. Although there is some caching, I think it's not enough: the 
first execution of any query in each session would still be very slow.

{code:java}
private def inferIfNeeded(
    relation: MetastoreRelation,
    options: Map[String, String],
    fileFormat: FileFormat,
    fileIndexOpt: Option[FileIndex] = None): (StructType, CatalogTable) = {
  val inferenceMode = sparkSession.sessionState.conf.caseSensitiveInferenceMode
  val shouldInfer = (inferenceMode != {color:red}NEVER_INFER{color}) &&
    !relation.catalogTable.schemaPreservesCase
  val tableName = relation.catalogTable.identifier.unquotedString
  if (shouldInfer) {
    logInfo(s"Inferring case-sensitive schema for table $tableName (inference mode: " +
      s"$inferenceMode)")
    val fileIndex = fileIndexOpt.getOrElse {
      val rootPath = new Path(relation.catalogTable.location)
      new InMemoryFileIndex(sparkSession, Seq(rootPath), options, None)
    }

    val inferredSchema = fileFormat
      .inferSchema(
        sparkSession,
        options,
        fileIndex.listFiles(Nil).flatMap(_.files))
      .map(mergeWithMetastoreSchema(relation.catalogTable.schema, _))

    inferredSchema match {
      case Some(schema) =>
        if (inferenceMode == INFER_AND_SAVE) {
          updateCatalogSchema(relation.catalogTable.identifier, schema)
        }
        (schema, relation.catalogTable.copy(schema = schema))
      case None =>
        logWarning(s"Unable to infer schema for table $tableName from file format " +
          s"$fileFormat (inference mode: $inferenceMode). Using metastore schema.")
        (relation.catalogTable.schema, relation.catalogTable)
    }
  } else {
    (relation.catalogTable.schema, relation.catalogTable)
  }
}
{code}
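
For reference, a minimal sketch of how the inference mode discussed here is configured (the three values are the ones named in this thread; behaviour as summarized in the comments):

{code}
from pyspark.sql import SparkSession

# INFER_AND_SAVE persists the inferred case-sensitive schema back to the
# metastore, so only the first query pays the inference cost; INFER_ONLY
# re-infers it in every session; NEVER_INFER skips inference entirely.
spark = (SparkSession.builder
         .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
         .enableHiveSupport()
         .getOrCreate())
{code}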


> HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference 
> enabled
> ---
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY; 
> it will list all the leaf files of the rootPaths (just the tableLocation) and 
> then call `getFileBlockLocations` to turn each `FileStatus` into a 
> `LocatedFileStatus`. Calling `getFileBlockLocations` for so many leaf files 
> takes a long time, and it seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will 
> still launch a Spark job to merge schemas. I think that's expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java

[jira] [Commented] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled

2017-08-31 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149973#comment-16149973
 ] 

Liang-Chi Hsieh commented on SPARK-21885:
-

I tend to agree that when we don't actually need the block locations, the time 
spent obtaining them is a waste. However, {{INFER_ONLY}} should only be used 
in very limited scenarios where you can't use {{INFER_AND_SAVE}}, which only 
infers the schema the first time. So I'm not sure we should account for this 
when listing leaf files.

> HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference 
> enabled
> ---
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY; 
> it will list all the leaf files of the rootPaths (just the tableLocation) and 
> then call `getFileBlockLocations` to turn each `FileStatus` into a 
> `LocatedFileStatus`. Calling `getFileBlockLocations` for so many leaf files 
> takes a long time, and it seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will 
> still launch a Spark job to merge schemas. I think that's expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileInd

[jira] [Commented] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled

2017-08-31 Thread liupengcheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149880#comment-16149880
 ] 

liupengcheng commented on SPARK-21885:
--

[~srowen] Fixed, thanks! Could anybody take a look at this problem?

> HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference 
> enabled
> ---
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY; 
> it will list all the leaf files of the rootPaths (just the tableLocation) and 
> then call `getFileBlockLocations` to turn each `FileStatus` into a 
> `LocatedFileStatus`. Calling `getFileBlockLocations` for so many leaf files 
> takes a long time, and it seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will 
> still launch a Spark job to merge schemas. I think that's expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:30

[jira] [Resolved] (SPARK-21852) Empty Parquet Files created as a result of spark jobs fail when read

2017-08-31 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21852.
--
Resolution: Cannot Reproduce

[~sdalmia_asf], I think I need some details about the "certain spark jobs" to check 
this issue; otherwise it looks like I am unable to follow up and verify it. I am 
resolving this on the assumption that it is currently difficult to provide the information.

Please let me know if you are aware of a way to reproduce this that I 
could follow and try. For now, I am resolving this to reflect the current status of 
this JIRA.

> Empty Parquet Files created as a result of spark jobs fail when read
> 
>
> Key: SPARK-21852
> URL: https://issues.apache.org/jira/browse/SPARK-21852
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Shivam Dalmia
>
> I have intermittently faced an issue with certain Spark jobs writing parquet 
> files: they apparently succeed, but the written .parquet directory in HDFS is 
> an empty directory (with no _SUCCESS and _metadata parts, even). 
> Surprisingly, no errors are thrown by the Spark DataFrame writer.
> However, when attempting to read this written output, Spark throws the error:
> {{Unable to infer schema for Parquet. It must be specified manually}}
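
For what it's worth, a hypothetical workaround sketch (assuming an active {{spark}} session; the path and schema below are illustrative only): such a directory can still be read if the schema is supplied explicitly instead of being inferred, as the error message suggests.

{code}
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Illustrative schema and path, not taken from the report.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])
df = spark.read.schema(schema).parquet("/path/to/empty_output.parquet")
{code}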






[jira] [Commented] (SPARK-21477) Mark LocalTableScanExec's input data transient

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149802#comment-16149802
 ] 

Apache Spark commented on SPARK-21477:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19101

> Mark LocalTableScanExec's input data transient
> --
>
> Key: SPARK-21477
> URL: https://issues.apache.org/jira/browse/SPARK-21477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Mark the parameters rows and unsafeRow of LocalTableScanExec as transient. This 
> avoids serializing unneeded objects.






[jira] [Commented] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149801#comment-16149801
 ] 

Apache Spark commented on SPARK-21884:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19101

> Fix StackOverflowError on MetadataOnlyQuery
> ---
>
> Key: SPARK-21884
> URL: https://issues.apache.org/jira/browse/SPARK-21884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
> branch, it doesn't throw StackOverflowError.
> {code}
> scala> spark.version
> res0: String = 2.2.0
> scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
> (p)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION 
> (p=$p)"))
> scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
> java.lang.StackOverflowError
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
> {code}






[jira] [Assigned] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21891:


Assignee: Apache Spark  (was: Xiao Li)

> Add TBLPROPERTIES to DDL statement: CREATE TABLE USING 
> ---
>
> Key: SPARK-21891
> URL: https://issues.apache.org/jira/browse/SPARK-21891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21891:


Assignee: Xiao Li  (was: Apache Spark)

> Add TBLPROPERTIES to DDL statement: CREATE TABLE USING 
> ---
>
> Key: SPARK-21891
> URL: https://issues.apache.org/jira/browse/SPARK-21891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>







[jira] [Commented] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149781#comment-16149781
 ] 

Apache Spark commented on SPARK-21891:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19100

> Add TBLPROPERTIES to DDL statement: CREATE TABLE USING 
> ---
>
> Key: SPARK-21891
> URL: https://issues.apache.org/jira/browse/SPARK-21891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>







[jira] [Created] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING

2017-08-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21891:
---

 Summary: Add TBLPROPERTIES to DDL statement: CREATE TABLE USING 
 Key: SPARK-21891
 URL: https://issues.apache.org/jira/browse/SPARK-21891
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li









[jira] [Resolved] (SPARK-21862) Add overflow check in PCA

2017-08-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21862.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19078
[https://github.com/apache/spark/pull/19078]

> Add overflow check in PCA
> -
>
> Key: SPARK-21862
> URL: https://issues.apache.org/jira/browse/SPARK-21862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> We should add an overflow check in PCA; otherwise it is possible to throw a 
> `NegativeArraySizeException` when `k` and `numFeatures` are too large.
> The overflow-checking formula is here:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
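
As a rough, hypothetical illustration of such a guard (the helper name and constant below are illustrative; the exact formula is in the Breeze code linked above):

{code}
# Hypothetical guard, not the actual Spark/Breeze check: an array length
# computed past Int.MaxValue wraps negative on the JVM, which is what
# produces the NegativeArraySizeException mentioned above.
INT_MAX = 2**31 - 1

def check_pca_dimensions(k, num_features):
    if k * num_features > INT_MAX:
        raise ValueError(
            "k * numFeatures = %d exceeds 32-bit Int range; reduce k or numFeatures"
            % (k * num_features))
{code}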






[jira] [Closed] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-21884.
-
Resolution: Duplicate

This is fixed by SPARK-21477 in the master branch.

> Fix StackOverflowError on MetadataOnlyQuery
> ---
>
> Key: SPARK-21884
> URL: https://issues.apache.org/jira/browse/SPARK-21884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
> branch, it doesn't throw StackOverflowError.
> {code}
> scala> spark.version
> res0: String = 2.2.0
> scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
> (p)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION 
> (p=$p)"))
> scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
> java.lang.StackOverflowError
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
> {code}






[jira] [Commented] (SPARK-21088) CrossValidator, TrainValidationSplit should preserve all models after fitting: Python

2017-08-31 Thread Daniel Imberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149743#comment-16149743
 ] 

Daniel Imberman commented on SPARK-21088:
-

[~ajaysaini] Are you still working on this one?

> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Python
> -
>
> Key: SPARK-21088
> URL: https://issues.apache.org/jira/browse/SPARK-21088
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA






[jira] [Commented] (SPARK-20221) Port pyspark.mllib.linalg tests in pyspark/mllib/tests.py to pyspark.ml.linalg

2017-08-31 Thread Daniel Imberman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149742#comment-16149742
 ] 

Daniel Imberman commented on SPARK-20221:
-

I can do this one

> Port pyspark.mllib.linalg tests in pyspark/mllib/tests.py to pyspark.ml.linalg
> --
>
> Key: SPARK-20221
> URL: https://issues.apache.org/jira/browse/SPARK-20221
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark, Tests
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> There are several linear algebra unit tests in pyspark/mllib/tests.py which 
> should be ported over to pyspark/ml/tests.py.






[jira] [Resolved] (SPARK-21110) Structs should be usable in inequality filters

2017-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21110.
-
   Resolution: Fixed
 Assignee: Andrew Ray
Fix Version/s: 2.3.0

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Assignee: Andrew Ray
>Priority: Minor
> Fix For: 2.3.0
>
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}






[jira] [Assigned] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21652:


Assignee: (was: Apache Spark)

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true' and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) && 
> (col2#33 = col#34))
> !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33))) && (col2#33 = col1#32))   :- Filter (((isnotnull(col1#32) && 
> isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (

[jira] [Assigned] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21652:


Assignee: Apache Spark

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>Assignee: Apache Spark
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true' and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) && 
> (col2#33 = col#34))
> !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33))) && (col2#33 = col1#32))   :- Filter (((isnotnull(col1#32) && 
> isnotnull(col2#33)) && ((col1#32 = 1

[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149680#comment-16149680
 ] 

Apache Spark commented on SPARK-21652:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/19099

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true' and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) && 
> (col2#33 = col#34))
> !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33))) && (col

[jira] [Resolved] (SPARK-20676) Upload to PyPi

2017-08-31 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-20676.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Upload to PyPi
> --
>
> Key: SPARK-20676
> URL: https://issues.apache.org/jira/browse/SPARK-20676
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> Upload PySpark to PyPi.






[jira] [Commented] (SPARK-20676) Upload to PyPi

2017-08-31 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149656#comment-16149656
 ] 

holdenk commented on SPARK-20676:
-

Yes.

> Upload to PyPi
> --
>
> Key: SPARK-20676
> URL: https://issues.apache.org/jira/browse/SPARK-20676
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.1.1, 2.2.0
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> Upload PySpark to PyPi.






[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-21866:
---
Attachment: (was: SPIP - Image support for Apache Spark.pdf)

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information abou
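
The quoted schema description is cut off above; for orientation, here is a minimal Scala sketch of the structure the SPIP describes. Only the "mode" and "origin" fields survive in the quote, so the remaining fields, their types and their nullability are assumptions for illustration, not the final API.

{code}
import org.apache.spark.sql.types._

// Sketch of the proposed image schema (assumed shape; not the released API).
val imageSchema = StructType(Seq(
  StructField("mode", StringType, nullable = false),        // OpenCV-style type, e.g. "CV_8UC3"
  StructField("origin", StringType, nullable = true),       // where the image was loaded from
  StructField("height", IntegerType, nullable = false),     // assumed: height in pixels
  StructField("width", IntegerType, nullable = false),      // assumed: width in pixels
  StructField("nChannels", IntegerType, nullable = false),  // assumed: number of channels
  StructField("data", BinaryType, nullable = false)         // assumed: decompressed pixel bytes
))
{code}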

[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-21866:
---
Attachment: SPIP - Image support for Apache Spark V1.1.pdf

Updated authors' list.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Som

[jira] [Commented] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100

2017-08-31 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149600#comment-16149600
 ] 

Andrew Ash commented on SPARK-21807:


For reference, here's a stacktrace I'm seeing on a cluster before this change 
that I think this PR will improve:

{noformat}
"spark-task-4" #714 prio=5 os_prio=0 tid=0x7fa368031000 nid=0x4d91 runnable 
[0x7fa24e592000]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:220)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.EqualNullSafe.equals(predicates.scala:505)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.growTable(FlatHashTable.scala:225)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:159)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
at scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
at scala.collection.mutable.HashSet.clone(HashSet.scala:83)
at scala.collection.mutable.HashSet.clone(HashSet.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
at 
scala.collection.TraversableOnce$class.$div$colon(TraversableOn
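
The stack trace is truncated above. For anyone trying to reproduce the effect in the title, a rough sketch is to build a plan with a few hundred aliased expressions on top of a Filter, so the constraint set (an ExpressionSet of equalities) becomes large; the column count and names below are arbitrary.

{code}
import org.apache.spark.sql.functions.col

// Illustrative repro sketch: each withColumn adds an aliased arithmetic expression,
// so deriving aliased constraints dominates planning time long before execution.
val base = spark.range(1000).toDF("id").filter(col("id") > 0)
val wide = (1 to 200).foldLeft(base)((df, i) => df.withColumn(s"c$i", col("id") + i))
wide.explain(true)  // the slow part is the optimizer, not the job itself
{code}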

[jira] [Commented] (SPARK-21276) Update lz4-java to remove custom LZ4BlockInputStream

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149568#comment-16149568
 ] 

Apache Spark commented on SPARK-21276:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18883

> Update  lz4-java to remove custom LZ4BlockInputStream
> -
>
> Key: SPARK-21276
> URL: https://issues.apache.org/jira/browse/SPARK-21276
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Trivial
> Fix For: 2.3.0
>
>
> We currently use custom LZ4BlockInputStream to read concatenated byte stream 
> in shuffle 
> (https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/io/LZ4BlockInputStream.java#L38).
>  In the recent pr (https://github.com/lz4/lz4-java/pull/105), this 
> functionality is implemented in lz4-java upstream as well. So, we might update 
> to the lz4-java version that will be released in the near future.
> Issue about the next lz4-java release
> https://github.com/lz4/lz4-java/issues/98
> Diff between the latest release and the master in lz4-java
> https://github.com/lz4/lz4-java/compare/62f7547abb0819d1ca1e669645ee1a9d26cd60b0...6480bd9e06f92471bf400c16d4d5f3fd2afa3b3d
>  * fixed NPE in XXHashFactory similarly
>  * Don't place resources in default package to support shading
>  * Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed
>  * Try to load lz4-java from java.library.path, then fallback to bundled
>  * Add ppc64le binary
>  * Add s390x JNI binding
>  * Add basic LZ4 Frame v1.5.0 support
>  * enable aarch64 support for lz4-java
>  * Allow unsafeInstance() for ppc64le architecture
>  * Add unsafeInstance support for AArch64
>  * Support 64-bit JNI build on Solaris
>  * Avoid over-allocating a buffer
>  * Allow EndMark to be incompressible for LZ4FrameInputStream.
>  * Concat byte stream






[jira] [Created] (SPARK-21890) ObtainCredentials does not pass creds to addDelegationTokens

2017-08-31 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-21890:


 Summary: ObtainCredentials does not pass creds to 
addDelegationTokens
 Key: SPARK-21890
 URL: https://issues.apache.org/jira/browse/SPARK-21890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Sanket Reddy


I observed this while running an Oozie job trying to connect to HBase via Spark.
It looks like the creds are not being passed in 
https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53
for the 2.2 release.

Stack trace:
Warning: Skip remote jar 
hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/schintap/spark_oozie/apps/lib/spark-starter-2.0-SNAPSHOT-jar-with-dependencies.jar.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], 
main() threw exception, Delegation Token can be issued only with kerberos or 
web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)

org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)

at org.apache.hadoop.ipc.Client.call(Client.java:1471)
at org.apache.hadoop.ipc.Client.call(Client.java:1408)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy10.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1038)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
at 
org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:531)
at 
org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:509)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegation
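
For context, the Hadoop API in question accepts the Credentials object explicitly, so the fix is about threading the same object through. A minimal sketch of the expected call shape follows; the path and renewer below are placeholders, not taken from the report.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Sketch only: obtain HDFS delegation tokens into the same Credentials object
// that the submission code later serializes and ships with the application.
val hadoopConf = new Configuration()
val creds = new Credentials()
val renewer = UserGroupInformation.getCurrentUser.getShortUserName  // placeholder renewer
val fs = FileSystem.get(new Path("hdfs:///").toUri, hadoopConf)
fs.addDelegationTokens(renewer, creds)  // the creds argument is what the report says gets dropped
{code}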

[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

2017-08-31 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149510#comment-16149510
 ] 

Timothy Hunter commented on SPARK-21184:


[~a1ray] thank you for the report; someone should investigate the values you 
describe.

You raise some valid questions about the choice of data structures and 
algorithm, which were discussed during the implementation and that can 
certainly be revisited:

- tree structures: the major constraint here is that this structure gets 
serialized often, due to how UDAFs work. This is why the current implementation 
is amortized over multiple records. Edo Liberty has published some recent work 
that is relevant in that area.

- algorithm: we looked at t-digest (and q-digest). The main concern back then 
was that there was no published worst-case guarantee for a given target precision. 
This is still the case to my knowledge. Because of that, it is hard to 
understand what could happen in some unusual cases - which tend to be not so 
unusual in big data. That being said, it looks like it is a popular and 
well-maintained choice now, so I am certainly open to relaxing this constraint.

> QuantileSummaries implementation is wrong and QuantileSummariesSuite fails 
> with larger n
> 
>
> Key: SPARK-21184
> URL: https://issues.apache.org/jira/browse/SPARK-21184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>
> 1. QuantileSummaries implementation does not match the paper it is supposed 
> to be based on.
> 1a. The compress method 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
>  merges neighboring buckets, but thats not what the paper says to do. The 
> paper 
> (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
> describes an implicit tree structure and the compress method deletes selected 
> subtrees.
> 1b. The paper does not discuss merging these summary data structures at all. 
> The following comment is in the merge method of QuantileSummaries:
> {quote}  // The GK algorithm is a bit unclear about it, but it seems 
> there is no need to adjust the
>   // statistics during the merging: the invariants are still respected 
> after the merge.{quote}
> Unless I'm missing something that needs substantiation, it's not clear that 
> that the invariants hold.
> 2. QuantileSummariesSuite fails with n = 1 (and other non trivial values)
> https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27
> One possible solution if these issues can't be resolved would be to move to 
> an algorithm that explicitly supports merging and is well tested like 
> https://github.com/tdunning/t-digest
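
For anyone who wants to poke at this from the public API, the summaries under discussion back approxQuantile; a small sketch follows (the column, probabilities and relativeError are arbitrary choices for illustration).

{code}
// Sketch: relativeError is the epsilon whose worst-case rank guarantee the
// GK-style summary is supposed to provide; with a correct implementation each
// returned value's rank should be within relativeError * n of the exact rank.
val df = spark.range(0, 100000).toDF("x")
val quantiles = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.001)
println(quantiles.mkString(", "))
{code}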






[jira] [Issue Comment Deleted] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-31 Thread Anton Okolnychyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-21652:
-
Comment: was deleted

(was: Is there anything I can help here? I see that some cost-based estimation 
is needed. If there is an example/guide of what should be done, I can try to 
fix the issue.)

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true' and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) && 
> (col2#33 = col#34))
> !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) &&

[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-31 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149462#comment-16149462
 ] 

Anton Okolnychyi commented on SPARK-21652:
--

Is there anything I can help with here? I see that some cost-based estimation is 
needed. If there is an example/guide of what should be done, I can try to fix 
the issue.
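
In the meantime, the iteration cap mentioned in the description is configurable, which makes the non-converging behaviour quick to observe; a sketch reusing the t1/t2 tables from the repro in the description (the config key is assumed to be the 2.x one, spark.sql.optimizer.maxIterations):

{code}
// Sketch: lower the fixed-point cap so the rule trace stays short while debugging.
// This only shortens the trace; it does not break the InferFiltersFromConstraints /
// PushPredicateThroughJoin / ConstantPropagation cycle described in this ticket.
spark.conf.set("spark.sql.optimizer.maxIterations", "10")
spark.sql(
  """SELECT * FROM t1, t2
    |WHERE t1.col1 = 1 AND 1 = t1.col2
    |  AND t1.col1 = t2.col AND t1.col2 = t2.col""".stripMargin).explain(true)
{code}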

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true' and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) && 
> (col2#33 = col#34))
> !:- Filter (((isnotnull(col1#32) 

[jira] [Commented] (SPARK-21842) Support Kerberos ticket renewal and creation in Mesos

2017-08-31 Thread Kalvin Chau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149433#comment-16149433
 ] 

Kalvin Chau commented on SPARK-21842:
-


I'm looking into implementing this feature.

Is the implementation still following the design doc, i.e. start a periodic 
renewal thread in the driver and then use RPC calls to send the renewed tokens 
from the driver/scheduler to the executors?

Or does the YARN polling approach sound like something we should go with, to stay 
consistent with YARN?
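
To make the first option concrete, here is a minimal sketch of the driver-side renewal loop being discussed, assuming the design-doc approach; the keytab parameters and the shipTokens callback (standing in for the RPC send to executors) are placeholders, not existing Spark APIs.

{code}
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: periodically re-login from the keytab on the driver and hand the
// refreshed credentials to a caller-supplied callback (e.g. an RPC send to executors).
def startRenewalThread(principal: String, keytabPath: String, intervalSec: Long)
                      (shipTokens: () => Unit): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  val task = new Runnable {
    override def run(): Unit = {
      val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath)
      ugi.checkTGTAndReloginFromKeytab()  // refresh the Kerberos TGT from the keytab
      shipTokens()                        // placeholder: push new delegation tokens via RPC
    }
  }
  scheduler.scheduleAtFixedRate(task, 0L, intervalSec, TimeUnit.SECONDS)
}
{code}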

> Support Kerberos ticket renewal and creation in Mesos 
> --
>
> Key: SPARK-21842
> URL: https://issues.apache.org/jira/browse/SPARK-21842
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>
> We at Mesosphere have written Kerberos support for Spark on Mesos. The code 
> to use Kerberos on a Mesos cluster has been added to Apache Spark 
> (SPARK-16742). This ticket is to complete the implementation and allow for 
> ticket renewal and creation. Specifically for long running and streaming jobs.
> Mesosphere design doc (needs revision, wip): 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6






[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \

2017-08-31 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149401#comment-16149401
 ] 

Anton Okolnychyi commented on SPARK-21850:
--

[~instanceof me] yeah, you are absolutely correct, I need to double escape it. 
This means that a scala string "n" is "\\n" on the SQL layer and is 
"\n" as an actual column value, right? 

The LIKE pattern validation happens on actual values, which means that the last 
statement in the block below also fails with the same exception:
{code}
Seq((1, "\\nbc")).toDF("col1", "col2").write.saveAsTable("t1")
spark.sql("SELECT * FROM t1").show()
+++
|col1|col2|
+++
|   1|\nbc|
+++
spark.sql("SELECT * FROM t1 WHERE col2 = 'nbc'").show()
+++
|col1|col2|
+++
|   1|\nbc|
+++
spark.sql("SELECT * FROM t1 WHERE col2 LIKE 'nb_'").show() // fails
{code}

Since "\n" as a column value corresponds to "n" in a scala string and the 
exception occurs even with literals, the overall behavior looks logical. Am I 
missing anything? I am curious to understand, so any explanation is welcome.

> SparkSQL cannot perform LIKE someColumn if someColumn's value contains a 
> backslash \
> 
>
> Key: SPARK-21850
> URL: https://issues.apache.org/jira/browse/SPARK-21850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrien Lavoillotte
>
> I have a test table looking like this:
> {code:none}
> spark.sql("select * from `test`.`types_basic`").show()
> {code}
> ||id||c_tinyint|| [...] || c_string||
> |  0| -128| [...] |  string|
> |  1|0| [...] |string 'with' "qu...|
> |  2|  127| [...] |  unicod€ strĭng|
> |  3|   42| [...] |there is a \n in ...|
> |  4| null| [...] |null|
> Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n 
> string", not a line break). I would like to join another table using a LIKE 
> condition (to join on prefix). If I do this:
> {code:none}
> spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE 
> CONCAT(a.`c_string`, '%')").show()
> {code}
> I get the following error in spark 2.2 (but not in any earlier version):
> {noformat}
> 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 
> (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the 
> pattern 'there is a \n in this line%' is invalid, the escape character is not 
> allowed to precede 'n';
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It seems to me that if LIKE requires special escaping there, then it should 
> be provided by SparkSQL on the value of the column.






[jira] [Commented] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149397#comment-16149397
 ] 

Apache Spark commented on SPARK-21583:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/19098

> Create a ColumnarBatch with ArrowColumnVectors for row based iteration
> --
>
> Key: SPARK-21583
> URL: https://issues.apache.org/jira/browse/SPARK-21583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
>
> The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data.  
> It would be useful to be able to create a {{ColumnarBatch}} to allow row 
> based iteration over multiple {{ArrowColumnVectors}}.  This would avoid extra 
> copying to translate column elements into rows and be more efficient memory 
> usage while increasing performance.
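
To make the intended usage concrete, here is a sketch of the row-based iteration this ticket enables. The package and constructor shape are assumed from the 2.3.0 line, and arrowVectors stands in for vectors already populated elsewhere (e.g. read from an Arrow record batch).

{code}
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Sketch only: wrap already-populated Arrow column vectors in a ColumnarBatch and
// iterate over them row by row without copying the data into separate row objects.
def iterateRows(arrowVectors: Array[ArrowColumnVector], numRows: Int): Unit = {
  val batch = new ColumnarBatch(arrowVectors.toArray[ColumnVector])
  batch.setNumRows(numRows)
  val it = batch.rowIterator()
  while (it.hasNext) {
    val row = it.next()  // an InternalRow view over the columns
    // consume row.getInt(0), row.getUTF8String(1), ... as needed
  }
  batch.close()  // releases the underlying Arrow buffers
}
{code}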






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Danil Kirsanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149393#comment-16149393
 ] 

Danil Kirsanov commented on SPARK-21866:


Hi Sean, echoing the previous comments: yes, this is a small project, basically 
just a schema and the functions for reading images. 
At the same time, figuring it out proved to be quite time-consuming, so it's 
easier to agree on a common format that could be shared among different 
pipelines and libraries. 


> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by conventio

[jira] [Comment Edited] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149350#comment-16149350
 ] 

Parth Gandhi edited comment on SPARK-21888 at 8/31/17 6:01 PM:
---

The spark job runs successfully only if hbase-site.xml is placed in 
SPARK_CONF_DIR. If I add the xml file to --jars then it gets added to the 
driver classpath which is required but hbase fails to get a valid Kerberos 
token as the xml file is not found in the system classpath on the gateway where 
I launch the application.


was (Author: pgandhi):
The spark job runs successfully only if hbase-site.xml is placed in 
SPARK_CONF_DIR. If I add the xml file to --jars then it gets added to the 
driver classmate which is required but hbase fails to get a valid Kerberos 
token as the xml file is not found in the system classpath on the gateway where 
I launch the application.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the necessary config 
> files required by HBase into the Spark conf folder, we cannot specify their 
> exact locations on the client classpath, which we could do earlier by setting 
> the environment variable "SPARK_CLASSPATH".






[jira] [Resolved] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher

2017-08-31 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20812.

   Resolution: Fixed
 Assignee: Arthur Rand  (was: Apache Spark)
Fix Version/s: 2.3.0

> Add Mesos Secrets support to the spark dispatcher
> -
>
> Key: SPARK-20812
> URL: https://issues.apache.org/jira/browse/SPARK-20812
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Michael Gummelt
>Assignee: Arthur Rand
> Fix For: 2.3.0
>
>
> Mesos 1.4 will support secrets.  In order to support sending keytabs through 
> the Spark Dispatcher, or any other secret, we need to integrate this with the 
> Spark Dispatcher.
> The integration should include support for both file-based and env-based 
> secrets.






[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149350#comment-16149350
 ] 

Parth Gandhi commented on SPARK-21888:
--

The spark job runs successfully only if hbase-site.xml is placed in 
SPARK_CONF_DIR. If I add the xml file to --jars then it gets added to the 
driver classpath which is required but hbase fails to get a valid Kerberos 
token as the xml file is not found in the system classpath on the gateway where 
I launch the application.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the necessary config 
> files required by HBase into the Spark conf folder, we cannot specify their 
> exact locations on the client classpath, which we could do earlier by setting 
> the environment variable "SPARK_CLASSPATH".






[jira] [Commented] (SPARK-17107) Remove redundant pushdown rule for Union

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149322#comment-16149322
 ] 

Apache Spark commented on SPARK-17107:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19097

> Remove redundant pushdown rule for Union
> 
>
> Key: SPARK-17107
> URL: https://issues.apache.org/jira/browse/SPARK-17107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Optimizer rules PushThroughSetOperations and PushDownPredicate have a 
> redundant rule to push down Filter through Union. We should remove it.
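
For anyone verifying that dropping the duplicate rule keeps the behaviour, here is a tiny sketch of the plan shape both rules currently cover: a Filter on top of a Union should still end up with the predicate pushed below the Union on both children after optimization (table and column names are invented for the example).

{code}
// Sketch: compare the analyzed and optimized plans; the "x > 5" predicate should
// appear beneath the Union on both inputs once the optimizer has run.
val a = spark.range(0, 10).toDF("x")
val b = spark.range(10, 20).toDF("x")
a.union(b).filter("x > 5").explain(true)
{code}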






[jira] [Resolved] (SPARK-21889) Web site headers not rendered correctly in some pages

2017-08-31 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21889.

Resolution: Duplicate

> Web site headers not rendered correctly in some pages
> -
>
> Key: SPARK-21889
> URL: https://issues.apache.org/jira/browse/SPARK-21889
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Some headers on pages of the Spark docs are not rendering correctly. In 
> http://spark.apache.org/docs/latest/configuration.html, you can find a few of 
> these examples under the "Memory Management" section, where it shows things 
> like:
> {noformat}
> ### Execution Behavior
> {noformat}
> {noformat}
> ### Networking
> {noformat}
> And others.






[jira] [Commented] (SPARK-21889) Web site headers not rendered correctly in some pages

2017-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149318#comment-16149318
 ] 

Marcelo Vanzin commented on SPARK-21889:


Ah, I only searched for open bugs, not closed ones.

> Web site headers not rendered correctly in some pages
> -
>
> Key: SPARK-21889
> URL: https://issues.apache.org/jira/browse/SPARK-21889
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Some headers on pages of the Spark docs are not rendering correctly. In 
> http://spark.apache.org/docs/latest/configuration.html, you can find a few of 
> these examples under the "Memory Management" section, where it shows things 
> like:
> {noformat}
> ### Execution Behavior
> {noformat}
> {noformat}
> ### Networking
> {noformat}
> And others.






[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149315#comment-16149315
 ] 

Sean Owen commented on SPARK-21888:
---

For your case specifically, shouldn't hbase-site.xml be available from the 
cluster environment, given its name?
What happens if you add the file to --jars anyway? I wouldn't be surprised if it 
just ends up on the classpath too.
Or, could you build it into your app jar?

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the necessary config 
> files required by HBase into the Spark conf folder, we cannot specify their 
> exact locations on the client classpath, which we could do earlier by setting 
> the environment variable "SPARK_CLASSPATH".






[jira] [Commented] (SPARK-21889) Web site headers not rendered correctly in some pages

2017-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149308#comment-16149308
 ] 

Sean Owen commented on SPARK-21889:
---

Yeah I think this is the stuff fixed in 
https://issues.apache.org/jira/browse/SPARK-19106

> Web site headers not rendered correctly in some pages
> -
>
> Key: SPARK-21889
> URL: https://issues.apache.org/jira/browse/SPARK-21889
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Some headers on pages of the Spark docs are not rendering correctly. In 
> http://spark.apache.org/docs/latest/configuration.html, you can find a few of 
> these examples under the "Memory Management" section, where it shows things 
> like:
> {noformat}
> ### Execution Behavior
> {noformat}
> {noformat}
> ### Networking
> {noformat}
> And others.






[jira] [Created] (SPARK-21889) Web site headers not rendered correctly in some pages

2017-08-31 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21889:
--

 Summary: Web site headers not rendered correctly in some pages
 Key: SPARK-21889
 URL: https://issues.apache.org/jira/browse/SPARK-21889
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin
Priority: Minor


Some headers on pages of the Spark docs are not rendering correctly. In 
http://spark.apache.org/docs/latest/configuration.html, you can find a few of 
these examples under the "Memory Management" section, where it shows things 
like:

{noformat}
### Execution Behavior
{noformat}

{noformat}
### Networking
{noformat}

And others.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149263#comment-16149263
 ] 

Parth Gandhi commented on SPARK-21888:
--

Sorry, I forgot to mention that --jars certainly adds jar files to the client 
classpath, but not config files like hbase-site.xml.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on Yarn in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the config files required 
> by HBase into the Spark conf folder, we cannot specify their exact locations on 
> the client classpath, which we could previously do by setting the environment 
> variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator

2017-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21886.
-
   Resolution: Fixed
 Assignee: Jacek Laskowski
Fix Version/s: 2.3.0

> Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD 
> logical operator
> ---
>
> Key: SPARK-21886
> URL: https://issues.apache.org/jira/browse/SPARK-21886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Minor
> Fix For: 2.3.0
>
>
> While exploring where {{LogicalRDD}} is created I noticed that there are a 
> few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is 
> to simply re-use the method wherever possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149217#comment-16149217
 ] 

Sean Owen commented on SPARK-21888:
---

You haven't said what you tried. --jars does this. 

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on Yarn in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the config files required 
> by HBase into the Spark conf folder, we cannot specify their exact locations on 
> the client classpath, which we could previously do by setting the environment 
> variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21878) Create SQLMetricsTestUtils

2017-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21878.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Create SQLMetricsTestUtils
> --
>
> Key: SPARK-21878
> URL: https://issues.apache.org/jira/browse/SPARK-21878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Creates `SQLMetricsTestUtils` for the utility functions shared by the 
> Hive-specific and the other SQLMetrics test cases. 
> Also, moves two SQLMetrics test cases from sql/hive to sql/core. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149186#comment-16149186
 ] 

Timothy Hunter commented on SPARK-21866:


[~srowen] thank you for the comments. Indeed, this proposal is limited in scope 
on purpose, because it aims at achieving consensus across multiple libraries. For 
instance, the MMLSpark project from Microsoft uses this data format to interface 
with OpenCV (wrapped through JNI), and the Deep Learning Pipelines project is 
going to rely on it as its primary mechanism to load and process images. Also, 
nothing precludes adding common transforms to this package later; it is easier to 
start small.

Regarding the Spark package, yes, it will be discontinued, like the CSV parser 
was. The aim is to offer a working library that can be tried out without having 
to wait for an implementation to be merged into Spark itself.
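
As an illustration of the kind of data format being discussed, the sketch below 
shows a single-struct, OpenCV-style image column. The field names and types are 
assumptions for illustration only, not the SPIP's final specification.

{code:scala}
// Illustrative only: an in-memory, decompressed image record as one struct column.
import org.apache.spark.sql.types._

val imageSchema = StructType(Seq(
  StructField("origin", StringType, nullable = true),       // e.g. the source URI
  StructField("height", IntegerType, nullable = false),
  StructField("width", IntegerType, nullable = false),
  StructField("nChannels", IntegerType, nullable = false),
  StructField("mode", IntegerType, nullable = false),        // OpenCV-style type code
  StructField("data", BinaryType, nullable = false)          // decompressed pixel bytes
))
{code}

Keeping the representation to one decompressed struct per row is what lets the 
libraries listed in the SPIP exchange images without extra conversion layers.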

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = Struct

[jira] [Updated] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Parth Gandhi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-21888:
-
Description: While running Spark on Yarn in cluster mode, currently there 
is no way to add any config files, jars etc. to Client classpath. An example 
for this is that suppose you want to run an application that uses hbase. Then, 
unless and until we do not copy the necessary config files required by hbase to 
Spark Config folder, we cannot specify or set their exact locations in 
classpath on Client end which we could do so earlier by setting the environment 
variable "SPARK_CLASSPATH".  (was: While running Spark on Yarn in cluster mode, 
currently there is no way to add any config files, jars etc. to Client 
classpath. An example for this is that suppose you want to run an application 
that uses hbase. Then, unless and until we do not copy the necessary config 
files required by hbase to Spark Config folder, we cannot specify or set their 
exact locations in classpath on Client end which we could do earlier by setting 
the environment variable "SPARK_CLASSPATH".)

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on Yarn in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the config files required 
> by HBase into the Spark conf folder, we cannot specify their exact locations on 
> the client classpath, which we could previously do by setting the environment 
> variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Parth Gandhi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-21888:
-
Description: While running Spark on Yarn in cluster mode, currently there 
is no way to add any config files, jars etc. to Client classpath. An example 
for this is that suppose you want to run an application that uses hbase. Then, 
unless and until we do not copy the necessary config files required by hbase to 
Spark Config folder, we cannot specify or set their exact locations in 
classpath on Client end which we could do earlier by setting the environment 
variable "SPARK_CLASSPATH".  (was: While running Spark on Yarn in cluster mode, 
currently there is no way to add any config files, jars etc. to Client 
classpath. An example for this is that suppose you want to run an application 
that uses hbase. Then, unless and until we do not copy the config files to 
Spark Config folder, we cannot specify or set their exact locations in 
classpath on Client end which we could do earlier by setting the environment 
variable "SPARK_CLASSPATH".)

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on Yarn in cluster mode, there is currently no way to add 
> config files, jars, etc. to the client classpath. For example, suppose you want 
> to run an application that uses HBase. Unless we copy the config files required 
> by HBase into the Spark conf folder, we cannot specify their exact locations on 
> the client classpath, which we could previously do by setting the environment 
> variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-08-31 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-21888:


 Summary: Cannot add stuff to Client Classpath for Yarn Cluster Mode
 Key: SPARK-21888
 URL: https://issues.apache.org/jira/browse/SPARK-21888
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Parth Gandhi
Priority: Minor


While running Spark on Yarn in cluster mode, there is currently no way to add 
config files, jars, etc. to the client classpath. For example, suppose you want to 
run an application that uses HBase. Unless we copy the config files into the Spark 
conf folder, we cannot specify their exact locations on the client classpath, which 
we could previously do by setting the environment variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled

2017-08-31 Thread liupengcheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-21885:
-
Summary: HiveMetastoreCatalog.InferIfNeeded too slow when 
caseSensitiveInference enabled  (was: HiveMetastoreCatalog#InferIfNeeded too 
slow when caseSensitiveInference enabled)

> HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference 
> enabled
> ---
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis process of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY. It 
> will list all the leaf files of the rootPaths (just the tableLocation) and then 
> call `getFileBlockLocations` to turn each `FileStatus` into a `LocatedFileStatus`. 
> Calling `getFileBlockLocations` for so many leaf files takes a long time, and it 
> seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will still 
> launch a Spark job to merge schemas. I think that is expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAware

[jira] [Updated] (SPARK-21885) HiveMetastoreCatalog#InferIfNeeded too slow when caseSensitiveInference enabled

2017-08-31 Thread liupengcheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-21885:
-
Summary: HiveMetastoreCatalog#InferIfNeeded too slow when 
caseSensitiveInference enabled  (was: SQL inferSchema too slow)

> HiveMetastoreCatalog#InferIfNeeded too slow when caseSensitiveInference 
> enabled
> ---
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis process of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY. It 
> will list all the leaf files of the rootPaths (just the tableLocation) and then 
> call `getFileBlockLocations` to turn each `FileStatus` into a `LocatedFileStatus`. 
> Calling `getFileBlockLocations` for so many leaf files takes a long time, and it 
> seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will still 
> launch a Spark job to merge schemas. I think that is expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFi

[jira] [Comment Edited] (SPARK-21802) Make sparkR MLP summary() expose probability column

2017-08-31 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136234#comment-16136234
 ] 

Weichen Xu edited comment on SPARK-21802 at 8/31/17 1:48 PM:
-

cc [~felixcheung]
I found this is already supported automatically, so in which case did you find it 
does not show up?


was (Author: weichenxu123):
cc [~felixcheung]
Discuss:
Should we also add support for probability to other classification algos such 
as LogisticRegression/LinearSVC and so on..?

> Make sparkR MLP summary() expose probability column
> ---
>
> Key: SPARK-21802
> URL: https://issues.apache.org/jira/browse/SPARK-21802
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Make sparkR MLP summary() expose probability column



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13258) --conf properties not honored in Mesos cluster mode

2017-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13258.
---
Resolution: Not A Problem

Great, thanks for the research and follow up.

> --conf properties not honored in Mesos cluster mode
> ---
>
> Key: SPARK-13258
> URL: https://issues.apache.org/jira/browse/SPARK-13258
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Michael Gummelt
>
> Spark properties set on {{spark-submit}} via the deprecated 
> {{SPARK_JAVA_OPTS}} are passed along to the driver, but those set via the 
> preferred {{--conf}} are not.
> For example, this results in the URI being fetched in the executor:
> {{SPARK_JAVA_OPTS="-Dspark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md
>  -Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0" 
> ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077  
> --class org.apache.spark.examples.SparkPi 
> http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}}
> This does not:
> {{SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0"
>  ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077 
> --conf 
> spark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md
>  --class org.apache.spark.examples.SparkPi 
> http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}}
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L369
> In the above line of code, you can see that SPARK_JAVA_OPTS is passed along 
> to the driver, so those properties take effect.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L373
> Whereas in this line of code, you see that {{--conf}} variables are set on 
> {{SPARK_EXECUTOR_OPTS}}, which AFAICT has absolutely no effect because this 
> env var is being set on the driver, not the executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13258) --conf properties not honored in Mesos cluster mode

2017-08-31 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148982#comment-16148982
 ] 

Stavros Kontopoulos commented on SPARK-13258:
-

SPARK_JAVA_OPTS has been removed from the code base 
(https://issues.apache.org/jira/browse/SPARK-14453). I verified that this is no 
longer a bug. I ssh'd into a node in a DC/OS cluster and ran a job in cluster 
mode:
./bin/spark-submit --verbose --deploy-mode cluster --master 
mesos://spark.marathon.mesos:14003 --conf 
spark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md
 --conf spark.executor.memory=1g  --conf spark.executor.cores=2 --conf 
spark.cores.max=4 --conf spark.mesos.executor.home=/opt/spark/dist --conf 
spark.mesos.executor.docker.image=mesosphere/spark:1.1.1-2.2.0-hadoop-2.6 
--class org.apache.spark.examples.SparkPi 
https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar
 1000

Looking at the driver sandbox, the README was fetched:
I0831 13:24:54.727625 11257 fetcher.cpp:580] Fetched 
'https://raw.githubusercontent.com/mesosphere/spark/master/README.md' to 
'/var/lib/mesos/slave/slaves/53b59f05-430e-4837-aa14-658ca13a82d3-S0/frameworks/53b59f05-430e-4837-aa14-658ca13a82d3-0002/executors/driver-20170831132454-0007/runs/54527349-52de-49e2-b090-9587e4547b99/README.md'
I0831 13:24:54.976838 11268 exec.cpp:162] Version: 1.2.2
I0831 13:24:54.983006 11275 exec.cpp:237] Executor registered on agent 
53b59f05-430e-4837-aa14-658ca13a82d3-S0
17/08/31 13:24:56 INFO SparkContext: Running Spark version 2.2.0

Also looking at the executor's sandbox:
W0831 13:24:58.799739 11234 fetcher.cpp:322] Copying instead of extracting 
resource from URI with 'extract' flag, because it does not seem to be an 
archive: https://raw.githubusercontent.com/mesosphere/spark/master/README.md
I0831 13:24:58.799767 11234 fetcher.cpp:580] Fetched 
'https://raw.githubusercontent.com/mesosphere/spark/master/README.md' to 
'/var/lib/mesos/slave/slaves/53b59f05-430e-4837-aa14-658ca13a82d3-S2/frameworks/53b59f05-430e-4837-aa14-658ca13a82d3-0002-driver-20170831132454-0007/executors/3/runs/8aa9a13d-c6a6-4449-b37d-8b8ab32339d2/README.md'
I0831 13:24:59.033010 11244 exec.cpp:162] Version: 1.2.2
I0831 13:24:59.038556 11246 exec.cpp:237] Executor registered on agent 
53b59f05-430e-4837-aa14-658ca13a82d3-S2
17/08/31 13:25:00 INFO CoarseGrainedExecutorBackend: Started daemon with 
process name: 7...@ip-10-0-2-43.eu-west-1.compute.internal

[~susanxhuynh] [~arand] [~sowen] This is not an issue anymore for versions >= 
2.2.0; you just need to use the --conf option.



> --conf properties not honored in Mesos cluster mode
> ---
>
> Key: SPARK-13258
> URL: https://issues.apache.org/jira/browse/SPARK-13258
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Michael Gummelt
>
> Spark properties set on {{spark-submit}} via the deprecated 
> {{SPARK_JAVA_OPTS}} are passed along to the driver, but those set via the 
> preferred {{--conf}} are not.
> For example, this results in the URI being fetched in the executor:
> {{SPARK_JAVA_OPTS="-Dspark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md
>  -Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0" 
> ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077  
> --class org.apache.spark.examples.SparkPi 
> http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}}
> This does not:
> {{SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0"
>  ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077 
> --conf 
> spark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md
>  --class org.apache.spark.examples.SparkPi 
> http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}}
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L369
> In the above line of code, you can see that SPARK_JAVA_OPTS is passed along 
> to the driver, so those properties take effect.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L373
> Whereas in this line of code, you see that {{--conf}} variables are set on 
> {{SPARK_EXECUTOR_OPTS}}, which AFAICT has absolutely no effect because this 
> env var is being set on the driver, not the executor.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function

2017-08-31 Thread linxiaojun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

linxiaojun updated SPARK-21882:
---
Description: The first job called from saveAsHadoopDataset, running in each 
executor, does not calculate the writtenBytes of OutputMetrics correctly 
(writtenBytes is 0). The reason is that we did not initialize the callback 
function called to find bytes written in the right way. As usual, 
statisticsTable which records statistics in a FileSystem must be initialized at 
the beginning (this will be triggered when open SparkHadoopWriter). The 
solution for this issue is to adjust the order of callback function 
initialization.   (was: The first job called from saveAsHadoopDataset, running 
in each executor, does not calculate the writtenBytes of OutputMetrics 
correctly. The reason is that we did not initialize the callback function 
called to find bytes written in the right way. As usual, statisticsTable which 
records statistics in a FileSystem must be initialized at the beginning (this 
will be triggered when open SparkHadoopWriter). The solution for this issue is 
to adjust the order of callback function initialization. )

> OutputMetrics doesn't count written bytes correctly in the 
> saveAsHadoopDataset function
> ---
>
> Key: SPARK-21882
> URL: https://issues.apache.org/jira/browse/SPARK-21882
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.2.0
>Reporter: linxiaojun
>Priority: Minor
> Attachments: SPARK-21882.patch
>
>
> The first job called from saveAsHadoopDataset, running in each executor, does 
> not calculate the writtenBytes of OutputMetrics correctly (writtenBytes is 0). 
> The reason is that the callback function used to find the bytes written is not 
> initialized in the right way. The statisticsTable, which records statistics in a 
> FileSystem, must be initialized first (this is triggered when the 
> SparkHadoopWriter is opened). The solution for this issue is to adjust the order 
> of the callback function initialization.
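
For context, here is a rough sketch (not Spark's actual code) of why the 
initialization order matters: the bytes-written callback is derived from Hadoop 
FileSystem statistics, and statistics for a scheme only exist once a FileSystem of 
that scheme has been instantiated, which in this case happens when the 
SparkHadoopWriter is opened.

{code:scala}
// Illustrative sketch only. If the callback captures the statistics collection
// before the writer has opened the FileSystem, the collection can be empty and the
// callback will keep reporting 0 bytes written.
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

def bytesWrittenCallback(): () => Long = {
  val stats = FileSystem.getAllStatistics.asScala // snapshot of the registered statistics
  () => stats.map(_.getBytesWritten).sum
}

// writer.open()                         // opening the writer registers the FileSystem,
// val callback = bytesWrittenCallback() // so the callback must be created after open()
{code}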



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21887) DST on History server

2017-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21887.
---
Resolution: Invalid

> DST on History server
> -
>
> Key: SPARK-21887
> URL: https://issues.apache.org/jira/browse/SPARK-21887
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Iraitz Montalban
>Priority: Minor
>
> I couldn't find any configuration or fix available, so I would like to raise 
> that even though the log files store event timestamps correctly (as UTC), the 
> Spark History Web UI, when showing local time (GMT+1 in our case), does not take 
> Daylight Saving Time into account. This means that, during summer, all our Spark 
> applications show a one-hour difference compared to YARN.
> Has anybody experienced the same?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21887) DST on History server

2017-08-31 Thread Iraitz Montalban (JIRA)
Iraitz Montalban created SPARK-21887:


 Summary: DST on History server
 Key: SPARK-21887
 URL: https://issues.apache.org/jira/browse/SPARK-21887
 Project: Spark
  Issue Type: Question
  Components: Web UI
Affects Versions: 1.6.1
Reporter: Iraitz Montalban
Priority: Minor


I couldn't find any configuration or fix available, so I would like to raise that 
even though the log files store event timestamps correctly (as UTC), the Spark 
History Web UI, when showing local time (GMT+1 in our case), does not take 
Daylight Saving Time into account. This means that, during summer, all our Spark 
applications show a one-hour difference compared to YARN.

Has anybody experienced the same?
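
For what it's worth, the symptom described above matches formatting timestamps 
with a fixed offset rather than a region-based time zone; a small illustrative 
snippet (not History server code, and the zone name is chosen arbitrarily):

{code:scala}
// A fixed offset such as GMT+1 never applies DST; a region-based ZoneId does.
import java.time.{Instant, ZoneId, ZoneOffset}
import java.time.format.DateTimeFormatter

val summer = Instant.parse("2017-07-01T12:00:00Z")
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")

println(fmt.format(summer.atOffset(ZoneOffset.ofHours(1))))    // 2017-07-01 13:00, no DST
println(fmt.format(summer.atZone(ZoneId.of("Europe/Madrid")))) // 2017-07-01 14:00, DST applied
{code}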



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148878#comment-16148878
 ] 

Apache Spark commented on SPARK-21869:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/19096

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.
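
One possible direction, sketched below purely for illustration (not necessarily 
what the linked pull request implements), is to reference-count cached producers 
so that cache expiry never closes one that a task still holds; all names here are 
made up.

{code:scala}
// Hypothetical reference-counted cache: expiry only closes entries with zero holders.
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

class RefCountedCache[K, V](create: K => V, close: V => Unit) {
  private case class Entry(value: V, refs: AtomicInteger)
  private val cache = mutable.Map.empty[K, Entry]

  def acquire(key: K): V = synchronized {
    val entry = cache.getOrElseUpdate(key, Entry(create(key), new AtomicInteger(0)))
    entry.refs.incrementAndGet()
    entry.value
  }

  def release(key: K): Unit = synchronized {
    cache.get(key).foreach(_.refs.decrementAndGet())
  }

  // Called by the expiry path: close only when no task holds a reference.
  def expire(key: K): Unit = synchronized {
    cache.get(key).filter(_.refs.get() == 0).foreach { entry =>
      cache.remove(key)
      close(entry.value)
    }
  }
}
{code}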



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21869:


Assignee: (was: Apache Spark)

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21869:


Assignee: Apache Spark

> A cached Kafka producer should not be closed if any task is using it.
> -
>
> Key: SPARK-21869
> URL: https://issues.apache.org/jira/browse/SPARK-21869
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Right now a cached Kafka producer may be closed if a large task uses it for 
> more than 10 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21885) SQL inferSchema too slow

2017-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148818#comment-16148818
 ] 

Sean Owen commented on SPARK-21885:
---

Please fix the title

> SQL inferSchema too slow
> 
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis process of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY. It 
> will list all the leaf files of the rootPaths (just the tableLocation) and then 
> call `getFileBlockLocations` to turn each `FileStatus` into a `LocatedFileStatus`. 
> Calling `getFileBlockLocations` for so many leaf files takes a long time, and it 
> seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will still 
> launch a Spark job to merge schemas. I think that is expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$a

[jira] [Assigned] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21886:


Assignee: Apache Spark

> Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD 
> logical operator
> ---
>
> Key: SPARK-21886
> URL: https://issues.apache.org/jira/browse/SPARK-21886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> While exploring where {{LogicalRDD}} is created I noticed that there are a 
> few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is 
> to simply re-use the method wherever possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148817#comment-16148817
 ] 

Apache Spark commented on SPARK-21886:
--

User 'jaceklaskowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/19095

> Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD 
> logical operator
> ---
>
> Key: SPARK-21886
> URL: https://issues.apache.org/jira/browse/SPARK-21886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> While exploring where {{LogicalRDD}} is created I noticed that there are a 
> few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is 
> to simply re-use the method wherever possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21886:


Assignee: (was: Apache Spark)

> Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD 
> logical operator
> ---
>
> Key: SPARK-21886
> URL: https://issues.apache.org/jira/browse/SPARK-21886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> While exploring where {{LogicalRDD}} is created I noticed that there are a 
> few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is 
> to simply re-use the method wherever possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator

2017-08-31 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-21886:
---

 Summary: Use SparkSession.internalCreateDataFrame to create 
Dataset with LogicalRDD logical operator
 Key: SPARK-21886
 URL: https://issues.apache.org/jira/browse/SPARK-21886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Minor


While exploring where {{LogicalRDD}} is created I noticed that there are a few 
places that beg for {{SparkSession.internalCreateDataFrame}}. The task is to 
simply re-use the method wherever possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21884:


Assignee: Apache Spark

> Fix StackOverflowError on MetadataOnlyQuery
> ---
>
> Key: SPARK-21884
> URL: https://issues.apache.org/jira/browse/SPARK-21884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
> branch, it doesn't throw StackOverflowError.
> {code}
> scala> spark.version
> res0: String = 2.2.0
> scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
> (p)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION 
> (p=$p)"))
> scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
> java.lang.StackOverflowError
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21884:


Assignee: (was: Apache Spark)

> Fix StackOverflowError on MetadataOnlyQuery
> ---
>
> Key: SPARK-21884
> URL: https://issues.apache.org/jira/browse/SPARK-21884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
> branch, it doesn't throw StackOverflowError.
> {code}
> scala> spark.version
> res0: String = 2.2.0
> scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
> (p)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION 
> (p=$p)"))
> scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
> java.lang.StackOverflowError
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148794#comment-16148794
 ] 

Apache Spark commented on SPARK-21884:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19094

> Fix StackOverflowError on MetadataOnlyQuery
> ---
>
> Key: SPARK-21884
> URL: https://issues.apache.org/jira/browse/SPARK-21884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
> branch, it doesn't throw StackOverflowError.
> {code}
> scala> spark.version
> res0: String = 2.2.0
> scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
> (p)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION 
> (p=$p)"))
> scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
> java.lang.StackOverflowError
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21885) SQL inferSchema too slow

2017-08-31 Thread liupengcheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148793#comment-16148793
 ] 

liupengcheng commented on SPARK-21885:
--

[~smilegator] [~dongjoon] [~viirya]

> SQL inferSchema too slow
> 
>
> Key: SPARK-21885
> URL: https://issues.apache.org/jira/browse/SPARK-21885
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
> Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
> INFER_ONLY
>Reporter: liupengcheng
>  Labels: slow, sql
>
> Currently, SparkSQL schema inference is too slow, taking almost 2 minutes.
> I dug into the code and finally found out the reason:
> 1. In the analysis process of a LogicalPlan, Spark will try to infer the table 
> schema if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY. It 
> will list all the leaf files of the rootPaths (just the tableLocation) and then 
> call `getFileBlockLocations` to turn each `FileStatus` into a `LocatedFileStatus`. 
> Calling `getFileBlockLocations` for so many leaf files takes a long time, and it 
> seems that the location info is never used.
> 2. When inferring a parquet schema, even if there is only one file, it will still 
> launch a Spark job to merge schemas. I think that is expensive.
> The time-costly stack is as follows:
> {code:java}
> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collecti

[jira] [Created] (SPARK-21885) SQL inferSchema too slow

2017-08-31 Thread liupengcheng (JIRA)
liupengcheng created SPARK-21885:


 Summary: SQL inferSchema too slow
 Key: SPARK-21885
 URL: https://issues.apache.org/jira/browse/SPARK-21885
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0, 2.1.0, 2.3.0
 Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to 
INFER_ONLY

Reporter: liupengcheng


Currently, SparkSQL schema inference is too slow; it takes almost 2 minutes.
I dug into the code and finally found the reason:
1. During analysis of the LogicalPlan, Spark tries to infer the table schema 
if `spark.sql.hive.caseSensitiveInferenceMode` is set to INFER_ONLY, and it will 
list all the leaf files of the rootPaths (just the tableLocation) and then call 
`getFileBlockLocations` to turn each `FileStatus` into a `LocatedFileStatus`. 
Calling `getFileBlockLocations` for so many leaf files takes a long time, and it 
seems that the location info is never used.
2. When inferring a Parquet schema, even if there is only one file, it will still 
launch a Spark job to merge schemas. I think that's expensive.
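
As a minimal sketch of point 1 against the plain Hadoop FileSystem API (the table 
path below is just a placeholder, and this is illustrative rather than Spark's 
actual listing code): listing a directory is comparatively cheap, while fetching 
block locations issues one extra NameNode call per file, and schema inference 
never uses those locations.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Comparatively cheap: returns plain FileStatus objects for the directory.
val statuses = fs.listStatus(new Path("/warehouse/db/table"))  // placeholder path

// The expensive step that INFER_ONLY currently triggers but never needs:
// one additional NameNode RPC per file, only to build a LocatedFileStatus.
// statuses.foreach(s => fs.getFileBlockLocations(s, 0, s.getLen))
{code}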

The time-costly stack is as follows:

{code:java}
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224)
at 
org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 

[jira] [Updated] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21884:
--
Description: 
This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
branch, it doesn't throw StackOverflowError.

{code}
scala> spark.version
res0: String = 2.2.0

scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
(p)")
res1: org.apache.spark.sql.DataFrame = []

scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION (p=$p)"))

scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
java.lang.StackOverflowError
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
{code}

  was:
{code}
scala> spark.version
res0: String = 2.2.0

scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
(p)")
res1: org.apache.spark.sql.DataFrame = []

scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION (p=$p)"))

scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
java.lang.StackOverflowError
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
{code}


> Fix StackOverflowError on MetadataOnlyQuery
> ---
>
> Key: SPARK-21884
> URL: https://issues.apache.org/jira/browse/SPARK-21884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to fix StackOverflowError in branch-2.2. In Apache master 
> branch, it doesn't throw StackOverflowError.
> {code}
> scala> spark.version
> res0: String = 2.2.0
> scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
> (p)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION 
> (p=$p)"))
> scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
> java.lang.StackOverflowError
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery

2017-08-31 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21884:
-

 Summary: Fix StackOverflowError on MetadataOnlyQuery
 Key: SPARK-21884
 URL: https://issues.apache.org/jira/browse/SPARK-21884
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Dongjoon Hyun


{code}
scala> spark.version
res0: String = 2.2.0

scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY 
(p)")
res1: org.apache.spark.sql.DataFrame = []

scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION (p=$p)"))

scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect
java.lang.StackOverflowError
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \

2017-08-31 Thread Adrien Lavoillotte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148746#comment-16148746
 ] 

Adrien Lavoillotte commented on SPARK-21850:


[~aokolnychyi] aren't you supposed to double the escape: once for the Scala 
string, once for the SQL layer?


{code:none}
spark.sql("SELECT * FROM t1 WHERE col2 = 'n'").show()
+++
|col1|col2|
+++
|   1|  \n|
+++
{code}

> SparkSQL cannot perform LIKE someColumn if someColumn's value contains a 
> backslash \
> 
>
> Key: SPARK-21850
> URL: https://issues.apache.org/jira/browse/SPARK-21850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrien Lavoillotte
>
> I have a test table looking like this:
> {code:none}
> spark.sql("select * from `test`.`types_basic`").show()
> {code}
> ||id||c_tinyint|| [...] || c_string||
> |  0| -128| [...] |  string|
> |  1|0| [...] |string 'with' "qu...|
> |  2|  127| [...] |  unicod€ strĭng|
> |  3|   42| [...] |there is a \n in ...|
> |  4| null| [...] |null|
> Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n 
> string", not a line break). I would like to join another table using a LIKE 
> condition (to join on prefix). If I do this:
> {code:none}
> spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE 
> CONCAT(a.`c_string`, '%')").show()
> {code}
> I get the following error in spark 2.2 (but not in any earlier version):
> {noformat}
> 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 
> (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the 
> pattern 'there is a \n in this line%' is invalid, the escape character is not 
> allowed to precede 'n';
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It seems to me that if LIKE requires special escaping there, then it should 
> be provided by SparkSQL on the value of the column.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box

2017-08-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148713#comment-16148713
 ] 

Saisai Shao commented on SPARK-11574:
-

Thanks a lot [~srowen]!

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Xiaofeng Lin
>Assignee: Xiaofeng Lin
> Fix For: 2.3.0
>
>
> In order to run spark in production, monitoring is essential. StatsD is such 
> a common metric reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> datadog, etc. 
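
As a rough sketch of what "out of the box" could look like from the user side, 
assuming the sink lands under the usual org.apache.spark.metrics.sink package and 
follows the Graphite sink's host/port/prefix keys (both assumptions, not details 
confirmed in this thread):

{code}
import org.apache.spark.SparkConf

// Assumed class name and property keys, modeled on the GraphiteSink convention
// and the documented spark.metrics.conf.* prefix for metrics properties.
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.statsd.class",
       "org.apache.spark.metrics.sink.StatsdSink")
  .set("spark.metrics.conf.*.sink.statsd.host", "127.0.0.1")
  .set("spark.metrics.conf.*.sink.statsd.port", "8125")
  .set("spark.metrics.conf.*.sink.statsd.prefix", "myapp")
{code}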



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11574) Spark should support StatsD sink out of box

2017-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-11574:
-

Assignee: Xiaofeng Lin

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Xiaofeng Lin
>Assignee: Xiaofeng Lin
> Fix For: 2.3.0
>
>
> In order to run spark in production, monitoring is essential. StatsD is such 
> a common metric reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> datadog, etc. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box

2017-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148707#comment-16148707
 ] 

Sean Owen commented on SPARK-11574:
---

[~jerryshao] I'll make you a JIRA admin so you can add the user to the 
"Contributor" role. It won't show up otherwise.

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Xiaofeng Lin
> Fix For: 2.3.0
>
>
> In order to run spark in production, monitoring is essential. StatsD is such 
> a common metric reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> datadog, etc. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \

2017-08-31 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148705#comment-16148705
 ] 

Anton Okolnychyi commented on SPARK-21850:
--

Then we should not be bound to the LIKE case only. I am wondering what behavior 
is expected in the code below.

{code}
Seq((1, "\\n")).toDF("col1", "col2").write.saveAsTable("t1")
spark.sql("SELECT * FROM t1").show()
+++
|col1|col2|
+++
|   1|  \n|
+++
spark.sql("SELECT * FROM t1 WHERE col2 = '\\n'").show() // does it give a wrong 
result?
+++
|col1|col2|
+++
+++
spark.sql("SELECT * FROM t1 WHERE col2 = '\n'").show()
+++
|col1|col2|
+++
+++
{code}

Just to mention, [PR#15398|https://github.com/apache/spark/pull/15398] was also 
merged in 2.1 (might not be publicly available yet). I got the same exception 
in the 2.1 branch using the following code:

{code}
Seq((1, "\\n")).toDF("col1", "col2").write.saveAsTable("t1")
spark.sql("SELECT * FROM t1").show()
spark.sql("SELECT * FROM t1 WHERE col2 LIKE CONCAT(col2, '%')").show()
{code}


> SparkSQL cannot perform LIKE someColumn if someColumn's value contains a 
> backslash \
> 
>
> Key: SPARK-21850
> URL: https://issues.apache.org/jira/browse/SPARK-21850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrien Lavoillotte
>
> I have a test table looking like this:
> {code:none}
> spark.sql("select * from `test`.`types_basic`").show()
> {code}
> ||id||c_tinyint|| [...] || c_string||
> |  0| -128| [...] |  string|
> |  1|0| [...] |string 'with' "qu...|
> |  2|  127| [...] |  unicod€ strĭng|
> |  3|   42| [...] |there is a \n in ...|
> |  4| null| [...] |null|
> Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n 
> string", not a line break). I would like to join another table using a LIKE 
> condition (to join on prefix). If I do this:
> {code:none}
> spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE 
> CONCAT(a.`c_string`, '%')").show()
> {code}
> I get the following error in spark 2.2 (but not in any earlier version):
> {noformat}
> 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 
> (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the 
> pattern 'there is a \n in this line%' is invalid, the escape character is not 
> allowed to precede 'n';
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at 
> org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> It seems to me that if LIKE requires special escaping there, then it should 
> be provided by SparkSQL on the value of the column.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box

2017-08-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148701#comment-16148701
 ] 

Saisai Shao commented on SPARK-11574:
-

Ping [~srowen]. Hi Sean, I cannot assign this JIRA to Xiaofeng because I cannot 
find his name in the prompt list. Would you please help check it and, if 
possible, assign the JIRA to him? I have already merged the change on GitHub. 
Thanks a lot!

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Xiaofeng Lin
> Fix For: 2.3.0
>
>
> In order to run spark in production, monitoring is essential. StatsD is such 
> a common metric reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> datadog, etc. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18523) OOM killer may leave SparkContext in broken state causing Connection Refused errors

2017-08-31 Thread Alexander Shorin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148698#comment-16148698
 ] 

Alexander Shorin commented on SPARK-18523:
--

[~kadeng]
I don't have 2.2.0 in production for now (shame on me!), but I will check this. 
Thanks for the report!

> OOM killer may leave SparkContext in broken state causing Connection Refused 
> errors
> ---
>
> Key: SPARK-18523
> URL: https://issues.apache.org/jira/browse/SPARK-18523
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Alexander Shorin
>Assignee: Alexander Shorin
> Fix For: 2.1.0
>
>
> When you run a memory-heavy Spark job, the Spark driver may consume more 
> memory than the host can provide.
> In this case the OOM killer steps in and kills the spark-submit process.
> The pyspark.SparkContext cannot handle this situation and ends up completely 
> broken. 
> You cannot stop it: on stop it tries to call the stop method of the bound 
> Java context (jsc) and fails with a Py4JError, because that process no longer 
> exists, and neither does the connection to it. 
> You cannot start a new SparkContext because the broken one is still 
> registered as the active one, and PySpark still treats SparkContext as a 
> singleton.
> The only thing you can do is shut down your IPython Notebook and start it 
> over, or dive into the SparkContext internal attributes and reset them 
> manually to their initial None state.
> The OOM killer case is just one of many: any spark-submit crash in the 
> middle of something leaves the SparkContext in a broken state.
> Example on error log on {{sc.stop()}} in broken state:
> {code}
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 
> 883, in send_command
> response = connection.send_command(command)
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 
> 1040, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:59911)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 
> 963, in start
> self.socket.connect((self.address, self.port))
>   File "/usr/local/lib/python2.7/socket.py", line 224, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 61] Connection refused
> ---
> Py4JError Traceback (most recent call last)
>  in ()
> > 1 sc.stop()
> /usr/local/share/spark/python/pyspark/context.py in stop(self)
> 360 """
> 361 if getattr(self, "_jsc", None):
> --> 362 self._jsc.stop()
> 363 self._jsc = None
> 364 if getattr(self, "_accumulatorServer", None):
> /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in 
> __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134 
>1135 for temp_arg in temp_args:
> /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  43 def deco(*a, **kw):
>  44 try:
> ---> 45 return f(*a, **kw)
>  46 except py4j.protocol.Py4JJavaError as e:
>  47 s = e.java_exception.toString()
> /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in 
> get_return_value(answer, gateway_client, target_id, name)
> 325 raise Py4JError(
> 326 "An error occurred while calling {0}{1}{2}".
> --> 327 format(target_id, ".", name))
> 328 else:
> 329 type = answer[1]
> Py4JError: An error occurred while calling o47.stop
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21883) When a stage fails, remove its child stages' pending state on the Spark UI and mark the jobs this stage belongs to as failed

2017-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21883.
---
Resolution: Invalid

I'm going to preemptively close this because it's poorly described, is reported 
against a stale version and branch, and sounds like it might be a duplicate of 
other similar JIRAs.

> When a stage fails, remove its child stages' pending state on the Spark UI 
> and mark the jobs this stage belongs to as failed
> -
>
> Key: SPARK-21883
> URL: https://issues.apache.org/jira/browse/SPARK-21883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: xianlongZhang
>
> When a stage fails, we should remove its child stages from the “Pending” tab 
> on the Spark UI, mark the jobs this stage belongs to as failed, and then 
> respond to the client. Usually this works, but when running the Spark Thrift 
> Server with the fair scheduler and multiple concurrent SQL queries, a task 
> sometimes fails with a "FetchFailed" status and Spark does not abort the 
> failed stage or mark the job as failed, so the job hangs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21883) When a stage fails, remove its child stages' pending state on the Spark UI and mark the jobs this stage belongs to as failed

2017-08-31 Thread xianlongZhang (JIRA)
xianlongZhang created SPARK-21883:
-

 Summary: When a stage fails, remove its child stages' pending state 
on the Spark UI and mark the jobs this stage belongs to as failed
 Key: SPARK-21883
 URL: https://issues.apache.org/jira/browse/SPARK-21883
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: xianlongZhang


When a stage fails, we should remove its child stages from the “Pending” tab on 
the Spark UI, mark the jobs this stage belongs to as failed, and then respond to 
the client. Usually this works, but when running the Spark Thrift Server with 
the fair scheduler and multiple concurrent SQL queries, a task sometimes fails 
with a "FetchFailed" status and Spark does not abort the failed stage or mark 
the job as failed, so the job hangs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148643#comment-16148643
 ] 

Sean Owen commented on SPARK-21866:
---

It makes some sense. I guess I'm mostly trying to match up the scope that a 
SPIP implies with the relatively simple functionality here. Is this not just 
about a page of code that calls ImageIO to parse a BufferedImage and maps its 
fields to a Row? That does look like the substance of 
https://github.com/Microsoft/spark-images/blob/master/src/main/scala/org/apache/spark/image/ImageSchema.scala
Well, maybe this is just a really small SPIP.

Also, why "This Spark package will also be published in a binary form on 
spark-packages.org"? It'd be discontinued and included in Spark, right, like 
the CSV parser?
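
For concreteness, a rough sketch of that "page of code" (illustrative only; the 
flat field layout below is an assumption, not the schema the SPIP proposes):

{code}
import java.io.File
import javax.imageio.ImageIO
import org.apache.spark.sql.Row

// Decode an image file with javax.imageio and flatten it into a Row:
// (origin, height, width, nChannels, channel-interleaved pixel bytes).
def imageToRow(path: String): Row = {
  val img = ImageIO.read(new File(path))  // BufferedImage
  val h = img.getHeight
  val w = img.getWidth
  val data = new Array[Byte](h * w * 3)
  var i = 0
  for (y <- 0 until h; x <- 0 until w) {
    val rgb = img.getRGB(x, y)               // packed ARGB int
    data(i)     = ((rgb >> 16) & 0xFF).toByte // R
    data(i + 1) = ((rgb >> 8) & 0xFF).toByte  // G
    data(i + 2) = (rgb & 0xFF).toByte         // B
    i += 3
  }
  Row(path, h, w, 3, data)
}
{code}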


> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** 

[jira] [Resolved] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors

2017-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21881.
---
Resolution: Duplicate

Let's not fork the discussion. Post on the original issue and someone can 
reopen if they believe it's valid. All the better if you have a pull request.

> Again: OOM killer may leave SparkContext in broken state causing Connection 
> Refused errors
> --
>
> Key: SPARK-21881
> URL: https://issues.apache.org/jira/browse/SPARK-21881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Kai Londenberg
>Assignee: Alexander Shorin
>
> This is a duplicate of SPARK-18523, which was not really fixed for me 
> (PySpark 2.2.0, Python 3.5, py4j 0.10.4 )
> *Original Summary:*
> When you run a memory-heavy Spark job, the Spark driver may consume more 
> memory than the host can provide.
> In this case the OOM killer steps in and kills the spark-submit process.
> The pyspark.SparkContext cannot handle this situation and ends up completely 
> broken. 
> You cannot stop it: on stop it tries to call the stop method of the bound 
> Java context (jsc) and fails with a Py4JError, because that process no longer 
> exists, and neither does the connection to it. 
> You cannot start a new SparkContext because the broken one is still 
> registered as the active one, and PySpark still treats SparkContext as a 
> singleton.
> The only thing you can do is shut down your IPython Notebook and start it 
> over, or dive into the SparkContext internal attributes and reset them 
> manually to their initial None state.
> The OOM killer case is just one of many: any spark-submit crash in the 
> middle of something leaves the SparkContext in a broken state.
> *Latest Comment*
> In PySpark 2.2.0 this issue was not really fixed. While I could close the 
> SparkContext (it raised an exception but was closed afterwards), I could not 
> open any new Spark contexts.
> *Current Workaround*
> If I reset the global SparkContext variables like this, it worked:
> {code:none}
> def reset_spark():
> import pyspark
> from threading import RLock
> pyspark.SparkContext._jvm = None
> pyspark.SparkContext._gateway = None
> pyspark.SparkContext._next_accum_id = 0
> pyspark.SparkContext._active_spark_context = None
> pyspark.SparkContext._lock = RLock()
> pyspark.SparkContext._python_includes = None
> reset_spark()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14327) Scheduler holds locks which cause huge scheduler delays and executor timeouts

2017-08-31 Thread Fei Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148622#comment-16148622
 ] 

Fei Chen commented on SPARK-14327:
--

I also came across this problem. In the TaskScheduler, a thread that holds the 
lock is blocked by some unknown factor, and other threads must wait until the 
lock is released. As a result, the scheduler delay reaches tens of seconds or 
even several minutes. How can I solve this problem?

> Scheduler holds locks which cause huge scheduler delays and executor timeouts
> -
>
> Key: SPARK-14327
> URL: https://issues.apache.org/jira/browse/SPARK-14327
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Chris Bannister
> Attachments: driver.jstack
>
>
> I have a job which after a while in one of its stages grinds to a halt, from 
> processing around 300k tasks in 15 minutes to less than 1000 in the next 
> hour. The driver ends up using 100% CPU on a single core (out of 4) and the 
> executors start failing to receive heartbeat responses, tasks are not 
> scheduled and results trickle in.
> For this stage the max scheduler delay is 15 minutes, and the 75th percentile 
> is 4ms.
> It appears that TaskSchedulerImpl does most of its work whilst holding the 
> global synchronised lock for the class; this synchronised lock is shared 
> between at least:
> TaskSetManager.canFetchMoreResults
> TaskSchedulerImpl.handleSuccessfulTask
> TaskSchedulerImpl.executorHeartbeatReceived
> TaskSchedulerImpl.statusUpdate
> TaskSchedulerImpl.checkSpeculatableTasks
> This looks to severely limit the latency and throughput of the scheduler, and 
> causes my job to fail outright because it takes too long.
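
For illustration, a sketch (not the actual TaskSchedulerImpl source) of the 
coarse-locking pattern described above: every entry point synchronizes on the 
same monitor, so a single slow call blocks heartbeats and status updates alike.

{code}
// Illustrative only: methods that all contend for one object-wide monitor.
class CoarselyLockedScheduler {
  def statusUpdate(taskId: Long, result: Array[Byte]): Unit = synchronized {
    // bookkeeping and result handling happen while holding the global lock
  }
  def executorHeartbeatReceived(executorId: String): Unit = synchronized {
    // heartbeats queue up behind statusUpdate on the same lock
  }
  def checkSpeculatableTasks(): Unit = synchronized {
    // periodic scans also contend for the same monitor
  }
}
{code}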



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function

2017-08-31 Thread linxiaojun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

linxiaojun updated SPARK-21882:
---
Attachment: SPARK-21882.patch

SPARK-21882.patch for Spark-1.6.1

> OutputMetrics doesn't count written bytes correctly in the 
> saveAsHadoopDataset function
> ---
>
> Key: SPARK-21882
> URL: https://issues.apache.org/jira/browse/SPARK-21882
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.2.0
>Reporter: linxiaojun
>Priority: Minor
> Attachments: SPARK-21882.patch
>
>
> The first job called from saveAsHadoopDataset, running in each executor, does 
> not calculate the writtenBytes of OutputMetrics correctly. The reason is that 
> the callback function used to find the bytes written is not initialized in 
> the right way: the statistics table that records statistics for a FileSystem 
> must be populated first (this is triggered when the SparkHadoopWriter is 
> opened). The fix for this issue is to adjust the order of the callback 
> function initialization. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function

2017-08-31 Thread linxiaojun (JIRA)
linxiaojun created SPARK-21882:
--

 Summary: OutputMetrics doesn't count written bytes correctly in 
the saveAsHadoopDataset function
 Key: SPARK-21882
 URL: https://issues.apache.org/jira/browse/SPARK-21882
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0, 1.6.1
Reporter: linxiaojun
Priority: Minor


The first job called from saveAsHadoopDataset, running in each executor, does 
not calculate the writtenBytes of OutputMetrics correctly. The reason is that 
the callback function used to find the bytes written is not initialized in the 
right way: the statistics table that records statistics for a FileSystem must be 
populated first (this is triggered when the SparkHadoopWriter is opened). The 
fix for this issue is to adjust the order of the callback function 
initialization. 
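
A minimal sketch of the initialization-order point, written against the Hadoop 
FileSystem statistics API; this is illustrative only, not Spark's actual 
callback code:

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// A callback built this way only "sees" FileSystem instances whose statistics
// were already registered when the callback was created; a FileSystem opened
// afterwards is never counted, which is the ordering problem described above.
def bytesWrittenCallback(): () => Long = {
  val captured = FileSystem.getAllStatistics.asScala  // snapshot of the table
  val baseline = captured.map(_.getBytesWritten).sum
  () => captured.map(_.getBytesWritten).sum - baseline
}
{code}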



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors

2017-08-31 Thread Kai Londenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Londenberg updated SPARK-21881:
---
Description: 
This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 
2.2.0, Python 3.5, py4j 0.10.4 )

*Original Summary:*

When you run a memory-heavy Spark job, the Spark driver may consume more memory 
than the host can provide.

In this case the OOM killer steps in and kills the spark-submit process.
The pyspark.SparkContext cannot handle this situation and ends up completely 
broken. 
You cannot stop it: on stop it tries to call the stop method of the bound Java 
context (jsc) and fails with a Py4JError, because that process no longer exists, 
and neither does the connection to it. 
You cannot start a new SparkContext because the broken one is still registered 
as the active one, and PySpark still treats SparkContext as a singleton.
The only thing you can do is shut down your IPython Notebook and start it over, 
or dive into the SparkContext internal attributes and reset them manually to 
their initial None state.

The OOM killer case is just one of many: any spark-submit crash in the middle of 
something leaves the SparkContext in a broken state.

*Latest Comment*

In PySpark 2.2.0 this issue was not really fixed. While I could close the 
SparkContext (it raised an exception but was closed afterwards), I could not 
open any new Spark contexts.

*Current Workaround*

If I reset the global SparkContext variables like this, it worked:

{code:none}
def reset_spark():
import pyspark
from threading import RLock
pyspark.SparkContext._jvm = None
pyspark.SparkContext._gateway = None
pyspark.SparkContext._next_accum_id = 0
pyspark.SparkContext._active_spark_context = None
pyspark.SparkContext._lock = RLock()
pyspark.SparkContext._python_includes = None
reset_spark()
{code}






  was:

This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 
2.2.0, Python 3.5, py4j 0.10.4 )

*Original Summary:*

When you run some memory-heavy spark job, Spark driver may consume more memory 
resources than host available to provide.

In this case OOM killer comes on scene and successfully kills a spark-submit 
process.
The pyspark.SparkContext is not able to handle such state of things and becomes 
completely broken. 
You cannot stop it as on stop it tries to call stop method of bounded java 
context (jsc) and fails with Py4JError, because such process no longer exists 
as like as the connection to it. 
You cannot start new SparkContext because you have your broken one as active 
one and pyspark still is not able to not have SparkContext as sort of singleton.
The only thing you can do is shutdown your IPython Notebook and start it over. 
Or dive into SparkContext internal attributes and reset them manually to 
initial None state.

The OOM killer case is just one of the many: any reason of spark-submit crash 
in the middle of something leaves SparkContext in a broken state.

*Latest Comment (for PySpark 2.2.0) *

In PySpark 2.2.0 this issue was not really fixed. While I could close the 
SparkContext (with an Exception message, but it was closed afterwards), I could 
not reopen any new spark contexts.

*Current Workaround*

If I resetted the global SparkContext variables like this, it worked :

{code:none}
def reset_spark():
import pyspark
from threading import RLock
pyspark.SparkContext._jvm = None
pyspark.SparkContext._gateway = None
pyspark.SparkContext._next_accum_id = 0
pyspark.SparkContext._active_spark_context = None
pyspark.SparkContext._lock = RLock()
pyspark.SparkContext._python_includes = None
reset_spark()
{code}







> Again: OOM killer may leave SparkContext in broken state causing Connection 
> Refused errors
> --
>
> Key: SPARK-21881
> URL: https://issues.apache.org/jira/browse/SPARK-21881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Kai Londenberg
>Assignee: Alexander Shorin
>
> This is a duplicate of SPARK-18523, which was not really fixed for me 
> (PySpark 2.2.0, Python 3.5, py4j 0.10.4 )
> *Original Summary:*
> When you run some memory-heavy spark job, Spark driver may consume more 
> memory resources than host available to provide.
> In this case OOM killer comes on scene and successfully kills a spark-submit 
> process.
> The pyspark.SparkContext is not able to handle such state of things and 
> becomes completely broken. 
> You cannot stop it as on stop it tries to call stop method of bounded java 
> context (jsc) and fails with Py4JError, because such process no longer exists 
> as like as the connection to it. 
> You canno

[jira] [Updated] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors

2017-08-31 Thread Kai Londenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Londenberg updated SPARK-21881:
---
Affects Version/s: (was: 1.6.1)
   (was: 2.0.0)

> Again: OOM killer may leave SparkContext in broken state causing Connection 
> Refused errors
> --
>
> Key: SPARK-21881
> URL: https://issues.apache.org/jira/browse/SPARK-21881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Kai Londenberg
>Assignee: Alexander Shorin
>
> This is a duplicate of SPARK-18523, which was not really fixed for me 
> (PySpark 2.2.0, Python 3.5, py4j 0.10.4 )
> *Original Summary:*
> When you run some memory-heavy spark job, Spark driver may consume more 
> memory resources than host available to provide.
> In this case OOM killer comes on scene and successfully kills a spark-submit 
> process.
> The pyspark.SparkContext is not able to handle such state of things and 
> becomes completely broken. 
> You cannot stop it as on stop it tries to call stop method of bounded java 
> context (jsc) and fails with Py4JError, because such process no longer exists 
> as like as the connection to it. 
> You cannot start new SparkContext because you have your broken one as active 
> one and pyspark still is not able to not have SparkContext as sort of 
> singleton.
> The only thing you can do is shutdown your IPython Notebook and start it 
> over. Or dive into SparkContext internal attributes and reset them manually 
> to initial None state.
> The OOM killer case is just one of the many: any reason of spark-submit crash 
> in the middle of something leaves SparkContext in a broken state.
> *Latest Comment*
> In PySpark 2.2.0 this issue was not really fixed. While I could close the 
> SparkContext (with an Exception message, but it was closed afterwards), I 
> could not reopen any new spark contexts.
> *Current Workaround*
> If I resetted the global SparkContext variables like this, it worked :
> {code:none}
> def reset_spark():
> import pyspark
> from threading import RLock
> pyspark.SparkContext._jvm = None
> pyspark.SparkContext._gateway = None
> pyspark.SparkContext._next_accum_id = 0
> pyspark.SparkContext._active_spark_context = None
> pyspark.SparkContext._lock = RLock()
> pyspark.SparkContext._python_includes = None
> reset_spark()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors

2017-08-31 Thread Kai Londenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Londenberg updated SPARK-21881:
---
Description: 

This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 
2.2.0, Python 3.5, py4j 0.10.4 )

*Original Summary:*

When you run some memory-heavy spark job, Spark driver may consume more memory 
resources than host available to provide.

In this case OOM killer comes on scene and successfully kills a spark-submit 
process.
The pyspark.SparkContext is not able to handle such state of things and becomes 
completely broken. 
You cannot stop it as on stop it tries to call stop method of bounded java 
context (jsc) and fails with Py4JError, because such process no longer exists 
as like as the connection to it. 
You cannot start new SparkContext because you have your broken one as active 
one and pyspark still is not able to not have SparkContext as sort of singleton.
The only thing you can do is shutdown your IPython Notebook and start it over. 
Or dive into SparkContext internal attributes and reset them manually to 
initial None state.

The OOM killer case is just one of the many: any reason of spark-submit crash 
in the middle of something leaves SparkContext in a broken state.

*Latest Comment (for PySpark 2.2.0) *

In PySpark 2.2.0 this issue was not really fixed. While I could close the 
SparkContext (with an Exception message, but it was closed afterwards), I could 
not reopen any new spark contexts.

*Current Workaround*

If I resetted the global SparkContext variables like this, it worked :

{code:none}
def reset_spark():
import pyspark
from threading import RLock
pyspark.SparkContext._jvm = None
pyspark.SparkContext._gateway = None
pyspark.SparkContext._next_accum_id = 0
pyspark.SparkContext._active_spark_context = None
pyspark.SparkContext._lock = RLock()
pyspark.SparkContext._python_includes = None
reset_spark()
{code}






  was:
When you run some memory-heavy spark job, Spark driver may consume more memory 
resources than host available to provide.

In this case OOM killer comes on scene and successfully kills a spark-submit 
process.

The pyspark.SparkContext is not able to handle such state of things and becomes 
completely broken. 

You cannot stop it as on stop it tries to call stop method of bounded java 
context (jsc) and fails with Py4JError, because such process no longer exists 
as like as the connection to it. 

You cannot start new SparkContext because you have your broken one as active 
one and pyspark still is not able to not have SparkContext as sort of singleton.

The only thing you can do is shutdown your IPython Notebook and start it over. 
Or dive into SparkContext internal attributes and reset them manually to 
initial None state.

The OOM killer case is just one of the many: any reason of spark-submit crash 
in the middle of something leaves SparkContext in a broken state.

Example on error log on {{sc.stop()}} in broken state:
{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 883, 
in send_command
response = connection.send_command(command)
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 
1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:59911)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, 
in start
self.socket.connect((self.address, self.port))
  File "/usr/local/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 61] Connection refused

---
Py4JError Traceback (most recent call last)
 in ()
> 1 sc.stop()

/usr/local/share/spark/python/pyspark/context.py in stop(self)
360 """
361 if getattr(self, "_jsc", None):
--> 362 self._jsc.stop()
363 self._jsc = None
364 if getattr(self, "_accumulatorServer", None):

/usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, 
*args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/usr/local/lib/pytho
