[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match
[ https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150091#comment-16150091 ] Liang-Chi Hsieh commented on SPARK-21658: - Thanks [~hyukjin.kwon] and [~jerryshao]. > Adds the default None for value in na.replace in PySpark to match > - > > Key: SPARK-21658 > URL: https://issues.apache.org/jira/browse/SPARK-21658 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Chin Han Yu >Priority: Minor > Labels: Starter > Fix For: 2.3.0 > > > Looks {{na.replace}} missed the default value {{None}}. > Both docs says they are aliases > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace > but the default values looks different, which ends up with: > {code} > >>> df = spark.createDataFrame([('Alice', 10, 80.0)]) > >>> df.replace({"Alice": "a"}).first() > Row(_1=u'a', _2=10, _3=80.0) > >>> df.na.replace({"Alice": "a"}).first() > Traceback (most recent call last): > File "", line 1, in > TypeError: replace() takes at least 3 arguments (2 given) > {code} > To take the advantage of SPARK-19454, sounds we should match them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
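For reference, a minimal sketch of the behaviour this fix targets, assuming a local SparkSession; once {{na.replace}} gets the default {{value=None}}, both entry points accept the same dictionary-only call:

{code:python}
# Minimal sketch (assumes a local SparkSession); with the default value=None in
# place, df.na.replace accepts the same dictionary-only call as df.replace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("na-replace-demo").getOrCreate()
df = spark.createDataFrame([('Alice', 10, 80.0)])

# Both calls are expected to return Row(_1='a', _2=10, _3=80.0) after the fix.
print(df.replace({"Alice": "a"}).first())
print(df.na.replace({"Alice": "a"}).first())

spark.stop()
{code}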
[jira] [Commented] (SPARK-21859) SparkFiles.get failed on driver in yarn-cluster and yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-21859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150088#comment-16150088 ] Apache Spark commented on SPARK-21859: -- User 'lgrcyanny' has created a pull request for this issue: https://github.com/apache/spark/pull/19102 > SparkFiles.get failed on driver in yarn-cluster and yarn-client mode > > > Key: SPARK-21859 > URL: https://issues.apache.org/jira/browse/SPARK-21859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.2.1 >Reporter: Cyanny > > When using SparkFiles.get to access a file on the driver in yarn-client or yarn-cluster mode, it > will report a file-not-found exception. > This exception only happens on the driver; SparkFiles.get works fine on the > executors. > > We can reproduce the bug as follows: > ```scala > def testOnDriver(fileName: String) = { > val file = new File(SparkFiles.get(fileName)) > if (!file.exists()) { > logging.info(s"$file not exist") > } else { > // print file content on driver > val content = Source.fromFile(file).getLines().mkString("\n") > logging.info(s"File content: ${content}") > } > } > // the output will be file not exist > ``` > > ```python > conf = SparkConf().setAppName("test files") > sc = SparkContext(appName="spark files test") > > def test_on_driver(filename): > file = SparkFiles.get(filename) > print("file path: {}".format(file)) > if os.path.exists(file): > with open(file) as f: > lines = f.readlines() > print(lines) > else: > print("file doesn't exist") > run_command("ls .") > ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
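A hedged reproduction sketch in Python ({{my.conf}} is a placeholder file name, assumed to have been shipped with {{--files}} or {{sc.addFile}}); it only illustrates the driver/executor difference described above:

{code:python}
# Hedged reproduction sketch: checks whether a file shipped with --files or
# sc.addFile ("my.conf" is a placeholder) is visible via SparkFiles.get on the
# driver, and compares with the same lookup on an executor.
import os

from pyspark import SparkConf, SparkContext, SparkFiles

sc = SparkContext(conf=SparkConf().setAppName("sparkfiles-driver-check"))

path = SparkFiles.get("my.conf")  # resolves under SparkFiles.getRootDirectory()
print("driver path: {}".format(path))
print("exists on driver: {}".format(os.path.exists(path)))

# The same lookup on an executor is reported to work fine:
print(sc.parallelize([0], 1)
        .map(lambda _: os.path.exists(SparkFiles.get("my.conf")))
        .collect())

sc.stop()
{code}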
[jira] [Created] (SPARK-21892) status code is 200 OK when kill application fail via spark master rest api
Zhuang Xueyin created SPARK-21892: - Summary: status code is 200 OK when kill application fail via spark master rest api Key: SPARK-21892 URL: https://issues.apache.org/jira/browse/SPARK-21892 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Zhuang Xueyin Priority: Minor Sent a POST request to the Spark master REST API, e.g.: http://:6066/v1/submissions/kill/driver-xxx Request body: { "action" : "KillSubmissionRequest", "clientSparkVersion" : "2.1.0", } Response body: { "action" : "KillSubmissionResponse", "message" : "Driver driver-xxx has already finished or does not exist", "serverSparkVersion" : "2.1.0", "submissionId" : "driver-xxx", "success" : false } Response headers: *Status Code: 200 OK* Content-Length: 203 Content-Type: application/json; charset=UTF-8 Date: Fri, 01 Sep 2017 05:56:04 GMT Server: Jetty(9.2.z-SNAPSHOT) Result: the status code is 200 OK even when killing the application via the Spark master REST API fails, while the response body indicates that the operation was not successful. This does not follow REST API conventions; suggest improving it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
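A hedged client-side sketch of the point being made: because the endpoint answers 200 OK either way, callers currently have to inspect the JSON {{success}} field themselves (host and driver id below are placeholders):

{code:python}
# Hedged client-side sketch: the submission REST endpoint currently answers
# 200 OK even when the kill fails, so the JSON "success" flag has to be
# checked explicitly. Host and driver id are placeholders.
import json
import urllib.request

url = "http://master-host:6066/v1/submissions/kill/driver-xxx"
payload = json.dumps({
    "action": "KillSubmissionRequest",
    "clientSparkVersion": "2.1.0",
}).encode("utf-8")

req = urllib.request.Request(url, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read().decode("utf-8"))
    print("HTTP status:", resp.status)          # 200 even on failure today
    print("kill succeeded:", body.get("success"))
{code}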
[jira] [Assigned] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match
[ https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-21658: --- Assignee: Chin Han Yu > Adds the default None for value in na.replace in PySpark to match > - > > Key: SPARK-21658 > URL: https://issues.apache.org/jira/browse/SPARK-21658 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Chin Han Yu >Priority: Minor > Labels: Starter > Fix For: 2.3.0 > > > Looks {{na.replace}} missed the default value {{None}}. > Both docs says they are aliases > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace > but the default values looks different, which ends up with: > {code} > >>> df = spark.createDataFrame([('Alice', 10, 80.0)]) > >>> df.replace({"Alice": "a"}).first() > Row(_1=u'a', _2=10, _3=80.0) > >>> df.na.replace({"Alice": "a"}).first() > Traceback (most recent call last): > File "", line 1, in > TypeError: replace() takes at least 3 arguments (2 given) > {code} > To take the advantage of SPARK-19454, sounds we should match them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match
[ https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150059#comment-16150059 ] Saisai Shao commented on SPARK-21658: - Done :). > Adds the default None for value in na.replace in PySpark to match > - > > Key: SPARK-21658 > URL: https://issues.apache.org/jira/browse/SPARK-21658 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Chin Han Yu >Priority: Minor > Labels: Starter > Fix For: 2.3.0 > > > Looks {{na.replace}} missed the default value {{None}}. > Both docs says they are aliases > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace > but the default values looks different, which ends up with: > {code} > >>> df = spark.createDataFrame([('Alice', 10, 80.0)]) > >>> df.replace({"Alice": "a"}).first() > Row(_1=u'a', _2=10, _3=80.0) > >>> df.na.replace({"Alice": "a"}).first() > Traceback (most recent call last): > File "", line 1, in > TypeError: replace() takes at least 3 arguments (2 given) > {code} > To take the advantage of SPARK-19454, sounds we should match them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match
[ https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150022#comment-16150022 ] Hyukjin Kwon commented on SPARK-21658: -- [~jerryshao], I think you are now able to add a user to "Contributor" role if I understood correctly. This, [~byakuinss] user fixed this issue by https://github.com/apache/spark/pull/18895. Do you mind if I ask to assign this JIRA to her please? > Adds the default None for value in na.replace in PySpark to match > - > > Key: SPARK-21658 > URL: https://issues.apache.org/jira/browse/SPARK-21658 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: Starter > Fix For: 2.3.0 > > > Looks {{na.replace}} missed the default value {{None}}. > Both docs says they are aliases > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace > http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace > but the default values looks different, which ends up with: > {code} > >>> df = spark.createDataFrame([('Alice', 10, 80.0)]) > >>> df.replace({"Alice": "a"}).first() > Row(_1=u'a', _2=10, _3=80.0) > >>> df.na.replace({"Alice": "a"}).first() > Traceback (most recent call last): > File "", line 1, in > TypeError: replace() takes at least 3 arguments (2 given) > {code} > To take the advantage of SPARK-19454, sounds we should match them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21789) Remove obsolete codes for parsing abstract schema strings
[ https://issues.apache.org/jira/browse/SPARK-21789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21789. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18647 [https://github.com/apache/spark/pull/18647] > Remove obsolete codes for parsing abstract schema strings > - > > Key: SPARK-21789 > URL: https://issues.apache.org/jira/browse/SPARK-21789 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > It looks there are private functions that look not used in the main codes - > {{_split_schema_abstract}}, {{_parse_field_abstract}}, > {{_parse_schema_abstract}} and {{_infer_schema_type}}, located in > {{python/pyspark/sql/types.py}}. > This looks not used from the first place - > > https://github.com/apache/spark/commit/880eabec37c69ce4e9594d7babfac291b0f93f50. > Look we could remove this out. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21789) Remove obsolete codes for parsing abstract schema strings
[ https://issues.apache.org/jira/browse/SPARK-21789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21789: Assignee: Hyukjin Kwon > Remove obsolete codes for parsing abstract schema strings > - > > Key: SPARK-21789 > URL: https://issues.apache.org/jira/browse/SPARK-21789 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > It looks there are private functions that look not used in the main codes - > {{_split_schema_abstract}}, {{_parse_field_abstract}}, > {{_parse_schema_abstract}} and {{_infer_schema_type}}, located in > {{python/pyspark/sql/types.py}}. > This looks not used from the first place - > > https://github.com/apache/spark/commit/880eabec37c69ce4e9594d7babfac291b0f93f50. > Look we could remove this out. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21779) Simpler Dataset.sample API in Python
[ https://issues.apache.org/jira/browse/SPARK-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21779: Assignee: Hyukjin Kwon > Simpler Dataset.sample API in Python > > > Key: SPARK-21779 > URL: https://issues.apache.org/jira/browse/SPARK-21779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Hyukjin Kwon > Fix For: 2.3.0 > > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21780) Simpler Dataset.sample API in R
[ https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150007#comment-16150007 ] Hyukjin Kwon commented on SPARK-21780: -- Let me work on this. > Simpler Dataset.sample API in R > --- > > Key: SPARK-21780 > URL: https://issues.apache.org/jira/browse/SPARK-21780 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21779) Simpler Dataset.sample API in Python
[ https://issues.apache.org/jira/browse/SPARK-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21779. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18999 [https://github.com/apache/spark/pull/18999] > Simpler Dataset.sample API in Python > > > Key: SPARK-21779 > URL: https://issues.apache.org/jira/browse/SPARK-21779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > Fix For: 2.3.0 > > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149997#comment-16149997 ] liupengcheng commented on SPARK-21885: -- [~viirya] I think it's necessary, consider this senario, you have a timer job, and your schema may varies with time, you need to read the history data with old schema, but you are not expected to use `INFER_AND_SAVE` to change the current schema. What's more, event if use `INFER_AND_SAVE`, it seems like that it will still infer schema. although there is some cache, but i think it's not enough, the first execution of any query for each session would be very slow. {code:java} private def inferIfNeeded( relation: MetastoreRelation, options: Map[String, String], fileFormat: FileFormat, fileIndexOpt: Option[FileIndex] = None): (StructType, CatalogTable) = { val inferenceMode = sparkSession.sessionState.conf.caseSensitiveInferenceMode val shouldInfer = (inferenceMode != {color:red}NEVER_INFER{color}) && !relation.catalogTable.schemaPreservesCase val tableName = relation.catalogTable.identifier.unquotedString if (shouldInfer) { logInfo(s"Inferring case-sensitive schema for table $tableName (inference mode: " + s"$inferenceMode)") val fileIndex = fileIndexOpt.getOrElse { val rootPath = new Path(relation.catalogTable.location) new InMemoryFileIndex(sparkSession, Seq(rootPath), options, None) } val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.catalogTable.schema, _)) inferredSchema match { case Some(schema) => if (inferenceMode == INFER_AND_SAVE) { updateCatalogSchema(relation.catalogTable.identifier, schema) } (schema, relation.catalogTable.copy(schema = schema)) case None => logWarning(s"Unable to infer schema for table $tableName from file format " + s"$fileFormat (inference mode: $inferenceMode). Using metastore schema.") (relation.catalogTable.schema, relation.catalogTable) } } else { (relation.catalogTable.schema, relation.catalogTable) } } {code} > HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference > enabled > --- > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. 
> Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java
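For context, a hedged sketch of how the inference mode discussed in this thread is selected; the table name is a placeholder and a Hive-enabled session is assumed:

{code:python}
# Hedged sketch: the behaviour discussed above is driven by
# spark.sql.hive.caseSensitiveInferenceMode, which accepts INFER_AND_SAVE
# (the default), INFER_ONLY, or NEVER_INFER. Table name is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("inference-mode-demo")
         .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
         .enableHiveSupport()
         .getOrCreate())

# With INFER_ONLY the schema is re-inferred from the files for each session,
# which is where the file listing / getFileBlockLocations cost shows up.
spark.table("some_db.some_partitioned_parquet_table").printSchema()
{code}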
[jira] [Commented] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149973#comment-16149973 ] Liang-Chi Hsieh commented on SPARK-21885: - I tend to agree that when we don't actually need the block locations, the time spent on obtaining them is a waste. However, {{INFER_ONLY}} should only be used in very limited scenarios when you can't use {{INFER_AND_SAVE}} that only infers the schema on the first time. So I'm not sure if we should consider this when listing leaf files. > HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference > enabled > --- > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileInd
[jira] [Commented] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149880#comment-16149880 ] liupengcheng commented on SPARK-21885: -- [~srowen] Fixed! thanks! anybody check this problem? > HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference > enabled > --- > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:30
[jira] [Resolved] (SPARK-21852) Empty Parquet Files created as a result of spark jobs fail when read
[ https://issues.apache.org/jira/browse/SPARK-21852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21852. -- Resolution: Cannot Reproduce [~sdalmia_asf], I think I need some details about "certain spark jobs" to check this issue. Otherwise, looks I am unable to follow and check. I am resolving this assuming it is currently difficult to provide the information. Please let me know if you are actually aware of how to reproduce this that I could follow and try. I am resolving this for now and for the current status of this JIRA. > Empty Parquet Files created as a result of spark jobs fail when read > > > Key: SPARK-21852 > URL: https://issues.apache.org/jira/browse/SPARK-21852 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Shivam Dalmia > > I have faced an issue intermittently with certain spark jobs writing parquet > files which apparently succeed but the written .parquet directory in HDFS is > an empty directory (with no _SUCCESS and _metadata parts, even). > Surprisingly, no errors are thrown from spark dataframe writer. > However, when attempting to read this written file, spark throws the error: > {{Unable to infer schema for Parquet. It must be specified manually}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
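A hedged workaround sketch for the read-side failure quoted above: supplying the schema explicitly (path and fields are placeholders) sidesteps the "Unable to infer schema for Parquet" error when the written directory turns out to be empty:

{code:python}
# Hedged workaround sketch: an explicit schema avoids the
# "Unable to infer schema for Parquet" error on an empty directory.
# Path and fields are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("empty-parquet-read").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(schema).parquet("hdfs:///path/to/possibly-empty-dir")
print(df.count())  # 0 rows for an empty directory instead of an AnalysisException
{code}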
[jira] [Commented] (SPARK-21477) Mark LocalTableScanExec's input data transient
[ https://issues.apache.org/jira/browse/SPARK-21477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149802#comment-16149802 ] Apache Spark commented on SPARK-21477: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19101 > Mark LocalTableScanExec's input data transient > -- > > Key: SPARK-21477 > URL: https://issues.apache.org/jira/browse/SPARK-21477 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Mark the parameter rows and unsafeRow of LocalTableScanExec transient. It can > avoid serializing the unneeded objects. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149801#comment-16149801 ] Apache Spark commented on SPARK-21884: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19101 > Fix StackOverflowError on MetadataOnlyQuery > --- > > Key: SPARK-21884 > URL: https://issues.apache.org/jira/browse/SPARK-21884 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > This issue aims to fix StackOverflowError in branch-2.2. In Apache master > branch, it doesn't throw StackOverflowError. > {code} > scala> spark.version > res0: String = 2.2.0 > scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY > (p)") > res1: org.apache.spark.sql.DataFrame = [] > scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION > (p=$p)")) > scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect > java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING
[ https://issues.apache.org/jira/browse/SPARK-21891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21891: Assignee: Apache Spark (was: Xiao Li) > Add TBLPROPERTIES to DDL statement: CREATE TABLE USING > --- > > Key: SPARK-21891 > URL: https://issues.apache.org/jira/browse/SPARK-21891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING
[ https://issues.apache.org/jira/browse/SPARK-21891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21891: Assignee: Xiao Li (was: Apache Spark) > Add TBLPROPERTIES to DDL statement: CREATE TABLE USING > --- > > Key: SPARK-21891 > URL: https://issues.apache.org/jira/browse/SPARK-21891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING
[ https://issues.apache.org/jira/browse/SPARK-21891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149781#comment-16149781 ] Apache Spark commented on SPARK-21891: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19100 > Add TBLPROPERTIES to DDL statement: CREATE TABLE USING > --- > > Key: SPARK-21891 > URL: https://issues.apache.org/jira/browse/SPARK-21891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21891) Add TBLPROPERTIES to DDL statement: CREATE TABLE USING
Xiao Li created SPARK-21891: --- Summary: Add TBLPROPERTIES to DDL statement: CREATE TABLE USING Key: SPARK-21891 URL: https://issues.apache.org/jira/browse/SPARK-21891 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
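A hedged sketch of the DDL shape this issue adds, issued through {{spark.sql}} (table name and properties are placeholders):

{code:python}
# Hedged sketch of the DDL shape being added: TBLPROPERTIES on a
# CREATE TABLE ... USING statement. Table name and properties are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tblproperties-demo").getOrCreate()

spark.sql("""
  CREATE TABLE demo_tbl (id INT, name STRING)
  USING parquet
  TBLPROPERTIES ('owner' = 'data-team', 'retention.days' = '30')
""")

spark.sql("SHOW TBLPROPERTIES demo_tbl").show(truncate=False)
{code}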
[jira] [Resolved] (SPARK-21862) Add overflow check in PCA
[ https://issues.apache.org/jira/browse/SPARK-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21862. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19078 [https://github.com/apache/spark/pull/19078] > Add overflow check in PCA > - > > Key: SPARK-21862 > URL: https://issues.apache.org/jira/browse/SPARK-21862 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.3.0 > > > We should add overflow check in PCA, otherwise it is possible to throw > `NegativeArraySizeException` when `k` and `numFeatures` are too large. > The overflow checking formula is here: > https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
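A hedged illustration (not the actual Spark/Breeze code) of the kind of guard being added: checking that the dense result matrices stay within the JVM array-size limit before running the decomposition:

{code:python}
# Hedged illustration only: the guard amounts to checking that element counts
# of the dense intermediate/result matrices stay below Int.MaxValue, so that
# a JVM array can hold them and NegativeArraySizeException cannot occur.
INT_MAX = 2**31 - 1

def check_pca_sizes(num_features: int, k: int) -> None:
    # Covariance matrix is numFeatures x numFeatures; the principal-components
    # matrix is numFeatures x k.
    for name, n_elems in [("covariance matrix", num_features * num_features),
                          ("principal components", num_features * k)]:
        if n_elems > INT_MAX:
            raise ValueError("{} with {} elements exceeds the maximum JVM array "
                             "size; reduce k or numFeatures".format(name, n_elems))

check_pca_sizes(num_features=10000, k=100)     # fine
# check_pca_sizes(num_features=50000, k=50)    # would raise: covariance has 2.5e9 elements
{code}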
[jira] [Closed] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-21884. - Resolution: Duplicate This is fixed at SPARK-21477 in master branch. > Fix StackOverflowError on MetadataOnlyQuery > --- > > Key: SPARK-21884 > URL: https://issues.apache.org/jira/browse/SPARK-21884 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > This issue aims to fix StackOverflowError in branch-2.2. In Apache master > branch, it doesn't throw StackOverflowError. > {code} > scala> spark.version > res0: String = 2.2.0 > scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY > (p)") > res1: org.apache.spark.sql.DataFrame = [] > scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION > (p=$p)")) > scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect > java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21088) CrossValidator, TrainValidationSplit should preserve all models after fitting: Python
[ https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149743#comment-16149743 ] Daniel Imberman commented on SPARK-21088: - [~ajaysaini] Are you still working on this one? > CrossValidator, TrainValidationSplit should preserve all models after > fitting: Python > - > > Key: SPARK-21088 > URL: https://issues.apache.org/jira/browse/SPARK-21088 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > See parent JIRA -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20221) Port pyspark.mllib.linalg tests in pyspark/mllib/tests.py to pyspark.ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-20221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149742#comment-16149742 ] Daniel Imberman commented on SPARK-20221: - I can do this one > Port pyspark.mllib.linalg tests in pyspark/mllib/tests.py to pyspark.ml.linalg > -- > > Key: SPARK-20221 > URL: https://issues.apache.org/jira/browse/SPARK-20221 > Project: Spark > Issue Type: Test > Components: ML, PySpark, Tests >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > There are several linear algebra unit tests in pyspark/mllib/tests.py which > should be ported over to pyspark/ml/tests.py. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21110) Structs should be usable in inequality filters
[ https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21110. - Resolution: Fixed Assignee: Andrew Ray Fix Version/s: 2.3.0 > Structs should be usable in inequality filters > -- > > Key: SPARK-21110 > URL: https://issues.apache.org/jira/browse/SPARK-21110 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Nicholas Chammas >Assignee: Andrew Ray >Priority: Minor > Fix For: 2.3.0 > > > It seems like a missing feature that you can't compare structs in a filter on > a DataFrame. > Here's a simple demonstration of a) where this would be useful and b) how > it's different from simply comparing each of the components of the structs. > {code} > import pyspark > from pyspark.sql.functions import col, struct, concat > spark = pyspark.sql.SparkSession.builder.getOrCreate() > df = spark.createDataFrame( > [ > ('Boston', 'Bob'), > ('Boston', 'Nick'), > ('San Francisco', 'Bob'), > ('San Francisco', 'Nick'), > ], > ['city', 'person'] > ) > pairs = ( > df.select( > struct('city', 'person').alias('p1') > ) > .crossJoin( > df.select( > struct('city', 'person').alias('p2') > ) > ) > ) > print("Everything") > pairs.show() > print("Comparing parts separately (doesn't give me what I want)") > (pairs > .where(col('p1.city') < col('p2.city')) > .where(col('p1.person') < col('p2.person')) > .show()) > print("Comparing parts together with concat (gives me what I want but is > hacky)") > (pairs > .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person')) > .show()) > print("Comparing parts together with struct (my desired solution but > currently yields an error)") > (pairs > .where(col('p1') < col('p2')) > .show()) > {code} > The last query yields the following error in Spark 2.1.1: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to > data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint > or int or bigint or float or double or decimal or timestamp or date or string > or binary) type, not struct;; > 'Filter (p1#5 < p2#8) > +- Join Cross >:- Project [named_struct(city, city#0, person, person#1) AS p1#5] >: +- LogicalRDD [city#0, person#1] >+- Project [named_struct(city, city#0, person, person#1) AS p2#8] > +- LogicalRDD [city#0, person#1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21652: Assignee: (was: Apache Spark) > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. > See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! >+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! 
+- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) && > (col2#33 = col#34)) > !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) && (col2#33 = col1#32)) :- Filter (((isnotnull(col1#32) && > isnotnull(col2#33)) && ((col1#32 = 1) && (1 = col2#33))) && (
[jira] [Assigned] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21652: Assignee: Apache Spark > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. > See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! >+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! 
+- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) && > (col2#33 = col#34)) > !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) && (col2#33 = col1#32)) :- Filter (((isnotnull(col1#32) && > isnotnull(col2#33)) && ((col1#32 = 1
[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149680#comment-16149680 ] Apache Spark commented on SPARK-21652: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/19099 > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. > See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! >+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! 
+- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) && > (col2#33 = col#34)) > !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) && (col
[jira] [Resolved] (SPARK-20676) Upload to PyPi
[ https://issues.apache.org/jira/browse/SPARK-20676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-20676. - Resolution: Fixed Fix Version/s: 2.2.0 > Upload to PyPi > -- > > Key: SPARK-20676 > URL: https://issues.apache.org/jira/browse/SPARK-20676 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.1.1, 2.2.0 >Reporter: holdenk >Assignee: holdenk > Fix For: 2.2.0 > > > Upload PySpark to PyPi. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
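With the PyPI upload in place as of 2.2.0, a plain pip environment can start a local session; a minimal usage sketch:

{code:python}
# Minimal usage sketch once the package is installed from PyPI:
#
#   pip install pyspark
#
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("pypi-check").getOrCreate()
print(spark.version)
spark.stop()
{code}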
[jira] [Commented] (SPARK-20676) Upload to PyPi
[ https://issues.apache.org/jira/browse/SPARK-20676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149656#comment-16149656 ] holdenk commented on SPARK-20676: - Yes. > Upload to PyPi > -- > > Key: SPARK-20676 > URL: https://issues.apache.org/jira/browse/SPARK-20676 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.1.1, 2.2.0 >Reporter: holdenk >Assignee: holdenk > Fix For: 2.2.0 > > > Upload PySpark to PyPi. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-21866: --- Attachment: (was: SPIP - Image support for Apache Spark.pdf) > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information abou
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Hunter updated SPARK-21866: --- Attachment: SPIP - Image support for Apache Spark V1.1.pdf Updated authors' list. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Som
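For readers skimming the proposal, here is a rough sketch of how the two schema fields visible in the excerpt above translate into Spark SQL types. The description is truncated in this digest, so only those two fields are shown; field names, types and nullability are taken from the text, everything else is illustrative:

{code}
import org.apache.spark.sql.types._

// Sketch of the proposed image schema, limited to the fields quoted above; the SPIP defines
// further fields (the description is cut off here), so this is not the complete schema.
val imageSchema = StructType(Seq(
  StructField("mode", StringType, nullable = false),  // OpenCV-style type string, e.g. "CV_8UC3"; "" if the image failed to load
  StructField("origin", StringType, nullable = true)  // information about the image's source
))
{code}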
[jira] [Commented] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100
[ https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149600#comment-16149600 ] Andrew Ash commented on SPARK-21807: For reference, here's a stacktrace I'm seeing on a cluster before this change that I think this PR will improve: {noformat} "spark-task-4" #714 prio=5 os_prio=0 tid=0x7fa368031000 nid=0x4d91 runnable [0x7fa24e592000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:220) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.EqualNullSafe.equals(predicates.scala:505) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40) at scala.collection.mutable.FlatHashTable$class.growTable(FlatHashTable.scala:225) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:159) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40) at scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139) at scala.collection.mutable.HashSet.addElem(HashSet.scala:40) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46) at scala.collection.mutable.HashSet.clone(HashSet.scala:83) at 
scala.collection.mutable.HashSet.clone(HashSet.scala:40) at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65) at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50) at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141) at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104) at scala.collection.TraversableOnce$class.$div$colon(TraversableOn
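To make the shape of the problem concrete, here is a purely hypothetical sketch for a spark-shell session (column names and counts are illustrative, not taken from the job behind the stack trace above): stacking many aliased expressions on top of each other inflates the ExpressionSet work done by getAliasedConstraints, so planning a trivial filter can dominate the run time.

{code}
import org.apache.spark.sql.functions.col

// Hypothetical illustration only: many derived, aliased columns make the constraint sets
// built during optimization large, so most of the time goes into query planning, not execution.
val base = spark.range(100).toDF("c0")
val wide = (1 to 150).foldLeft(base) { (df, i) => df.withColumn(s"c$i", col("c0") + i) }
wide.filter(col("c0") > 0).explain()
{code}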
[jira] [Commented] (SPARK-21276) Update lz4-java to remove custom LZ4BlockInputStream
[ https://issues.apache.org/jira/browse/SPARK-21276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149568#comment-16149568 ] Apache Spark commented on SPARK-21276: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/18883 > Update lz4-java to remove custom LZ4BlockInputStream > - > > Key: SPARK-21276 > URL: https://issues.apache.org/jira/browse/SPARK-21276 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Trivial > Fix For: 2.3.0 > > > We currently use custom LZ4BlockInputStream to read concatenated byte stream > in shuffle > (https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/io/LZ4BlockInputStream.java#L38). > In the recent pr (https://github.com/lz4/lz4-java/pull/105), this > functionality is implemented even in lz4-java upstream. So, we might update > the lz4-java package that will be released in near future. > Issue about the next lz4-java release > https://github.com/lz4/lz4-java/issues/98 > Diff between the latest release and the master in lz4-java > https://github.com/lz4/lz4-java/compare/62f7547abb0819d1ca1e669645ee1a9d26cd60b0...6480bd9e06f92471bf400c16d4d5f3fd2afa3b3d > * fixed NPE in XXHashFactory similarly > * Don't place resources in default package to support shading > * Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed > * Try to load lz4-java from java.library.path, then fallback to bundled > * Add ppc64le binary > * Add s390x JNI binding > * Add basic LZ4 Frame v1.5.0 support > * enable aarch64 support for lz4-java > * Allow unsafeInstance() for ppc64le archiecture > * Add unsafeInstance support for AArch64 > * Support 64-bit JNI build on Solaris > * Avoid over-allocating a buffer > * Allow EndMark to be incompressible for LZ4FrameInputStream. > * Concat byte stream -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21890) ObtainCredentials does not pass creds to addDelegationTokens
Sanket Reddy created SPARK-21890: Summary: ObtainCredentials does not pass creds to addDelegationTokens Key: SPARK-21890 URL: https://issues.apache.org/jira/browse/SPARK-21890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Sanket Reddy I observed this while running a oozie job trying to connect to hbase via spark. It look like the creds are not being passed in thehttps://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for 2.2 release. Stack trace: Warning: Skip remote jar hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/schintap/spark_oozie/apps/lib/spark-starter-2.0-SNAPSHOT-jar-with-dependencies.jar. Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523) org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523) at org.apache.hadoop.ipc.Client.call(Client.java:1471) at org.apache.hadoop.ipc.Client.call(Client.java:1408) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy10.getDelegationToken(Unknown Source) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy11.getDelegationToken(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1038) at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543) at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:531) at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:509) at org.apache.hadoop.hdfs.DistributedFileSystem.addDelegation
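For context, the Hadoop API involved here is FileSystem.addDelegationTokens, which writes the tokens it fetches into whatever Credentials object it is given. A minimal, hedged sketch of what "passing the creds through" means (this is not the HadoopFSCredentialProvider code itself, and the choice of renewer is illustrative):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Minimal sketch, not the actual provider code: the fetched tokens must end up in the
// Credentials object the caller passed in, not in a fresh, throwaway one.
def obtainHdfsTokens(hadoopConf: Configuration, creds: Credentials): Unit = {
  val renewer = UserGroupInformation.getCurrentUser.getUserName  // illustrative renewer choice
  val fs = FileSystem.get(hadoopConf)
  fs.addDelegationTokens(renewer, creds)  // tokens are added to `creds`
}
{code}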
[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n
[ https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149510#comment-16149510 ] Timothy Hunter commented on SPARK-21184: [~a1ray] thank you for the report, someone should investigate about these given values. You raise some valid questions about the choice of data structures and algorithm, which were discussed during the implementation and that can certainly be revisited: - tree structures: the major constraint here is that this structure gets serialized often, due to how UDAFs work. This is why the current implementation is amortized over multiple records. Edo Liberty has published some recent work that is relevant in that area. - algorithm: we looked at t-digest (and q-digest). The main concern back then was that there was no published worst-time guarantee given a target precision. This is still the case to my knowledge. Because of that, it is hard to understand what could happen in some unusual cases - which tend to be not so unusual in big data. That being said, it looks like it is a popular and well-maintained choice now, so I am certainly open to relaxing this constraint. > QuantileSummaries implementation is wrong and QuantileSummariesSuite fails > with larger n > > > Key: SPARK-21184 > URL: https://issues.apache.org/jira/browse/SPARK-21184 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Andrew Ray > > 1. QuantileSummaries implementation does not match the paper it is supposed > to be based on. > 1a. The compress method > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240) > merges neighboring buckets, but thats not what the paper says to do. The > paper > (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) > describes an implicit tree structure and the compress method deletes selected > subtrees. > 1b. The paper does not discuss merging these summary data structures at all. > The following comment is in the merge method of QuantileSummaries: > {quote} // The GK algorithm is a bit unclear about it, but it seems > there is no need to adjust the > // statistics during the merging: the invariants are still respected > after the merge.{quote} > Unless I'm missing something that needs substantiation, it's not clear that > that the invariants hold. > 2. QuantileSummariesSuite fails with n = 1 (and other non trivial values) > https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27 > One possible solution if these issues can't be resolved would be to move to > an algorithm that explicitly supports merging and is well tested like > https://github.com/tdunning/t-digest -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
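For anyone who wants to poke at the reported behaviour without writing a unit test, the public entry point backed by QuantileSummaries is DataFrameStatFunctions.approxQuantile; a quick sketch with illustrative data (not the failing case from QuantileSummariesSuite):

{code}
// Illustrative only: approxQuantile is implemented on top of QuantileSummaries, so it is an
// easy way to exercise the structure from a spark-shell session.
val df = spark.range(0, 100000).toDF("x")
val quartiles = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
println(quartiles.mkString(", "))
{code}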
[jira] [Issue Comment Deleted] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-21652: - Comment: was deleted (was: Is there anything I can help here? I see that some cost-based estimation is needed. If there is an example/guide of what should be done, I can try to fix the issue.) > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. > See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! 
>+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! +- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) && > (col2#33 = col#34)) > !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) &&
[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149462#comment-16149462 ] Anton Okolnychyi commented on SPARK-21652: -- Is there anything I can help here? I see that some cost-based estimation is needed. If there is an example/guide of what should be done, I can try to fix the issue. > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. > See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! 
>+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! +- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) && > (col2#33 = col#34)) > !:- Filter (((isnotnull(col1#32)
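As a side note for anyone hitting this while it is being investigated: the iteration cap that the optimizer keeps reaching comes from the spark.sql.optimizer.maxIterations setting (default 100). Raising it only lets you observe more iterations, it does not make the rules above converge, and whether a runtime change of the setting is picked up may depend on the Spark version, so treat this purely as a hedged sketch:

{code}
// Hedged sketch: raise the optimizer's fixed-point iteration limit (default 100) to watch the
// rules bounce between the same plans; this is not a fix for the non-convergence itself.
spark.conf.set("spark.sql.optimizer.maxIterations", "200")
spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col").explain(true)
{code}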
[jira] [Commented] (SPARK-21842) Support Kerberos ticket renewal and creation in Mesos
[ https://issues.apache.org/jira/browse/SPARK-21842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149433#comment-16149433 ] Kalvin Chau commented on SPARK-21842: - I'm looking into implementing this feature. Is the implementation still following the design doc, i.e. start up a periodic renewal thread in the driver and then use RPC calls to send the renewed tokens from the driver/scheduler to the executors? Or does the YARN polling method sound like something we should go with, to be consistent with YARN? > Support Kerberos ticket renewal and creation in Mesos > -- > > Key: SPARK-21842 > URL: https://issues.apache.org/jira/browse/SPARK-21842 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand > > We at Mesosphere have written Kerberos support for Spark on Mesos. The code > to use Kerberos on a Mesos cluster has been added to Apache Spark > (SPARK-16742). This ticket is to complete the implementation and allow for > ticket renewal and creation. Specifically for long running and streaming jobs. > Mesosphere design doc (needs revision, wip): > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
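A hedged sketch of the first option mentioned above (a periodic renewal thread in the driver); fetchNewTokens and sendToExecutors are placeholders for the Kerberos and RPC pieces under discussion, not existing Spark or Mesos APIs:

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative only: a single-threaded scheduler in the driver that periodically obtains fresh
// delegation tokens and hands them to whatever transport ships them to the executors.
def startRenewalThread(intervalSec: Long)(fetchNewTokens: () => Array[Byte])(sendToExecutors: Array[Byte] => Unit): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = sendToExecutors(fetchNewTokens())
  }, intervalSec, intervalSec, TimeUnit.SECONDS)
}
{code}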
[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149401#comment-16149401 ] Anton Okolnychyi commented on SPARK-21850: -- [~instanceof me] yeah, you are absolutely correct, I need to double escape it. This means that a scala string "\\\\n" is "\\n" on the SQL layer and is "\n" as an actual column value, right? The LIKE pattern validation happens on actual values, which means that the last statement in the block below also fails with the same exception:
{code}
Seq((1, "\\nbc")).toDF("col1", "col2").write.saveAsTable("t1")
spark.sql("SELECT * FROM t1").show()
+----+----+
|col1|col2|
+----+----+
|   1|\nbc|
+----+----+
spark.sql("SELECT * FROM t1 WHERE col2 = '\\\\nbc'").show()
+----+----+
|col1|col2|
+----+----+
|   1|\nbc|
+----+----+
spark.sql("SELECT * FROM t1 WHERE col2 LIKE '\\\\nb_'").show() // fails
{code}
Since "\n" as a column value corresponds to "\\n" in a scala string and the exception occurs even with literals, the overall behavior looks logical. Am I missing anything? I am curious to understand, so any explanation is welcome. > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). 
If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
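To make the escaping levels in the comment above concrete, here is a small sketch for a spark-shell session (assuming Spark 2.2 semantics; a temp view is used instead of saveAsTable purely for brevity):

{code}
// Escaping levels, as discussed above (assuming Spark 2.2):
//   the Scala literal "\\nbc" holds the four characters \nbc;
//   the SQL parser unescapes \\ to \, so the SQL text needs two backslashes;
//   LIKE unescapes \\ to \ once more, so its pattern needs four backslashes in the SQL text.
Seq((1, "\\nbc")).toDF("col1", "col2").createOrReplaceTempView("t1")
spark.sql("SELECT * FROM t1 WHERE col2 = '\\\\nbc'").show()        // 4 backslashes in Scala source -> 2 in SQL -> 1 in the value
spark.sql("SELECT * FROM t1 WHERE col2 LIKE '\\\\\\\\nb_'").show() // 8 in Scala source -> 4 in SQL -> 2 in the pattern -> matches \nbc
{code}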
[jira] [Commented] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149397#comment-16149397 ] Apache Spark commented on SPARK-21583: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19098 > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.3.0 > > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149393#comment-16149393 ] Danil Kirsanov commented on SPARK-21866: Hi Sean, echoing the previous comments: yes, this is a small project, basically just a schema and the functions for reading images. At the same time, figuring it out proved to be quite time consuming, so it's easier to agree on a common format that could be shared among different pipelines and libraries. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by conventio
[jira] [Comment Edited] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149350#comment-16149350 ] Parth Gandhi edited comment on SPARK-21888 at 8/31/17 6:01 PM: --- The spark job runs successfully only if hbase-site.xml is placed in SPARK_CONF_DIR. If I add the xml file to --jars then it gets added to the driver classpath which is required but hbase fails to get a valid Kerberos token as the xml file is not found in the system classpath on the gateway where I launch the application. was (Author: pgandhi): The spark job runs successfully only if hbase-site.xml is placed in SPARK_CONF_DIR. If I add the xml file to --jars then it gets added to the driver classmate which is required but hbase fails to get a valid Kerberos token as the xml file is not found in the system classpath on the gateway where I launch the application. > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do so earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20812) Add Mesos Secrets support to the spark dispatcher
[ https://issues.apache.org/jira/browse/SPARK-20812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20812. Resolution: Fixed Assignee: Arthur Rand (was: Apache Spark) Fix Version/s: 2.3.0 > Add Mesos Secrets support to the spark dispatcher > - > > Key: SPARK-20812 > URL: https://issues.apache.org/jira/browse/SPARK-20812 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Michael Gummelt >Assignee: Arthur Rand > Fix For: 2.3.0 > > > Mesos 1.4 will support secrets. In order to support sending keytabs through > the Spark Dispatcher, or any other secret, we need to integrate this with the > Spark Dispatcher. > The integration should include support for both file-based and env-based > secrets. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149350#comment-16149350 ] Parth Gandhi commented on SPARK-21888: -- The spark job runs successfully only if hbase-site.xml is placed in SPARK_CONF_DIR. If I add the xml file to --jars then it gets added to the driver classmate which is required but hbase fails to get a valid Kerberos token as the xml file is not found in the system classpath on the gateway where I launch the application. > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do so earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17107) Remove redundant pushdown rule for Union
[ https://issues.apache.org/jira/browse/SPARK-17107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149322#comment-16149322 ] Apache Spark commented on SPARK-17107: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19097 > Remove redundant pushdown rule for Union > > > Key: SPARK-17107 > URL: https://issues.apache.org/jira/browse/SPARK-17107 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.1.0 > > > The Optimizer rules PushThroughSetOperations and PushDownPredicate have a > redundant rule to push down Filter through Union. We should remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21889) Web site headers not rendered correctly in some pages
[ https://issues.apache.org/jira/browse/SPARK-21889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21889. Resolution: Duplicate > Web site headers not rendered correctly in some pages > - > > Key: SPARK-21889 > URL: https://issues.apache.org/jira/browse/SPARK-21889 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Some headers on pages of the Spark docs are not rendering correctly. In > http://spark.apache.org/docs/latest/configuration.html, you can find a few of > these examples under the "Memory Management" section, where it shows things > like: > {noformat} > ### Execution Behavior > {noformat} > {noformat} > ### Networking > {noformat} > And others. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21889) Web site headers not rendered correctly in some pages
[ https://issues.apache.org/jira/browse/SPARK-21889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149318#comment-16149318 ] Marcelo Vanzin commented on SPARK-21889: Ah, I only searched for open bugs, not closed ones. > Web site headers not rendered correctly in some pages > - > > Key: SPARK-21889 > URL: https://issues.apache.org/jira/browse/SPARK-21889 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Some headers on pages of the Spark docs are not rendering correctly. In > http://spark.apache.org/docs/latest/configuration.html, you can find a few of > these examples under the "Memory Management" section, where it shows things > like: > {noformat} > ### Execution Behavior > {noformat} > {noformat} > ### Networking > {noformat} > And others. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149315#comment-16149315 ] Sean Owen commented on SPARK-21888: --- For your case specifically, shouldn't hbase-site.xml be available from the cluster environment, given its name? What happens if you add the file to --jars anyway; I'd not be surprised if it just ends up on the classpath too. Or, build it into your app jar? > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do so earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21889) Web site headers not rendered correctly in some pages
[ https://issues.apache.org/jira/browse/SPARK-21889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149308#comment-16149308 ] Sean Owen commented on SPARK-21889: --- Yeah I think this is the stuff fixed in https://issues.apache.org/jira/browse/SPARK-19106 > Web site headers not rendered correctly in some pages > - > > Key: SPARK-21889 > URL: https://issues.apache.org/jira/browse/SPARK-21889 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Some headers on pages of the Spark docs are not rendering correctly. In > http://spark.apache.org/docs/latest/configuration.html, you can find a few of > these examples under the "Memory Management" section, where it shows things > like: > {noformat} > ### Execution Behavior > {noformat} > {noformat} > ### Networking > {noformat} > And others. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21889) Web site headers not rendered correctly in some pages
Marcelo Vanzin created SPARK-21889: -- Summary: Web site headers not rendered correctly in some pages Key: SPARK-21889 URL: https://issues.apache.org/jira/browse/SPARK-21889 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.2.0 Reporter: Marcelo Vanzin Priority: Minor Some headers on pages of the Spark docs are not rendering correctly. In http://spark.apache.org/docs/latest/configuration.html, you can find a few of these examples under the "Memory Management" section, where it shows things like: {noformat} ### Execution Behavior {noformat} {noformat} ### Networking {noformat} And others. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149263#comment-16149263 ] Parth Gandhi commented on SPARK-21888: -- Sorry I forgot to mention that, --jars certainly adds jar files to client classpath, but not config files like hbase-site.xml. > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do so earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
[ https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21886. - Resolution: Fixed Assignee: Jacek Laskowski Fix Version/s: 2.3.0 > Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD > logical operator > --- > > Key: SPARK-21886 > URL: https://issues.apache.org/jira/browse/SPARK-21886 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 2.3.0 > > > While exploring where {{LogicalRDD}} is created I noticed that there are a > few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is > to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149217#comment-16149217 ] Sean Owen commented on SPARK-21888: --- You haven't said what you tried. --jars does this. > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do so earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21878) Create SQLMetricsTestUtils
[ https://issues.apache.org/jira/browse/SPARK-21878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21878. - Resolution: Fixed Fix Version/s: 2.3.0 > Create SQLMetricsTestUtils > -- > > Key: SPARK-21878 > URL: https://issues.apache.org/jira/browse/SPARK-21878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific > and the other SQLMetrics test cases. > Also, move two SQLMetrics test cases from sql/hive to sql/core. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149186#comment-16149186 ] Timothy Hunter commented on SPARK-21866: [~srowen] thank you for the comments. Indeed, this proposal is limited in scope on purpose, because it aims at achieving consensus around multiple libraries. For instance, the MMLSpark project from Microsoft uses this data format to interface with OpenCV (wrapped through JNI), and the Deep Learning Pipelines is going to rely on it as its primary mechanism to load and process images. Also, nothing precludes adding common transforms to this package later - it is easier to start small. Regarding the spark package, yes, it will be discontinued like the CSV parser. The aim is to offer a working library that can be tried out without having to wait for an implementation to be merged into Spark itself. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. 
Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = Struct
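For readers who want a concrete picture of the proposed single-column layout, here is a hedged Scala sketch of what an OpenCV-style image schema could look like. The field names (origin, height, width, nChannels, mode, data) follow the layout used by the MMLSpark and Deep Learning Pipelines packages mentioned above; treat them as an illustration rather than the final specification in the SPIP.

{code:scala}
import org.apache.spark.sql.types._

// Illustrative only: field names are assumed from the OpenCV-style layout the
// SPIP describes, not copied from the final specification.
val columnSchema = StructType(Seq(
  StructField("origin", StringType, nullable = true),       // URI the image was loaded from
  StructField("height", IntegerType, nullable = false),     // number of rows
  StructField("width", IntegerType, nullable = false),      // number of columns
  StructField("nChannels", IntegerType, nullable = false),  // e.g. 1 (grayscale), 3 (BGR), 4 (BGRA)
  StructField("mode", IntegerType, nullable = false),       // OpenCV-compatible type tag
  StructField("data", BinaryType, nullable = false)         // decompressed pixel bytes, row-major
))

// The image struct is expected to live in a single DataFrame column.
val imageSchema = StructType(Seq(StructField("image", columnSchema, nullable = true)))
{code}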
[jira] [Updated] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-21888: - Description: While running Spark on Yarn in cluster mode, currently there is no way to add any config files, jars etc. to Client classpath. An example for this is that suppose you want to run an application that uses hbase. Then, unless and until we do not copy the necessary config files required by hbase to Spark Config folder, we cannot specify or set their exact locations in classpath on Client end which we could do so earlier by setting the environment variable "SPARK_CLASSPATH". (was: While running Spark on Yarn in cluster mode, currently there is no way to add any config files, jars etc. to Client classpath. An example for this is that suppose you want to run an application that uses hbase. Then, unless and until we do not copy the necessary config files required by hbase to Spark Config folder, we cannot specify or set their exact locations in classpath on Client end which we could do earlier by setting the environment variable "SPARK_CLASSPATH".) > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do so earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-21888: - Description: While running Spark on Yarn in cluster mode, currently there is no way to add any config files, jars etc. to Client classpath. An example for this is that suppose you want to run an application that uses hbase. Then, unless and until we do not copy the necessary config files required by hbase to Spark Config folder, we cannot specify or set their exact locations in classpath on Client end which we could do earlier by setting the environment variable "SPARK_CLASSPATH". (was: While running Spark on Yarn in cluster mode, currently there is no way to add any config files, jars etc. to Client classpath. An example for this is that suppose you want to run an application that uses hbase. Then, unless and until we do not copy the config files to Spark Config folder, we cannot specify or set their exact locations in classpath on Client end which we could do earlier by setting the environment variable "SPARK_CLASSPATH".) > Cannot add stuff to Client Classpath for Yarn Cluster Mode > -- > > Key: SPARK-21888 > URL: https://issues.apache.org/jira/browse/SPARK-21888 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > While running Spark on Yarn in cluster mode, currently there is no way to add > any config files, jars etc. to Client classpath. An example for this is that > suppose you want to run an application that uses hbase. Then, unless and > until we do not copy the necessary config files required by hbase to Spark > Config folder, we cannot specify or set their exact locations in classpath on > Client end which we could do earlier by setting the environment variable > "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode
Parth Gandhi created SPARK-21888: Summary: Cannot add stuff to Client Classpath for Yarn Cluster Mode Key: SPARK-21888 URL: https://issues.apache.org/jira/browse/SPARK-21888 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Parth Gandhi Priority: Minor While running Spark on Yarn in cluster mode, currently there is no way to add any config files, jars etc. to Client classpath. An example for this is that suppose you want to run an application that uses hbase. Then, unless and until we do not copy the config files to Spark Config folder, we cannot specify or set their exact locations in classpath on Client end which we could do earlier by setting the environment variable "SPARK_CLASSPATH". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21885) HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-21885: - Summary: HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference enabled (was: HiveMetastoreCatalog#InferIfNeeded too slow when caseSensitiveInference enabled) > HiveMetastoreCatalog.InferIfNeeded too slow when caseSensitiveInference > enabled > --- > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAware
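Since the report pins the slowdown on INFER_ONLY, a minimal Scala sketch of where that mode is chosen may help. It assumes a Hive-backed Parquet table named db.parquet_table (a placeholder); the documented values of spark.sql.hive.caseSensitiveInferenceMode are INFER_AND_SAVE, INFER_ONLY and NEVER_INFER.

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch: with INFER_AND_SAVE the case-sensitive schema is inferred once
// and written back to the table properties, so later analyses skip the expensive
// leaf-file listing; INFER_ONLY (the mode in this report) re-infers on every
// analysis, and NEVER_INFER skips inference entirely.
val spark = SparkSession.builder()
  .appName("schema-inference-mode")
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .getOrCreate()

spark.sql("SELECT * FROM db.parquet_table LIMIT 1").show()
{code}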
[jira] [Updated] (SPARK-21885) HiveMetastoreCatalog#InferIfNeeded too slow when caseSensitiveInference enabled
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-21885: - Summary: HiveMetastoreCatalog#InferIfNeeded too slow when caseSensitiveInference enabled (was: SQL inferSchema too slow) > HiveMetastoreCatalog#InferIfNeeded too slow when caseSensitiveInference > enabled > --- > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFi
[jira] [Comment Edited] (SPARK-21802) Make sparkR MLP summary() expose probability column
[ https://issues.apache.org/jira/browse/SPARK-21802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136234#comment-16136234 ] Weichen Xu edited comment on SPARK-21802 at 8/31/17 1:48 PM: - cc [~felixcheung] I found this is already supported automatically, so in which case do you find it does not show up? was (Author: weichenxu123): cc [~felixcheung] Discuss: Should we also add support for probability to other classification algos such as LogisticRegression/LinearSVC and so on..? > Make sparkR MLP summary() expose probability column > --- > > Key: SPARK-21802 > URL: https://issues.apache.org/jira/browse/SPARK-21802 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Priority: Minor > > Make sparkR MLP summary() expose probability column -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13258) --conf properties not honored in Mesos cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13258. --- Resolution: Not A Problem Great, thanks for the research and follow up. > --conf properties not honored in Mesos cluster mode > --- > > Key: SPARK-13258 > URL: https://issues.apache.org/jira/browse/SPARK-13258 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Michael Gummelt > > Spark properties set on {{spark-submit}} via the deprecated > {{SPARK_JAVA_OPTS}} are passed along to the driver, but those set via the > preferred {{--conf}} are not. > For example, this results in the URI being fetched in the executor: > {{SPARK_JAVA_OPTS="-Dspark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md > -Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0" > ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077 > --class org.apache.spark.examples.SparkPi > http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}} > This does not: > {{SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0" > ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077 > --conf > spark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md > --class org.apache.spark.examples.SparkPi > http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}} > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L369 > In the above line of code, you can see that SPARK_JAVA_OPTS is passed along > to the driver, so those properties take effect. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L373 > Whereas in this line of code, you see that {{--conf}} variables are set on > {{SPARK_EXECUTOR_OPTS}}, which AFAICT has absolutely no effect because this > env var is being set on the driver, not the executor. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13258) --conf properties not honored in Mesos cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148982#comment-16148982 ] Stavros Kontopoulos commented on SPARK-13258: - SPARK_JAVA_OPTS has been removed from the code base (https://issues.apache.org/jira/browse/SPARK-14453). I verified that this is not a bug anymore. Ssh'd to a node in a dc/os cluster and run a job in cluster mode: ./bin/spark-submit --verbose --deploy-mode cluster --master mesos://spark.marathon.mesos:14003 --conf spark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md --conf spark.executor.memory=1g --conf spark.executor.cores=2 --conf spark.cores.max=4 --conf spark.mesos.executor.home=/opt/spark/dist --conf spark.mesos.executor.docker.image=mesosphere/spark:1.1.1-2.2.0-hadoop-2.6 --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar 1000 By looking at the driver sandbox the README is fetched: I0831 13:24:54.727625 11257 fetcher.cpp:580] Fetched 'https://raw.githubusercontent.com/mesosphere/spark/master/README.md' to '/var/lib/mesos/slave/slaves/53b59f05-430e-4837-aa14-658ca13a82d3-S0/frameworks/53b59f05-430e-4837-aa14-658ca13a82d3-0002/executors/driver-20170831132454-0007/runs/54527349-52de-49e2-b090-9587e4547b99/README.md' I0831 13:24:54.976838 11268 exec.cpp:162] Version: 1.2.2 I0831 13:24:54.983006 11275 exec.cpp:237] Executor registered on agent 53b59f05-430e-4837-aa14-658ca13a82d3-S0 17/08/31 13:24:56 INFO SparkContext: Running Spark version 2.2.0 Also looking at the executor's sandbox: W0831 13:24:58.799739 11234 fetcher.cpp:322] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: https://raw.githubusercontent.com/mesosphere/spark/master/README.md I0831 13:24:58.799767 11234 fetcher.cpp:580] Fetched 'https://raw.githubusercontent.com/mesosphere/spark/master/README.md' to '/var/lib/mesos/slave/slaves/53b59f05-430e-4837-aa14-658ca13a82d3-S2/frameworks/53b59f05-430e-4837-aa14-658ca13a82d3-0002-driver-20170831132454-0007/executors/3/runs/8aa9a13d-c6a6-4449-b37d-8b8ab32339d2/README.md' I0831 13:24:59.033010 11244 exec.cpp:162] Version: 1.2.2 I0831 13:24:59.038556 11246 exec.cpp:237] Executor registered on agent 53b59f05-430e-4837-aa14-658ca13a82d3-S2 17/08/31 13:25:00 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 7...@ip-10-0-2-43.eu-west-1.compute.internal [~susanxhuynh] [~arand] [~sowen] Not an issue anymore for versions >= 2.2.0 you just need to use the --conf option. > --conf properties not honored in Mesos cluster mode > --- > > Key: SPARK-13258 > URL: https://issues.apache.org/jira/browse/SPARK-13258 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Michael Gummelt > > Spark properties set on {{spark-submit}} via the deprecated > {{SPARK_JAVA_OPTS}} are passed along to the driver, but those set via the > preferred {{--conf}} are not. 
> For example, this results in the URI being fetched in the executor: > {{SPARK_JAVA_OPTS="-Dspark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md > -Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0" > ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077 > --class org.apache.spark.examples.SparkPi > http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}} > This does not: > {{SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=mesosphere/spark:1.6.0" > ./bin/spark-submit --deploy-mode cluster --master mesos://10.0.78.140:7077 > --conf > spark.mesos.uris=https://raw.githubusercontent.com/mesosphere/spark/master/README.md > --class org.apache.spark.examples.SparkPi > http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar}} > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L369 > In the above line of code, you can see that SPARK_JAVA_OPTS is passed along > to the driver, so those properties take effect. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L373 > Whereas in this line of code, you see that {{--conf}} variables are set on > {{SPARK_EXECUTOR_OPTS}}, which AFAICT has absolutely no effect because this > env var is being set on the driver, not the executor. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
[ https://issues.apache.org/jira/browse/SPARK-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linxiaojun updated SPARK-21882: --- Description: The first job called from saveAsHadoopDataset, running in each executor, does not calculate the writtenBytes of OutputMetrics correctly (writtenBytes is 0). The reason is that we did not initialize the callback function called to find bytes written in the right way. As usual, statisticsTable which records statistics in a FileSystem must be initialized at the beginning (this will be triggered when open SparkHadoopWriter). The solution for this issue is to adjust the order of callback function initialization. (was: The first job called from saveAsHadoopDataset, running in each executor, does not calculate the writtenBytes of OutputMetrics correctly. The reason is that we did not initialize the callback function called to find bytes written in the right way. As usual, statisticsTable which records statistics in a FileSystem must be initialized at the beginning (this will be triggered when open SparkHadoopWriter). The solution for this issue is to adjust the order of callback function initialization. ) > OutputMetrics doesn't count written bytes correctly in the > saveAsHadoopDataset function > --- > > Key: SPARK-21882 > URL: https://issues.apache.org/jira/browse/SPARK-21882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.2.0 >Reporter: linxiaojun >Priority: Minor > Attachments: SPARK-21882.patch > > > The first job called from saveAsHadoopDataset, running in each executor, does > not calculate the writtenBytes of OutputMetrics correctly (writtenBytes is > 0). The reason is that we did not initialize the callback function called to > find bytes written in the right way. As usual, statisticsTable which records > statistics in a FileSystem must be initialized at the beginning (this will be > triggered when open SparkHadoopWriter). The solution for this issue is to > adjust the order of callback function initialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
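A quick way to observe the symptom described above without patching Spark is to attach a listener and print the per-task output metrics; on affected versions bytesWritten stays at 0 while recordsWritten is non-zero. This is a minimal sketch against the public listener API, with a placeholder output path.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("output-metrics-check").setMaster("local[2]"))

// Print the output metrics of every finished task.
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).foreach { m =>
      // On affected versions bytesWritten is 0 even though records were written.
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"bytesWritten=${m.outputMetrics.bytesWritten}, " +
        s"recordsWritten=${m.outputMetrics.recordsWritten}")
    }
  }
})

// A plain text write goes through the old Hadoop output path discussed above.
sc.parallelize(1 to 1000, 2).saveAsTextFile("/tmp/output-metrics-check")
{code}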
[jira] [Resolved] (SPARK-21887) DST on History server
[ https://issues.apache.org/jira/browse/SPARK-21887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21887. --- Resolution: Invalid > DST on History server > - > > Key: SPARK-21887 > URL: https://issues.apache.org/jira/browse/SPARK-21887 > Project: Spark > Issue Type: Question > Components: Web UI >Affects Versions: 1.6.1 >Reporter: Iraitz Montalban >Priority: Minor > > I couldn't find any configuration or fix available, so I would like to raise > that even though log files store event timestamps correctly (as UTC), the > Spark History Web UI, which shows local time (GMT+1 in our case), does not > take Daylight Saving Time into account. As a result, during summer, all our > Spark applications show a one-hour difference relative to YARN. > Has anybody experienced the same? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21887) DST on History server
Iraitz Montalban created SPARK-21887: Summary: DST on History server Key: SPARK-21887 URL: https://issues.apache.org/jira/browse/SPARK-21887 Project: Spark Issue Type: Question Components: Web UI Affects Versions: 1.6.1 Reporter: Iraitz Montalban Priority: Minor I couldn't find any configuration or fix available, so I would like to raise that even though log files store event timestamps correctly (as UTC), the Spark History Web UI, which shows local time (GMT+1 in our case), does not take Daylight Saving Time into account. As a result, during summer, all our Spark applications show a one-hour difference relative to YARN. Has anybody experienced the same? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
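The one-hour gap is exactly what appears whenever timestamps are rendered with a fixed GMT+1 offset instead of a DST-aware zone ID. The following is a small, self-contained Scala sketch of that difference using java.time; it is not the History Server's actual rendering code, and the zone Europe/Madrid is only an example.

{code:scala}
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneId, ZoneOffset}

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")
val event = Instant.parse("2017-07-15T10:00:00Z") // event time stored in UTC

// A fixed offset never adjusts for DST.
println(fmt.withZone(ZoneOffset.ofHours(1)).format(event))      // 2017-07-15 11:00
// A region-based zone applies DST rules (CEST = GMT+2 in July).
println(fmt.withZone(ZoneId.of("Europe/Madrid")).format(event)) // 2017-07-15 12:00
{code}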
[jira] [Commented] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148878#comment-16148878 ] Apache Spark commented on SPARK-21869: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/19096 > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
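For context on why this needs care: a cache that closes entries purely on an idle timer can close a producer while a long-running task still holds it. One common remedy is reference counting with a deferred close; the sketch below is a generic illustration of that idea, not the actual change in the pull request above.

{code:scala}
import scala.collection.mutable

// Generic reference-counted cache: an entry flagged for eviction is only closed
// once the last user has released it.
class RefCountedCache[K, V](create: K => V, close: V => Unit) {
  private case class Entry(value: V, var refs: Int = 0, var evicted: Boolean = false)
  private val cache = mutable.HashMap.empty[K, Entry]

  def acquire(key: K): V = synchronized {
    val e = cache.getOrElseUpdate(key, Entry(create(key)))
    e.refs += 1
    e.value
  }

  def release(key: K): Unit = synchronized {
    cache.get(key).foreach { e =>
      e.refs -= 1
      if (e.refs == 0 && e.evicted) { cache.remove(key); close(e.value) }
    }
  }

  // Called by the expiry policy (e.g. a 10-minute timer): close idle entries,
  // defer the close for entries that are still in use.
  def evict(key: K): Unit = synchronized {
    cache.get(key).foreach { e =>
      if (e.refs == 0) { cache.remove(key); close(e.value) }
      else e.evicted = true
    }
  }
}
{code}

In the Kafka case, create would build the producer from its configuration map and close would call its close() method; a task would pair acquire with a release in a finally block.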
[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21869: Assignee: (was: Apache Spark) > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21869) A cached Kafka producer should not be closed if any task is using it.
[ https://issues.apache.org/jira/browse/SPARK-21869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21869: Assignee: Apache Spark > A cached Kafka producer should not be closed if any task is using it. > - > > Key: SPARK-21869 > URL: https://issues.apache.org/jira/browse/SPARK-21869 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Right now a cached Kafka producer may be closed if a large task uses it for > more than 10 minutes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21885) SQL inferSchema too slow
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148818#comment-16148818 ] Sean Owen commented on SPARK-21885: --- Please fix the title > SQL inferSchema too slow > > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$a
[jira] [Assigned] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
[ https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21886: Assignee: Apache Spark > Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD > logical operator > --- > > Key: SPARK-21886 > URL: https://issues.apache.org/jira/browse/SPARK-21886 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > > While exploring where {{LogicalRDD}} is created I noticed that there are a > few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is > to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
[ https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148817#comment-16148817 ] Apache Spark commented on SPARK-21886: -- User 'jaceklaskowski' has created a pull request for this issue: https://github.com/apache/spark/pull/19095 > Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD > logical operator > --- > > Key: SPARK-21886 > URL: https://issues.apache.org/jira/browse/SPARK-21886 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > While exploring where {{LogicalRDD}} is created I noticed that there are a > few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is > to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
[ https://issues.apache.org/jira/browse/SPARK-21886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21886: Assignee: (was: Apache Spark) > Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD > logical operator > --- > > Key: SPARK-21886 > URL: https://issues.apache.org/jira/browse/SPARK-21886 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > While exploring where {{LogicalRDD}} is created I noticed that there are a > few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is > to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21886) Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator
Jacek Laskowski created SPARK-21886: --- Summary: Use SparkSession.internalCreateDataFrame to create Dataset with LogicalRDD logical operator Key: SPARK-21886 URL: https://issues.apache.org/jira/browse/SPARK-21886 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor While exploring where {{LogicalRDD}} is created I noticed that there are a few places that beg for {{SparkSession.internalCreateDataFrame}}. The task is to simply re-use the method wherever possible. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21884: Assignee: Apache Spark > Fix StackOverflowError on MetadataOnlyQuery > --- > > Key: SPARK-21884 > URL: https://issues.apache.org/jira/browse/SPARK-21884 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue aims to fix StackOverflowError in branch-2.2. In Apache master > branch, it doesn't throw StackOverflowError. > {code} > scala> spark.version > res0: String = 2.2.0 > scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY > (p)") > res1: org.apache.spark.sql.DataFrame = [] > scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION > (p=$p)")) > scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect > java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21884: Assignee: (was: Apache Spark) > Fix StackOverflowError on MetadataOnlyQuery > --- > > Key: SPARK-21884 > URL: https://issues.apache.org/jira/browse/SPARK-21884 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > This issue aims to fix StackOverflowError in branch-2.2. In Apache master > branch, it doesn't throw StackOverflowError. > {code} > scala> spark.version > res0: String = 2.2.0 > scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY > (p)") > res1: org.apache.spark.sql.DataFrame = [] > scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION > (p=$p)")) > scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect > java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148794#comment-16148794 ] Apache Spark commented on SPARK-21884: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19094 > Fix StackOverflowError on MetadataOnlyQuery > --- > > Key: SPARK-21884 > URL: https://issues.apache.org/jira/browse/SPARK-21884 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > This issue aims to fix StackOverflowError in branch-2.2. In Apache master > branch, it doesn't throw StackOverflowError. > {code} > scala> spark.version > res0: String = 2.2.0 > scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY > (p)") > res1: org.apache.spark.sql.DataFrame = [] > scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION > (p=$p)")) > scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect > java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21885) SQL inferSchema too slow
[ https://issues.apache.org/jira/browse/SPARK-21885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148793#comment-16148793 ] liupengcheng commented on SPARK-21885: -- [~smilegator] [~dongjoon] [~viirya] > SQL inferSchema too slow > > > Key: SPARK-21885 > URL: https://issues.apache.org/jira/browse/SPARK-21885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0 > Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to > INFER_ONLY >Reporter: liupengcheng > Labels: slow, sql > > Currently, SparkSQL infer schema is too slow, almost take 2 minutes. > I digged into the code, and finally findout the reason: > 1. In the analysis process of LogicalPlan spark will try to infer table > schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and > it will list all the leaf files of the rootPaths(just tableLocation), and > then call `getFileBlockLocations` to turn `FileStatus` into > `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files > will take a long time, and it seems that the locations info is never used. > 2. When infer a parquet schema, if there is only one file, it will still > launch a spark job to merge schema. I think it's expensive. > Time costly stack is as follow: > {code:java} > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) > at > org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collecti
[jira] [Created] (SPARK-21885) SQL inferSchema too slow
liupengcheng created SPARK-21885: Summary: SQL inferSchema too slow Key: SPARK-21885 URL: https://issues.apache.org/jira/browse/SPARK-21885 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.3.0 Environment: `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY Reporter: liupengcheng Currently, SparkSQL infer schema is too slow, almost take 2 minutes. I digged into the code, and finally findout the reason: 1. In the analysis process of LogicalPlan spark will try to infer table schema if `spark.sql.hive.caseSensitiveInferenceMode` set to INFER_ONLY, and it will list all the leaf files of the rootPaths(just tableLocation), and then call `getFileBlockLocations` to turn `FileStatus` into `LocatedFileStatus`. This `getFileBlockLocations` for so manny leaf files will take a long time, and it seems that the locations info is never used. 2. When infer a parquet schema, if there is only one file, it will still launch a spark job to merge schema. I think it's expensive. Time costly stack is as follow: {code:java} at org.apache.hadoop.ipc.Client.call(Client.java:1403) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:259) at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy17.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1234) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1224) at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1274) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:217) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:427) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles$3.apply(PartitioningAwareFileIndex.scala:410) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$listLeafFiles(PartitioningAwareFileIndex.scala:410) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:302) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles$1.apply(PartitioningAwareFileIndex.scala:301) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at
[jira] [Updated] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
[ https://issues.apache.org/jira/browse/SPARK-21884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21884: -- Description: This issue aims to fix StackOverflowError in branch-2.2. In Apache master branch, it doesn't throw StackOverflowError. {code} scala> spark.version res0: String = 2.2.0 scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY (p)") res1: org.apache.spark.sql.DataFrame = [] scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION (p=$p)")) scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect java.lang.StackOverflowError at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) {code} was: {code} scala> spark.version res0: String = 2.2.0 scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY (p)") res1: org.apache.spark.sql.DataFrame = [] scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION (p=$p)")) scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect java.lang.StackOverflowError at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) {code} > Fix StackOverflowError on MetadataOnlyQuery > --- > > Key: SPARK-21884 > URL: https://issues.apache.org/jira/browse/SPARK-21884 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > This issue aims to fix StackOverflowError in branch-2.2. In Apache master > branch, it doesn't throw StackOverflowError. > {code} > scala> spark.version > res0: String = 2.2.0 > scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY > (p)") > res1: org.apache.spark.sql.DataFrame = [] > scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION > (p=$p)")) > scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect > java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21884) Fix StackOverflowError on MetadataOnlyQuery
Dongjoon Hyun created SPARK-21884: - Summary: Fix StackOverflowError on MetadataOnlyQuery Key: SPARK-21884 URL: https://issues.apache.org/jira/browse/SPARK-21884 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Dongjoon Hyun {code} scala> spark.version res0: String = 2.2.0 scala> sql("CREATE TABLE t_1000 (a INT, p INT) USING PARQUET PARTITIONED BY (p)") res1: org.apache.spark.sql.DataFrame = [] scala> (1 to 1000).foreach(p => sql(s"ALTER TABLE t_1000 ADD PARTITION (p=$p)")) scala> sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect java.lang.StackOverflowError at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
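Until the branch-2.2 fix is in, one possible workaround (my assumption, not something stated in this ticket) is to disable the metadata-only optimization that this query goes through, since the failure occurs in that specialized path.

{code:scala}
// Sketch of a possible workaround: fall back to a normal scan for the distinct-partition count
// by turning off the metadata-only optimization (existing SQLConf flag, default true).
spark.conf.set("spark.sql.optimizer.metadataOnly", "false")
sql("SELECT COUNT(DISTINCT p) FROM t_1000").collect()
{code}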
[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148746#comment-16148746 ] Adrien Lavoillotte commented on SPARK-21850: [~aokolnychyi] aren't you supposed to double the escape: once for the scala string, once for the SQL layer? {code:none} spark.sql("SELECT * FROM t1 WHERE col2 = 'n'").show() +++ |col1|col2| +++ | 1| \n| +++ {code} > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148713#comment-16148713 ] Saisai Shao commented on SPARK-11574: - Thanks a lot [~srowen]! > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin >Assignee: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-11574: - Assignee: Xiaofeng Lin > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin >Assignee: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148707#comment-16148707 ] Sean Owen commented on SPARK-11574: --- [~jerryshao] I'll make you a JIRA admin so you can add the user to the "Contributor" role. It won't show up otherwise. > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148705#comment-16148705 ] Anton Okolnychyi commented on SPARK-21850: -- Then we should not be bound to the LIKE case only. I am wondering what is expected in the code below? {code} Seq((1, "\\n")).toDF("col1", "col2").write.saveAsTable("t1") spark.sql("SELECT * FROM t1").show() +++ |col1|col2| +++ | 1| \n| +++ spark.sql("SELECT * FROM t1 WHERE col2 = '\\n'").show() // does it give a wrong result? +++ |col1|col2| +++ +++ spark.sql("SELECT * FROM t1 WHERE col2 = '\n'").show() +++ |col1|col2| +++ +++ {code} Just to mention, [PR#15398|https://github.com/apache/spark/pull/15398] was also merged in 2.1 (might not be publicly available yet). I got the same exception in the 2.1 branch using the following code: {code} Seq((1, "\\n")).toDF("col1", "col2").write.saveAsTable("t1") spark.sql("SELECT * FROM t1").show() spark.sql("SELECT * FROM t1 WHERE col2 LIKE CONCAT(col2, '%')").show() {code} > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). 
If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
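One way to take the Scala-string layer out of the picture when reasoning about these results is to build the same predicate through the Column API, or to pass the SQL through a triple-quoted string so Scala does not consume the backslashes. The sketch below only illustrates the escaping levels and is not a fix for the LIKE regression itself.

{code:scala}
import org.apache.spark.sql.functions.col
import spark.implicits._

// "\\n" in Scala source is two characters: a backslash followed by 'n'.
val df = Seq((1, "\\n")).toDF("col1", "col2")

// Column API: no SQL parser involved, so only Scala escaping applies -> matches row 1.
df.filter(col("col2") === "\\n").show()

// Through the SQL parser the literal needs its own escaping level; a triple-quoted Scala string
// leaves the backslashes intact, so the parser sees '\\n' and unescapes it to backslash + n.
df.createOrReplaceTempView("t1")
spark.sql("""SELECT * FROM t1 WHERE col2 = '\\n'""").show()
{code}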
[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box
[ https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148701#comment-16148701 ] Saisai Shao commented on SPARK-11574: - Ping [~srowen]. Hi Sean, I cannot assign this JIRA to Xiaofeng since I cannot find his name in the prompt list. Would you please help check it and, if possible, assign the JIRA to him? I have already merged it on GitHub. Thanks a lot! > Spark should support StatsD sink out of box > --- > > Key: SPARK-11574 > URL: https://issues.apache.org/jira/browse/SPARK-11574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Xiaofeng Lin > Fix For: 2.3.0 > > > In order to run spark in production, monitoring is essential. StatsD is such > a common metric reporting mechanism that it should be supported out of the > box. This will enable publishing metrics to monitoring services like > datadog, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18523) OOM killer may leave SparkContext in broken state causing Connection Refused errors
[ https://issues.apache.org/jira/browse/SPARK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148698#comment-16148698 ] Alexander Shorin commented on SPARK-18523: -- [~kadeng] I don't have a 2.2.0 in production for now (shame on me!), but will check this. Thanks for report! > OOM killer may leave SparkContext in broken state causing Connection Refused > errors > --- > > Key: SPARK-18523 > URL: https://issues.apache.org/jira/browse/SPARK-18523 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Alexander Shorin >Assignee: Alexander Shorin > Fix For: 2.1.0 > > > When you run some memory-heavy spark job, Spark driver may consume more > memory resources than host available to provide. > In this case OOM killer comes on scene and successfully kills a spark-submit > process. > The pyspark.SparkContext is not able to handle such state of things and > becomes completely broken. > You cannot stop it as on stop it tries to call stop method of bounded java > context (jsc) and fails with Py4JError, because such process no longer exists > as like as the connection to it. > You cannot start new SparkContext because you have your broken one as active > one and pyspark still is not able to not have SparkContext as sort of > singleton. > The only thing you can do is shutdown your IPython Notebook and start it > over. Or dive into SparkContext internal attributes and reset them manually > to initial None state. > The OOM killer case is just one of the many: any reason of spark-submit crash > in the middle of something leaves SparkContext in a broken state. > Example on error log on {{sc.stop()}} in broken state: > {code} > ERROR:root:Exception while sending command. > Traceback (most recent call last): > File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line > 883, in send_command > response = connection.send_command(command) > File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line > 1040, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > Py4JNetworkError: Error while receiving > ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java > server (127.0.0.1:59911) > Traceback (most recent call last): > File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line > 963, in start > self.socket.connect((self.address, self.port)) > File "/usr/local/lib/python2.7/socket.py", line 224, in meth > return getattr(self._sock,name)(*args) > error: [Errno 61] Connection refused > --- > Py4JError Traceback (most recent call last) > in () > > 1 sc.stop() > /usr/local/share/spark/python/pyspark/context.py in stop(self) > 360 """ > 361 if getattr(self, "_jsc", None): > --> 362 self._jsc.stop() > 363 self._jsc = None > 364 if getattr(self, "_accumulatorServer", None): > /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in > __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 43 def deco(*a, **kw): > 44 try: > ---> 45 return f(*a, **kw) > 46 except py4j.protocol.Py4JJavaError as e: > 47 s = e.java_exception.toString() > /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in > get_return_value(answer, gateway_client, target_id, name) > 325 raise Py4JError( > 326 "An error occurred while 
calling {0}{1}{2}". > --> 327 format(target_id, ".", name)) > 328 else: > 329 type = answer[1] > Py4JError: An error occurred while calling o47.stop > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21883) when a stage failed,should remove it's child stages's pending state on spark UI,and mark jobs which this stage belongs to failed
[ https://issues.apache.org/jira/browse/SPARK-21883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21883. --- Resolution: Invalid I'm going to preemptively close this because it's poorly described, is reported versus a stale version and branch, and sounds like it might be a duplicate of other similar JIRAs. > when a stage failed,should remove it's child stages's pending state on spark > UI,and mark jobs which this stage belongs to failed > - > > Key: SPARK-21883 > URL: https://issues.apache.org/jira/browse/SPARK-21883 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: xianlongZhang > > When a stage failed ,we should remove it's child stages on “Pending” tab on > spark UI,and mark jobs which this stage belongs to failed,then response > client,usually ,this is ok, but with fair-schedule to run spark thrift > server, multiple sql concurrent inquiries, sometimes ,when task failed with > "FetchFailed" state,spark did not abort the failed stage and mark the job > failed, the job will hang up -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21883) when a stage failed,should remove it's child stages's pending state on spark UI,and mark jobs which this stage belongs to failed
xianlongZhang created SPARK-21883: - Summary: when a stage failed,should remove it's child stages's pending state on spark UI,and mark jobs which this stage belongs to failed Key: SPARK-21883 URL: https://issues.apache.org/jira/browse/SPARK-21883 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Reporter: xianlongZhang When a stage fails, we should remove its child stages from the “Pending” tab on the Spark UI and mark the jobs this stage belongs to as failed, then respond to the client. Usually this works, but when running the Spark Thrift Server with fair scheduling and multiple concurrent SQL queries, sometimes when a task fails with a "FetchFailed" state, Spark does not abort the failed stage or mark the job as failed, and the job hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148643#comment-16148643 ] Sean Owen commented on SPARK-21866: --- It makes some sense. I guess I'm mostly trying to match up the scope that a SPIP implies, with the relatively simple functionality here. Is this not just about a page of code to call ImageIO to parse a BufferedImage and to map its fields to a Row? That does look like the substance of https://github.com/Microsoft/spark-images/blob/master/src/main/scala/org/apache/spark/image/ImageSchema.scala Well, maybe this is just a really small SPIP. Also why: "This Spark package will also be published in a binary form on spark-packages.org ." It'd be discontinued and included in Spark right, like with the CSV parser? > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. 
Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > **
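To give a sense of the "page of code" being discussed in the comment above, here is a rough sketch of decoding one image with ImageIO and flattening it into a Row. The field order and names are invented for illustration and are not the schema proposed in the SPIP.

{code:scala}
import java.io.File
import javax.imageio.ImageIO
import org.apache.spark.sql.Row

// Rough sketch only: decode a single image and pack it as (origin, height, width, nChannels, data).
// Assumes a plain 3-channel RGB image; real code would handle other BufferedImage types.
def imageToRow(path: String): Row = {
  val img = ImageIO.read(new File(path))
  val (h, w) = (img.getHeight, img.getWidth)
  val data = new Array[Byte](h * w * 3)        // channel-interleaved bytes
  var i = 0
  for (y <- 0 until h; x <- 0 until w) {
    val rgb = img.getRGB(x, y)                 // packed 0xAARRGGBB int
    data(i) = ((rgb >> 16) & 0xFF).toByte      // R
    data(i + 1) = ((rgb >> 8) & 0xFF).toByte   // G
    data(i + 2) = (rgb & 0xFF).toByte          // B
    i += 3
  }
  Row(path, h, w, 3, data)
}
{code}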
[jira] [Resolved] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors
[ https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21881. --- Resolution: Duplicate Let's not fork the discussion. Post on the original issue and someone can reopen if they believe it's valid. All the better if you have a pull request. > Again: OOM killer may leave SparkContext in broken state causing Connection > Refused errors > -- > > Key: SPARK-21881 > URL: https://issues.apache.org/jira/browse/SPARK-21881 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Kai Londenberg >Assignee: Alexander Shorin > > This is a duplicate of SPARK-18523, which was not really fixed for me > (PySpark 2.2.0, Python 3.5, py4j 0.10.4 ) > *Original Summary:* > When you run some memory-heavy spark job, Spark driver may consume more > memory resources than host available to provide. > In this case OOM killer comes on scene and successfully kills a spark-submit > process. > The pyspark.SparkContext is not able to handle such state of things and > becomes completely broken. > You cannot stop it as on stop it tries to call stop method of bounded java > context (jsc) and fails with Py4JError, because such process no longer exists > as like as the connection to it. > You cannot start new SparkContext because you have your broken one as active > one and pyspark still is not able to not have SparkContext as sort of > singleton. > The only thing you can do is shutdown your IPython Notebook and start it > over. Or dive into SparkContext internal attributes and reset them manually > to initial None state. > The OOM killer case is just one of the many: any reason of spark-submit crash > in the middle of something leaves SparkContext in a broken state. > *Latest Comment* > In PySpark 2.2.0 this issue was not really fixed. While I could close the > SparkContext (with an Exception message, but it was closed afterwards), I > could not reopen any new spark contexts. > *Current Workaround* > If I resetted the global SparkContext variables like this, it worked : > {code:none} > def reset_spark(): > import pyspark > from threading import RLock > pyspark.SparkContext._jvm = None > pyspark.SparkContext._gateway = None > pyspark.SparkContext._next_accum_id = 0 > pyspark.SparkContext._active_spark_context = None > pyspark.SparkContext._lock = RLock() > pyspark.SparkContext._python_includes = None > reset_spark() > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14327) Scheduler holds locks which cause huge scheduler delays and executor timeouts
[ https://issues.apache.org/jira/browse/SPARK-14327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148622#comment-16148622 ] Fei Chen commented on SPARK-14327: -- I also came across this problem. In TaskScheduler, a thread which holds the lock is blocked by some unknown factors. Other threads must wait until the lock is released. As a result, the scheduler delay takes up tens of seconds or even several minutes. How can I solve this problem? > Scheduler holds locks which cause huge scheduler delays and executor timeouts > - > > Key: SPARK-14327 > URL: https://issues.apache.org/jira/browse/SPARK-14327 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1 >Reporter: Chris Bannister > Attachments: driver.jstack > > > I have a job which after a while in one of its stages grinds to a halt, from > processing around 300k tasks in 15 minutes to less than 1000 in the next > hour. The driver ends up using 100% CPU on a single core (out of 4) and the > executors start failing to receive heartbeat responses, tasks are not > scheduled and results trickle in. > For this stage the max scheduler delay is 15 minutes, and the 75th percentile > is 4ms. > It appears that TaskSchedulerImpl does most of its work whilst holding the > global synchronised lock for the class, this synchronised lock is shared > between at least, > TaskSetManager.canFetchMoreResults > TaskSchedulerImpl.handleSuccessfulTask > TaskSchedulerImpl.executorHeartbeatReceived > TaskSchedulerImpl.statusUpdate > TaskSchedulerImpl.checkSpeculatableTasks > This looks to severely limit the latency and throughput of the scheduler, and > causes my job to straight up fail due to taking too long. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
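To make the contention pattern described in this issue concrete, a stripped-down sketch follows; the class and method bodies are placeholders for illustration, not the real TaskSchedulerImpl code.

{code:scala}
// Illustration only: every method synchronizes on the same object monitor, so one slow
// statusUpdate serializes heartbeat handling and result handling behind it.
class ToyScheduler {
  def statusUpdate(taskId: Long): Unit = synchronized {
    Thread.sleep(50)  // stand-in for slow per-task bookkeeping done under the lock
  }
  def executorHeartbeatReceived(execId: String): Unit = synchronized {
    // blocked while statusUpdate holds the monitor -> heartbeats appear to time out
  }
  def handleSuccessfulTask(taskId: Long): Unit = synchronized {
    // also serialized behind the same monitor
  }
}
{code}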
[jira] [Updated] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
[ https://issues.apache.org/jira/browse/SPARK-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linxiaojun updated SPARK-21882: --- Attachment: SPARK-21882.patch SPARK-21882.patch for Spark-1.6.1 > OutputMetrics doesn't count written bytes correctly in the > saveAsHadoopDataset function > --- > > Key: SPARK-21882 > URL: https://issues.apache.org/jira/browse/SPARK-21882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.2.0 >Reporter: linxiaojun >Priority: Minor > Attachments: SPARK-21882.patch > > > The first job called from saveAsHadoopDataset, running in each executor, does > not calculate the writtenBytes of OutputMetrics correctly. The reason is that > we did not initialize the callback function called to find bytes written in > the right way. As usual, statisticsTable which records statistics in a > FileSystem must be initialized at the beginning (this will be triggered when > open SparkHadoopWriter). The solution for this issue is to adjust the order > of callback function initialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
linxiaojun created SPARK-21882: -- Summary: OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function Key: SPARK-21882 URL: https://issues.apache.org/jira/browse/SPARK-21882 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0, 1.6.1 Reporter: linxiaojun Priority: Minor The first job called from saveAsHadoopDataset, running in each executor, does not calculate the writtenBytes of OutputMetrics correctly. The reason is that the callback function used to find the bytes written is not initialized in the right order. As usual, the statisticsTable that records statistics in a FileSystem must be initialized at the beginning (this is triggered when the SparkHadoopWriter is opened). The solution for this issue is to adjust the order of callback function initialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
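A minimal sketch of the ordering problem described above; the function and parameter names are invented for illustration and do not mirror the actual SparkHadoopWriter internals.

{code:scala}
// Illustration only: if the bytes-written callback is looked up before the writer has touched
// the FileSystem, it binds to empty statistics and keeps reporting 0. Looking it up after the
// writer is opened gives it live statistics to read.
def runTask(openWriter: () => Unit, lookupBytesWritten: () => () => Long): Long = {
  // Problematic order (per this report):
  //   val bytesWritten = lookupBytesWritten()
  //   openWriter()

  // Adjusted order:
  openWriter()
  val bytesWritten = lookupBytesWritten()
  bytesWritten()
}
{code}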
[jira] [Updated] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors
[ https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Londenberg updated SPARK-21881: --- Description: This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 2.2.0, Python 3.5, py4j 0.10.4 ) *Original Summary:* When you run some memory-heavy spark job, Spark driver may consume more memory resources than host available to provide. In this case OOM killer comes on scene and successfully kills a spark-submit process. The pyspark.SparkContext is not able to handle such state of things and becomes completely broken. You cannot stop it as on stop it tries to call stop method of bounded java context (jsc) and fails with Py4JError, because such process no longer exists as like as the connection to it. You cannot start new SparkContext because you have your broken one as active one and pyspark still is not able to not have SparkContext as sort of singleton. The only thing you can do is shutdown your IPython Notebook and start it over. Or dive into SparkContext internal attributes and reset them manually to initial None state. The OOM killer case is just one of the many: any reason of spark-submit crash in the middle of something leaves SparkContext in a broken state. *Latest Comment* In PySpark 2.2.0 this issue was not really fixed. While I could close the SparkContext (with an Exception message, but it was closed afterwards), I could not reopen any new spark contexts. *Current Workaround* If I resetted the global SparkContext variables like this, it worked : {code:none} def reset_spark(): import pyspark from threading import RLock pyspark.SparkContext._jvm = None pyspark.SparkContext._gateway = None pyspark.SparkContext._next_accum_id = 0 pyspark.SparkContext._active_spark_context = None pyspark.SparkContext._lock = RLock() pyspark.SparkContext._python_includes = None reset_spark() {code} was: This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 2.2.0, Python 3.5, py4j 0.10.4 ) *Original Summary:* When you run some memory-heavy spark job, Spark driver may consume more memory resources than host available to provide. In this case OOM killer comes on scene and successfully kills a spark-submit process. The pyspark.SparkContext is not able to handle such state of things and becomes completely broken. You cannot stop it as on stop it tries to call stop method of bounded java context (jsc) and fails with Py4JError, because such process no longer exists as like as the connection to it. You cannot start new SparkContext because you have your broken one as active one and pyspark still is not able to not have SparkContext as sort of singleton. The only thing you can do is shutdown your IPython Notebook and start it over. Or dive into SparkContext internal attributes and reset them manually to initial None state. The OOM killer case is just one of the many: any reason of spark-submit crash in the middle of something leaves SparkContext in a broken state. *Latest Comment (for PySpark 2.2.0) * In PySpark 2.2.0 this issue was not really fixed. While I could close the SparkContext (with an Exception message, but it was closed afterwards), I could not reopen any new spark contexts. 
*Current Workaround* If I resetted the global SparkContext variables like this, it worked : {code:none} def reset_spark(): import pyspark from threading import RLock pyspark.SparkContext._jvm = None pyspark.SparkContext._gateway = None pyspark.SparkContext._next_accum_id = 0 pyspark.SparkContext._active_spark_context = None pyspark.SparkContext._lock = RLock() pyspark.SparkContext._python_includes = None reset_spark() {code} > Again: OOM killer may leave SparkContext in broken state causing Connection > Refused errors > -- > > Key: SPARK-21881 > URL: https://issues.apache.org/jira/browse/SPARK-21881 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Kai Londenberg >Assignee: Alexander Shorin > > This is a duplicate of SPARK-18523, which was not really fixed for me > (PySpark 2.2.0, Python 3.5, py4j 0.10.4 ) > *Original Summary:* > When you run some memory-heavy spark job, Spark driver may consume more > memory resources than host available to provide. > In this case OOM killer comes on scene and successfully kills a spark-submit > process. > The pyspark.SparkContext is not able to handle such state of things and > becomes completely broken. > You cannot stop it as on stop it tries to call stop method of bounded java > context (jsc) and fails with Py4JError, because such process no longer exists > as like as the connection to it. > You canno
[jira] [Updated] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors
[ https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Londenberg updated SPARK-21881: --- Affects Version/s: (was: 1.6.1) (was: 2.0.0) > Again: OOM killer may leave SparkContext in broken state causing Connection > Refused errors > -- > > Key: SPARK-21881 > URL: https://issues.apache.org/jira/browse/SPARK-21881 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Kai Londenberg >Assignee: Alexander Shorin > > This is a duplicate of SPARK-18523, which was not really fixed for me > (PySpark 2.2.0, Python 3.5, py4j 0.10.4 ) > *Original Summary:* > When you run some memory-heavy spark job, Spark driver may consume more > memory resources than host available to provide. > In this case OOM killer comes on scene and successfully kills a spark-submit > process. > The pyspark.SparkContext is not able to handle such state of things and > becomes completely broken. > You cannot stop it as on stop it tries to call stop method of bounded java > context (jsc) and fails with Py4JError, because such process no longer exists > as like as the connection to it. > You cannot start new SparkContext because you have your broken one as active > one and pyspark still is not able to not have SparkContext as sort of > singleton. > The only thing you can do is shutdown your IPython Notebook and start it > over. Or dive into SparkContext internal attributes and reset them manually > to initial None state. > The OOM killer case is just one of the many: any reason of spark-submit crash > in the middle of something leaves SparkContext in a broken state. > *Latest Comment* > In PySpark 2.2.0 this issue was not really fixed. While I could close the > SparkContext (with an Exception message, but it was closed afterwards), I > could not reopen any new spark contexts. > *Current Workaround* > If I resetted the global SparkContext variables like this, it worked : > {code:none} > def reset_spark(): > import pyspark > from threading import RLock > pyspark.SparkContext._jvm = None > pyspark.SparkContext._gateway = None > pyspark.SparkContext._next_accum_id = 0 > pyspark.SparkContext._active_spark_context = None > pyspark.SparkContext._lock = RLock() > pyspark.SparkContext._python_includes = None > reset_spark() > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21881) Again: OOM killer may leave SparkContext in broken state causing Connection Refused errors
[ https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai Londenberg updated SPARK-21881: --- Description: This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 2.2.0, Python 3.5, py4j 0.10.4 ) *Original Summary:* When you run some memory-heavy spark job, Spark driver may consume more memory resources than host available to provide. In this case OOM killer comes on scene and successfully kills a spark-submit process. The pyspark.SparkContext is not able to handle such state of things and becomes completely broken. You cannot stop it as on stop it tries to call stop method of bounded java context (jsc) and fails with Py4JError, because such process no longer exists as like as the connection to it. You cannot start new SparkContext because you have your broken one as active one and pyspark still is not able to not have SparkContext as sort of singleton. The only thing you can do is shutdown your IPython Notebook and start it over. Or dive into SparkContext internal attributes and reset them manually to initial None state. The OOM killer case is just one of the many: any reason of spark-submit crash in the middle of something leaves SparkContext in a broken state. *Latest Comment (for PySpark 2.2.0) * In PySpark 2.2.0 this issue was not really fixed. While I could close the SparkContext (with an Exception message, but it was closed afterwards), I could not reopen any new spark contexts. *Current Workaround* If I resetted the global SparkContext variables like this, it worked : {code:none} def reset_spark(): import pyspark from threading import RLock pyspark.SparkContext._jvm = None pyspark.SparkContext._gateway = None pyspark.SparkContext._next_accum_id = 0 pyspark.SparkContext._active_spark_context = None pyspark.SparkContext._lock = RLock() pyspark.SparkContext._python_includes = None reset_spark() {code} was: When you run some memory-heavy spark job, Spark driver may consume more memory resources than host available to provide. In this case OOM killer comes on scene and successfully kills a spark-submit process. The pyspark.SparkContext is not able to handle such state of things and becomes completely broken. You cannot stop it as on stop it tries to call stop method of bounded java context (jsc) and fails with Py4JError, because such process no longer exists as like as the connection to it. You cannot start new SparkContext because you have your broken one as active one and pyspark still is not able to not have SparkContext as sort of singleton. The only thing you can do is shutdown your IPython Notebook and start it over. Or dive into SparkContext internal attributes and reset them manually to initial None state. The OOM killer case is just one of the many: any reason of spark-submit crash in the middle of something leaves SparkContext in a broken state. Example on error log on {{sc.stop()}} in broken state: {code} ERROR:root:Exception while sending command. 
Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 883, in send_command response = connection.send_command(command) File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1040, in send_command "Error while receiving", e, proto.ERROR_ON_RECEIVE) Py4JNetworkError: Error while receiving ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:59911) Traceback (most recent call last): File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, in start self.socket.connect((self.address, self.port)) File "/usr/local/lib/python2.7/socket.py", line 224, in meth return getattr(self._sock,name)(*args) error: [Errno 61] Connection refused --- Py4JError Traceback (most recent call last) in () > 1 sc.stop() /usr/local/share/spark/python/pyspark/context.py in stop(self) 360 """ 361 if getattr(self, "_jsc", None): --> 362 self._jsc.stop() 363 self._jsc = None 364 if getattr(self, "_accumulatorServer", None): /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() /usr/local/lib/pytho