[jira] [Commented] (SPARK-4854) Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI
[ https://issues.apache.org/jira/browse/SPARK-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247938#comment-14247938 ] Shenghua Wan commented on SPARK-4854: - I was trying to debug the Spark source code to fix this issue, and I found that function name resolution may be what breaks: in the normal case (SELECT + custom UDTF) the function name is resolved to the fully qualified class name, but in the SELECT + LATERAL VIEW + custom UDTF case it is resolved only to the alias given in the CREATE TEMPORARY FUNCTION clause, not to the fully qualified class name. In other words, the function-name-to-class-name translation fails in that situation. A workaround discovered while debugging the Spark source code is to remove the package declaration (e.g. org.xxx) from the Java code of your custom UDTF, so that its simple class name equals its fully qualified name, and then use that class name as the alias in the CREATE TEMPORARY FUNCTION clause. Even though Spark still uses the unresolved alias, the alias can then be resolved in the default namespace (see the sketch after the stack trace below). Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI - Key: SPARK-4854 URL: https://issues.apache.org/jira/browse/SPARK-4854 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.1.1 Reporter: Shenghua Wan Hello, I met a problem when using the Spark SQL CLI. A custom UDTF with LATERAL VIEW throws a ClassNotFound exception. I did a couple of experiments in the same environment (Spark versions 1.1.0, 1.1.1): select + custom UDTF (passed); select + lateral view + custom UDTF (ClassNotFoundException); select + lateral view + built-in UDTF (passed). I have done some googling these days and found one related Spark issue ticket, https://issues.apache.org/jira/browse/SPARK-4811, which is about Custom UDTFs not working in Spark SQL. It would be helpful to put actual code here to reproduce the problem, but corporate regulations might prohibit this, so sorry about that. Directly using explode's source code in a jar will reproduce the problem anyway.
Here is a portion of stack print when exception, just in case: java.lang.ClassNotFoundException: XXX at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:81) at org.apache.spark.sql.hive.HiveGenericUdtf.createFunction(hiveUdfs.scala:247) at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:254) at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:254) at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors$lzycompute(hiveUdfs.scala:261) at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors(hiveUdfs.scala:260) at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:265) at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:265) at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:269) at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50) at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:50) at org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) the rest is omitted.
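A minimal sketch of the workaround described in the comment above, assuming a hypothetical UDTF class MyExplode compiled without a package declaration into /path/to/my-udtf.jar; the class name, jar path, and the logs table are illustrative stand-ins, not code from the reporter.
{code}
// Hedged sketch only: MyExplode, the jar path, and the `logs` table are
// hypothetical stand-ins for the reporter's (unshared) custom UDTF setup.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object UdtfWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udtf-workaround"))
    val hive = new HiveContext(sc)

    hive.sql("ADD JAR /path/to/my-udtf.jar")
    // Because MyExplode lives in the default package, its simple name equals its
    // fully qualified class name, so the alias below is itself a loadable class name.
    hive.sql("CREATE TEMPORARY FUNCTION MyExplode AS 'MyExplode'")
    // LATERAL VIEW now works even though only the alias (not the FQCN) is kept.
    hive.sql("SELECT id, item FROM logs LATERAL VIEW MyExplode(items) t AS item")
      .collect()
      .foreach(println)
    sc.stop()
  }
}
{code}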
[jira] [Commented] (SPARK-4857) Add Executor Events to SparkListener
[ https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247955#comment-14247955 ] Apache Spark commented on SPARK-4857: - User 'ksakellis' has created a pull request for this issue: https://github.com/apache/spark/pull/3711 Add Executor Events to SparkListener Key: SPARK-4857 URL: https://issues.apache.org/jira/browse/SPARK-4857 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis We need to add events to the SparkListener to indicate an executor has been added or removed with corresponding information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-4341: - Attachment: SPARK-4341.diff Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen Attachments: SPARK-4341.diff A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory, and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set it automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
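For context, a short sketch of the manual tuning this issue wants to automate. The spark-submit flags in the comments are the standard ones; the configuration keys shown are assumed YARN-mode properties, and the numbers are arbitrary, not values from the attached diff.
{code}
// Sketch only: today a YARN user picks these resource numbers by hand,
// e.g. spark-submit --num-executors 20 --executor-memory 4g --executor-cores 2.
import org.apache.spark.{SparkConf, SparkContext}

object ManualResourceTuning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("manual-resource-tuning")
      .set("spark.executor.instances", "20") // assumed key for --num-executors
      .set("spark.executor.memory", "4g")    // --executor-memory
      .set("spark.executor.cores", "2")      // --executor-cores
    val sc = new SparkContext(conf)
    // The proposal: when the executor count is unset, derive it from the number
    // of input partitions instead of requiring the user to guess.
    sc.stop()
  }
}
{code}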
[jira] [Resolved] (SPARK-2988) Port repl to scala 2.11.
[ https://issues.apache.org/jira/browse/SPARK-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-2988. Resolution: Fixed Port repl to scala 2.11. Key: SPARK-2988 URL: https://issues.apache.org/jira/browse/SPARK-2988 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2987) Adjust build system to support building with scala 2.11 and fix tests.
[ https://issues.apache.org/jira/browse/SPARK-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-2987. Resolution: Fixed Adjust build system to support building with scala 2.11 and fix tests. -- Key: SPARK-2987 URL: https://issues.apache.org/jira/browse/SPARK-2987 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2709) Add a tool for certifying Spark API compatiblity
[ https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-2709. Resolution: Fixed Add a tool for certifying Spark API compatiblity Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma As Spark is packaged by more and more distributors, it would be good to have a tool that verifies the API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the APIs present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking APIs, because anyone can audit their distribution and see that they have removed support for certain APIs. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the Spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come in as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public APIs (we might borrow the MIMA string representation of byte code signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark APIs and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
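A rough sketch of step 3 of the proposal above, under stated assumptions: the manifest format (one className#methodName entry per line) is invented here for illustration and is not part of the issue.
{code}
// Sketch of a reflective manifest check; run it with spark-submit so the
// distribution's own jars are on the classpath. The manifest format is assumed.
import scala.io.Source

object ApiManifestAudit {
  def main(args: Array[String]): Unit = {
    val manifest = Source.fromFile(args(0)).getLines().filter(_.nonEmpty).toSeq
    val missing = manifest.filterNot { entry =>
      val Array(className, methodName) = entry.split("#")
      // An entry passes if the class loads and exposes a public method with that name.
      try Class.forName(className).getMethods.exists(_.getName == methodName)
      catch { case _: Throwable => false }
    }
    if (missing.isEmpty) println(s"All ${manifest.size} manifest entries found")
    else missing.foreach(m => println(s"MISSING: $m"))
  }
}
{code}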
[jira] [Resolved] (SPARK-1338) Create Additional Style Rules
[ https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-1338. Resolution: Fixed This is fixed by updating the scalastyle version. Applying these styles across the code base can be a different issue. Create Additional Style Rules - Key: SPARK-1338 URL: https://issues.apache.org/jira/browse/SPARK-1338 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.2.0 There are a few other rules that would be helpful to have. Also we should add tests for these rules because it's easy to get them wrong. I gave some example comparisons from a JavaScript style checker. Require spaces in type declarations: def foo:String = X // no def foo: String = XXX def x:Int = 100 // no val x: Int = 100 Require spaces after keywords: if(x - 3) // no if (x + 10) See: requireSpaceAfterKeywords from https://github.com/mdevils/node-jscs Disallow spaces inside of parentheses: val x = ( 3 + 5 ) // no val x = (3 + 5) See: disallowSpacesInsideParentheses from https://github.com/mdevils/node-jscs Require spaces before and after binary operators: See: requireSpaceBeforeBinaryOperators See: disallowSpaceAfterBinaryOperators from https://github.com/mdevils/node-jscs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2709) Add a tool for certifying Spark API compatiblity
[ https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248174#comment-14248174 ] Sean Owen commented on SPARK-2709: -- Was this sort of tool implemented? Add a tool for certifying Spark API compatiblity Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma As Spark is packaged by more and more distributors, it would be good to have a tool that verifies the API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the APIs present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking APIs, because anyone can audit their distribution and see that they have removed support for certain APIs. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the Spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come in as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public APIs (we might borrow the MIMA string representation of byte code signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark APIs and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248176#comment-14248176 ] Sean Owen commented on SPARK-3955: -- [~jongyoul] That PR was never merged though: https://github.com/apache/spark/pull/3379 You closed yours too: https://github.com/apache/spark/pull/2818 I don't think this is resolved. Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Build, Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl, which is used by sql/hive/pom.xml. When mvn assembly runs, however, the assembled jackson-core-asl version does not match the jackson-mapper-asl version, because other libraries depend on several versions of Jackson and a different jackson-core-asl gets pulled into the assembly. This problem can be fixed simply by giving pom.xml an explicit version for jackson-core-asl. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the Jackson library properly. {code} [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar. [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4777) Some block memory after unrollSafely not count into used memory(memoryStore.entrys or unrollMemory)
[ https://issues.apache.org/jira/browse/SPARK-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248177#comment-14248177 ] Sean Owen commented on SPARK-4777: -- [~SuYan] Given your comment in your PR, do you intend this to be considered a duplicate of SPARK-3000? Some block memory after unrollSafely not count into used memory(memoryStore.entrys or unrollMemory) --- Key: SPARK-4777 URL: https://issues.apache.org/jira/browse/SPARK-4777 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: SuYan Priority: Minor Some memory is not counted in the memory used by memoryStore or unrollMemory. After thread A unrolls a block safely, it releases 40MB of unrollMemory (which other threads can then use), and then waits to acquire accountingLock so it can tryToPut block A (30MB). Until thread A gets accountingLock, block A's size is not counted in either unrollMemory or memoryStore.currentMemory. IIUC, freeMemory should subtract that block's memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {quote} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {quote} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. was: We have data like: {code:java} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {code:xml} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {quote} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {quote} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {panel} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {panel} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. was: We have data like: {quote} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {quote} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {panel} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {panel} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. was: dWe have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: dWe have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. was: We have data like: {panel} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {panel} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao dWe have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String, and it ignores the real nested data type (struct type headers in line 1), and then we will get the headers as String Type, which is not our expectation. was: We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type headers in line 1), and also take the line 1 (the headers) as String Type, which is not our expected. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String, and it ignores the real nested data type (struct type headers in line 1), and then we will get the headers as String Type, which is not our expectation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4856: - Description: We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String, and it ignores the real nested data type (struct type headers in line 1), and then we will get the headers (in line 1) as String Type, which is not our expectation. was: We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String, and it ignores the real nested data type (struct type headers in line 1), and then we will get the headers as String Type, which is not our expectation. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao We have data like: {noformat} TestSQLContext.sparkContext.parallelize( {ip:27.31.100.29,headers:{Host:1.abc.com,Charset:UTF-8}} :: {ip:27.31.100.29,headers:{}} :: {ip:27.31.100.29,headers:} :: Nil) {noformat} As empty string (the headers) will be considered as String, and it ignores the real nested data type (struct type headers in line 1), and then we will get the headers (in line 1) as String Type, which is not our expectation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
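A self-contained sketch of the scenario in the description, with the JSON literals written out with explicit quoting; it assumes a local master and the SQLContext.jsonRDD API available in the 1.x line.
{code}
// Sketch reproducing the report: the empty-string "headers" in the third record
// makes the inferred type of "headers" degrade to StringType instead of a struct.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonEmptyStringSchema {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-schema").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    val json = sc.parallelize(
      """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
      """{"ip":"27.31.100.29","headers":{}}""" ::
      """{"ip":"27.31.100.29","headers":""}""" :: Nil)

    // Expected: headers inferred as a struct; observed: headers inferred as a string.
    sqlContext.jsonRDD(json).printSchema()
    sc.stop()
  }
}
{code}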
[jira] [Commented] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node
[ https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248367#comment-14248367 ] Harry Brundage commented on SPARK-4732: --- Seems like it would, feel free to mark as duplicate! All application progress on the standalone scheduler can be halted by one systematically faulty node Key: SPARK-4732 URL: https://issues.apache.org/jira/browse/SPARK-4732 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Environment: - Spark Standalone scheduler Reporter: Harry Brundage We've experienced several cluster wide outages caused by unexpected system wide faults on one of our spark workers if that worker is failing systematically. By systematically, I mean that every executor launched by that worker will definitely fail due to some reason out of Spark's control like the log directory disk being completely out of space, or a permissions error for a file that's always read during executor launch. We screw up all the time on our team and cause stuff like this to happen, but because of the way the standalone scheduler allocates resources, our cluster doesn't recover gracefully from these failures. When there are more tasks to do than executors, I am pretty sure the way the scheduler works is that it just waits for more resource offers and then allocates tasks from the queue to those resources. If an executor dies immediately after starting, the worker monitor process will notice that it's dead. The master will allocate that worker's now free cores/memory to a currently running application that is below its spark.cores.max, which in our case I've observed as usually the app that just had the executor die. A new executor gets spawned on the same worker that the last one just died on, gets allocated that one task that failed, and then the whole process fails again for the same systematic reason, and lather rinse repeat. This happens 10 times or whatever the max task failure count is, and then the whole app is deemed a failure by the driver and shut down completely. This happens to us for all applications in the cluster as well. We usually run roughly as many cores as we have hadoop nodes. We also usually have many more input splits than we have tasks, which means the locality of the first few tasks which I believe determines where our executors run is well spread out over the cluster, and often covers 90-100% of nodes. This means the likelihood of any application getting an executor scheduled any broken node is quite high. After an old application goes through the above mentioned process and dies, the next application to start or not be at it's requested max capacity gets an executor scheduled on the broken node, and is promptly taken down as well. This happens over and over as well, to the point where none of our spark jobs are making any progress because of one tiny permissions mistake on one node. Now, I totally understand this is usually an error between keyboard and screen kind of situation where it is the responsibility of the people deploying spark to ensure it is deployed correctly. The systematic issues we've encountered are almost always of this nature: permissions errors, disk full errors, one node not getting a new spark jar from a configuration error, configurations being out of sync, etc. 
That said, disks are going to fail or half fail, fill up, node rot is going to ruin configurations, etc etc etc, and as Hadoop clusters scale in size this becomes more and more likely, so I think it's reasonable to ask that Spark be resilient to this kind of failure and keep on truckin'. I think a good simple fix would be to have applications, or the master, blacklist workers (not executors) at a failure count lower than the task failure count. This would also serve as a belt and suspenders fix for SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a lot, we could still make progress. These blacklist events are really important and I think would need to be well logged and surfaced in the UI, but I'd rather log and carry on than fail hard. I think the tradeoff here is that you risk blacklisting every worker as well if there is something systematically wrong with communication or whatever else I can't imagine. Please let me know if I've misunderstood how the scheduler works or you need more information or anything like that and I'll be happy to provide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Created] (SPARK-4861) Refactory command in spark sql
wangfei created SPARK-4861: -- Summary: Refactory command in spark sql Key: SPARK-4861 URL: https://issues.apache.org/jira/browse/SPARK-4861 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: wangfei Fix For: 1.3.0 Fix a todo in spark sql: remove ```Command``` and use ```RunnableCommand``` instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4861) Refactory command in spark sql
[ https://issues.apache.org/jira/browse/SPARK-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248380#comment-14248380 ] Apache Spark commented on SPARK-4861: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3712 Refactory command in spark sql -- Key: SPARK-4861 URL: https://issues.apache.org/jira/browse/SPARK-4861 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: wangfei Fix For: 1.3.0 Fix a todo in spark sql: remove ```Command``` and use ```RunnableCommand``` instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4862) Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error
Aniket Bhatnagar created SPARK-4862: --- Summary: Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error Key: SPARK-4862 URL: https://issues.apache.org/jira/browse/SPARK-4862 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor If the checkpoint directory is set to a local filesystem path, it results in weird error messages like the following: org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[467] at apply at List.scala:318(0) has different number of partitions than original RDD MapPartitionsRDD[461] at mapPartitions at StateDStream.scala:71(56) It would be great if Spark could output a better error message that hints at what could have gone wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
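For readers hitting this, a hedged sketch of the configuration that triggers the confusing error versus the usual fix; the paths, port, and input source are hypothetical.
{code}
// Sketch only: a stateful streaming job whose checkpoint directory placement
// determines whether the "different number of partitions" error can appear.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object CheckpointDirSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-dir-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Problematic on a multi-node cluster: a plain local path is not shared across
    // executors, so recovered checkpoint RDDs may come back with fewer partitions.
    // ssc.checkpoint("/tmp/spark-checkpoints")

    // Safer: a shared, fault-tolerant filesystem such as HDFS.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
        Some(values.sum + state.getOrElse(0)))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}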
[jira] [Created] (SPARK-4863) Suspicious exception handlers
Ding Yuan created SPARK-4863: Summary: Suspicious exception handlers Key: SPARK-4863 URL: https://issues.apache.org/jira/browse/SPARK-4863 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Ding Yuan Following up with the discussion in https://issues.apache.org/jira/browse/SPARK-1148, I am creating a new JIRA to report the suspicious exception handlers detected by our tool aspirator on spark-1.1.1. == WARNING: TODO; in handler. Line: 129, File: org/apache/thrift/transport/TNonblockingServerSocket.java 122: public void registerSelector(Selector selector) { 123:try { 124: // Register the server socket channel, indicating an interest in 125: // accepting new connections 126: serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT); 127:} catch (ClosedChannelException e) { 128: // this shouldn't happen, ideally... 129: // TODO: decide what to do with this. 130:} 131: } == == WARNING: TODO; in handler. Line: 1583, File: org/apache/spark/SparkContext.scala 1578: val scheduler = try { 1579: val clazz = Class.forName(org.apache.spark.scheduler.cluster.YarnClusterScheduler) 1580: val cons = clazz.getConstructor(classOf[SparkContext]) 1581: cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl] 1582: } catch { 1583: // TODO: Enumerate the exact reasons why it can fail 1584: // But irrespective of it, it means we cannot proceed ! 1585: case e: Exception = { 1586: throw new SparkException(YARN mode not available ?, e) 1587: } == == WARNING 1: empty handler for exception: java.lang.Exception THERE IS NO LOG MESSAGE!!! Line: 75, File: org/apache/spark/repl/ExecutorClassLoader.scala try { val pathInDirectory = name.replace('.', '/') + .class val inputStream = { if (fileSystem != null) { fileSystem.open(new Path(directory, pathInDirectory)) } else { if (SparkEnv.get.securityManager.isAuthenticationEnabled()) { val uri = new URI(classUri + / + urlEncode(pathInDirectory)) val newuri = Utils.constructURIForAuthentication(uri, SparkEnv.get.securityManager) newuri.toURL().openStream() } else { new URL(classUri + / + urlEncode(pathInDirectory)).openStream() } } } val bytes = readAndTransformClass(name, inputStream) inputStream.close() Some(defineClass(name, bytes, 0, bytes.length)) } catch { case e: Exception = None } == == WARNING 1: empty handler for exception: java.io.IOException THERE IS NO LOG MESSAGE!!! Line: 275, File: org/apache/spark/util/Utils.scala try { dir = new File(root, spark- + UUID.randomUUID.toString) if (dir.exists() || !dir.mkdirs()) { dir = null } } catch { case e: IOException = ; } == == WARNING 1: empty handler for exception: java.lang.InterruptedException THERE IS NO LOG MESSAGE!!! Line: 172, File: parquet/org/apache/thrift/server/TNonblockingServer.java protected void joinSelector() { // wait until the selector thread exits try { selectThread_.join(); } catch (InterruptedException e) { // for now, just silently ignore. technically this means we'll have less of // a graceful shutdown as a result. } } == == WARNING 2: empty handler for exception: java.net.SocketException There are log messages.. Line: 111, File: parquet/org/apache/thrift/transport/TNonblockingSocket.java public void setTimeout(int timeout) { try { socketChannel_.socket().setSoTimeout(timeout); } catch (SocketException sx) { LOGGER.warn(Could not set socket timeout., sx); } } == == WARNING 3: empty handler for exception: java.net.SocketException There are log messages.. 
Line: 103, File: parquet/org/apache/thrift/transport/TServerSocket.java public void listen() throws TTransportException { // Make sure not to block on accept if (serverSocket_ != null) { try { serverSocket_.setSoTimeout(0); } catch (SocketException sx) { LOGGER.error(Could not set socket timeout., sx); } } } ==
[jira] [Updated] (SPARK-4863) Suspicious exception handlers
[ https://issues.apache.org/jira/browse/SPARK-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ding Yuan updated SPARK-4863: - Description: Following up with the discussion in https://issues.apache.org/jira/browse/SPARK-1148, I am creating a new JIRA to report the suspicious exception handlers detected by our tool aspirator on spark-1.1.1. {noformat} == WARNING: TODO; in handler. Line: 129, File: org/apache/thrift/transport/TNonblockingServerSocket.java 122: public void registerSelector(Selector selector) { 123:try { 124: // Register the server socket channel, indicating an interest in 125: // accepting new connections 126: serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT); 127:} catch (ClosedChannelException e) { 128: // this shouldn't happen, ideally... 129: // TODO: decide what to do with this. 130:} 131: } == == WARNING: TODO; in handler. Line: 1583, File: org/apache/spark/SparkContext.scala 1578: val scheduler = try { 1579: val clazz = Class.forName(org.apache.spark.scheduler.cluster.YarnClusterScheduler) 1580: val cons = clazz.getConstructor(classOf[SparkContext]) 1581: cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl] 1582: } catch { 1583: // TODO: Enumerate the exact reasons why it can fail 1584: // But irrespective of it, it means we cannot proceed ! 1585: case e: Exception = { 1586: throw new SparkException(YARN mode not available ?, e) 1587: } == == WARNING 1: empty handler for exception: java.lang.Exception THERE IS NO LOG MESSAGE!!! Line: 75, File: org/apache/spark/repl/ExecutorClassLoader.scala try { val pathInDirectory = name.replace('.', '/') + .class val inputStream = { if (fileSystem != null) { fileSystem.open(new Path(directory, pathInDirectory)) } else { if (SparkEnv.get.securityManager.isAuthenticationEnabled()) { val uri = new URI(classUri + / + urlEncode(pathInDirectory)) val newuri = Utils.constructURIForAuthentication(uri, SparkEnv.get.securityManager) newuri.toURL().openStream() } else { new URL(classUri + / + urlEncode(pathInDirectory)).openStream() } } } val bytes = readAndTransformClass(name, inputStream) inputStream.close() Some(defineClass(name, bytes, 0, bytes.length)) } catch { case e: Exception = None } == == WARNING 1: empty handler for exception: java.io.IOException THERE IS NO LOG MESSAGE!!! Line: 275, File: org/apache/spark/util/Utils.scala try { dir = new File(root, spark- + UUID.randomUUID.toString) if (dir.exists() || !dir.mkdirs()) { dir = null } } catch { case e: IOException = ; } == == WARNING 1: empty handler for exception: java.lang.InterruptedException THERE IS NO LOG MESSAGE!!! Line: 172, File: parquet/org/apache/thrift/server/TNonblockingServer.java protected void joinSelector() { // wait until the selector thread exits try { selectThread_.join(); } catch (InterruptedException e) { // for now, just silently ignore. technically this means we'll have less of // a graceful shutdown as a result. } } == == WARNING 2: empty handler for exception: java.net.SocketException There are log messages.. Line: 111, File: parquet/org/apache/thrift/transport/TNonblockingSocket.java public void setTimeout(int timeout) { try { socketChannel_.socket().setSoTimeout(timeout); } catch (SocketException sx) { LOGGER.warn(Could not set socket timeout., sx); } } == == WARNING 3: empty handler for exception: java.net.SocketException There are log messages.. 
Line: 103, File: parquet/org/apache/thrift/transport/TServerSocket.java public void listen() throws TTransportException { // Make sure not to block on accept if (serverSocket_ != null) { try { serverSocket_.setSoTimeout(0); } catch (SocketException sx) { LOGGER.error(Could not set socket timeout., sx); } } } == == WARNING 4: empty handler for exception: java.net.SocketException There are log messages.. Line: 70, File:
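As a hedged illustration of the kind of change these warnings point toward (not a patch to the files above), here is the Utils.scala temp-directory case reworked so the swallowed IOException is at least reported; the logging callback is a stand-in for a real logger.
{code}
// Sketch only: mirrors the Utils.scala handler quoted above.
import java.io.{File, IOException}
import java.util.UUID

object TempDirSketch {
  def createSparkTempDir(root: String, log: String => Unit): Option[File] =
    try {
      val dir = new File(root, "spark-" + UUID.randomUUID.toString)
      // Succeed only if the directory did not already exist and could be created.
      if (!dir.exists() && dir.mkdirs()) Some(dir) else None
    } catch {
      case e: IOException =>
        // The original `case e: IOException => ;` dropped the failure silently.
        log(s"Failed to create temp directory under $root: ${e.getMessage}")
        None
    }

  def main(args: Array[String]): Unit =
    println(createSparkTempDir(System.getProperty("java.io.tmpdir"), (s: String) => println(s)))
}
{code}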
[jira] [Commented] (SPARK-1148) Suggestions for exception handling (avoid potential bugs)
[ https://issues.apache.org/jira/browse/SPARK-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248707#comment-14248707 ] Ding Yuan commented on SPARK-1148: -- Glad to know these cases are addressed! As for version 0.9.0, I have reported all the cases (only three of them). I have opened another JIRA: https://issues.apache.org/jira/browse/SPARK-4863 to report the warnings on spark-1.1.1. Suggestions for exception handling (avoid potential bugs) - Key: SPARK-1148 URL: https://issues.apache.org/jira/browse/SPARK-1148 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Ding Yuan Hi Spark developers, We are a group of researchers on software reliability. Recently we did a study and found that majority of the most severe failures in data-analytic systems are caused by bugs in exception handlers – that it is hard to anticipate all the possible real-world error scenarios. Therefore we built a simple checking tool that automatically detects some bug patterns that have caused some very severe real-world failures. I am reporting a few cases here. Any feedback is much appreciated! Ding = Case 1: Line: 1249, File: org/apache/spark/SparkContext.scala {noformat} 1244: val scheduler = try { 1245: val clazz = Class.forName(org.apache.spark.scheduler.cluster.YarnClusterScheduler) 1246: val cons = clazz.getConstructor(classOf[SparkContext]) 1247: cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl] 1248: } catch { 1249: // TODO: Enumerate the exact reasons why it can fail 1250: // But irrespective of it, it means we cannot proceed ! 1251: case th: Throwable = { 1252: throw new SparkException(YARN mode not available ?, th) 1253: } {noformat} The comment suggests the specific exceptions should be enumerated here. The try block could throw the following exceptions: ClassNotFoundException NegativeArraySizeException NoSuchMethodException SecurityException InstantiationException IllegalAccessException IllegalArgumentException InvocationTargetException ClassCastException == = Case 2: Line: 282, File: org/apache/spark/executor/Executor.scala {noformat} 265: case t: Throwable = { 266: val serviceTime = (System.currentTimeMillis() - taskStart).toInt 267: val metrics = attemptedTask.flatMap(t = t.metrics) 268: for (m - metrics) { 269: m.executorRunTime = serviceTime 270: m.jvmGCTime = gcTime - startGCTime 271: } 272: val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) 273: execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) 274: 275: // TODO: Should we exit the whole executor here? On the one hand, the failed task may 276: // have left some weird state around depending on when the exception was thrown, but on 277: // the other hand, maybe we could detect that when future tasks fail and exit then. 278: logError(Exception in task ID + taskId, t) 279: //System.exit(1) 280: } 281: } finally { 282: // TODO: Unregister shuffle memory only for ResultTask 283: val shuffleMemoryMap = env.shuffleMemoryMap 284: shuffleMemoryMap.synchronized { 285: shuffleMemoryMap.remove(Thread.currentThread().getId) 286: } 287: runningTasks.remove(taskId) 288: } {noformat} From the comment in this Throwable exception handler it seems to suggest that the system should just exit? 
== = Case 3: Line: 70, File: org/apache/spark/network/netty/FileServerHandler.java {noformat} 66: try { 67: ctx.write(new DefaultFileRegion(new FileInputStream(file) 68: .getChannel(), fileSegment.offset(), fileSegment.length())); 69: } catch (Exception e) { 70: LOG.error(Exception: , e); 71: } {noformat} Exception is too general. The try block only throws FileNotFoundException. Although there is nothing wrong with it now, but later if code evolves this might cause some other exceptions to be swallowed. == -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
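A small sketch of the narrower-catch suggestion in Case 3, written as standalone Scala rather than the actual Netty handler; the file path and the println stand-in for LOG.error are placeholders.
{code}
// Sketch: catch only what the body can actually throw, so that future, unforeseen
// exceptions are not silently handled by an over-broad catch-all.
import java.io.{FileInputStream, FileNotFoundException}
import java.nio.channels.FileChannel

object NarrowCatchSketch {
  def openChannel(path: String): Option[FileChannel] =
    try {
      Some(new FileInputStream(path).getChannel)
    } catch {
      case e: FileNotFoundException =>
        println(s"File not found: $path ($e)") // stand-in for LOG.error
        None
    }

  def main(args: Array[String]): Unit =
    println(openChannel("/no/such/file"))
}
{code}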
[jira] [Resolved] (SPARK-4437) Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader
[ https://issues.apache.org/jira/browse/SPARK-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4437. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3301 [https://github.com/apache/spark/pull/3301] Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader -- Key: SPARK-4437 URL: https://issues.apache.org/jira/browse/SPARK-4437 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Andrew Ash Assignee: Davies Liu Fix For: 1.3.0 Tracking per this dev@ thread: {quote} On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin r...@databricks.com wrote: I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain and maybe submit a patch to add more documentation on the whole thing? Thanks. On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com wrote: Hello Everyone, I am going through the source code of rdd and Record readers There are found 2 classes 1. WholeTextFileRecordReader 2. WholeCombineFileRecordReader ( extends CombineFileRecordReader ) The description of both the classes is perfectly similar. I am not able to understand why we have 2 classes. Is CombineFileRecordReader providing some extra advantage? Regards Vibhanshu {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4855) Python tests for hypothesis testing
[ https://issues.apache.org/jira/browse/SPARK-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4855. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3679 [https://github.com/apache/spark/pull/3679] Python tests for hypothesis testing --- Key: SPARK-4855 URL: https://issues.apache.org/jira/browse/SPARK-4855 Project: Spark Issue Type: Test Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Ben Cook Priority: Minor Fix For: 1.3.0 Add Python unit tests for Chi-Squared hypothesis testing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4839) Adding documentations about dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4839. Resolution: Duplicate Closing this as a duplicate Adding documentations about dynamic resource allocation --- Key: SPARK-4839 URL: https://issues.apache.org/jira/browse/SPARK-4839 Project: Spark Issue Type: Sub-task Components: Documentation, Spark Core, YARN Affects Versions: 1.2.0 Reporter: Tsuyoshi OZAWA Fix For: 1.2.0 There are no docs about dynamicAllocation. We should add them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit
[ https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248802#comment-14248802 ] Joseph K. Bradley commented on SPARK-4846: -- Changing vectorSize sounds too aggressive to me. I'd vote for either the simple solution (throw a nice error), or an efficient method which chooses minCount automatically. For the latter, this might work: 1. Try the current method. Catch errors during the collect() for vocab and during the array allocations. If there is no error, skip step 2. 2. If there is an error, do 1 pass over the data to collect stats (e.g., a histogram). Use those stats to choose a reasonable minCount. Choose the vocab again, etc. 3. After the big array allocations, the algorithm can continue as before. When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit --- Key: SPARK-4846 URL: https://issues.apache.org/jira/browse/SPARK-4846 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one partition. The corpus contains about 300 million words and its vocabulary size is about 10 million. Reporter: Joseph Tang Priority: Critical Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1242) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610) at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
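A rough sketch of step 2 of the comment above, under stated assumptions: the corpus is an RDD[Seq[String]] of tokenized sentences and maxVocabSize is a caller-chosen budget; neither appears in the issue itself.
{code}
// Sketch only: one extra pass to build a word-frequency histogram and derive a
// minCount that keeps the vocabulary roughly within a size budget.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object ChooseMinCountSketch {
  def chooseMinCount(corpus: RDD[Seq[String]], maxVocabSize: Long): Int = {
    // For each frequency value, how many distinct words occur exactly that often.
    val freqHistogram = corpus
      .flatMap(identity)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .map { case (_, freq) => (freq, 1L) }
      .reduceByKey(_ + _)
      .collect()
      .sortBy(-_._1) // most frequent buckets first

    // Walk down from the most frequent words until the budget is spent; the
    // frequency where we stop becomes minCount (intentionally approximate).
    var kept = 0L
    var minCount = 1L
    for ((freq, numWords) <- freqHistogram if kept < maxVocabSize) {
      kept += numWords
      minCount = freq
    }
    minCount.toInt
  }
}
{code}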
[jira] [Created] (SPARK-4864) Add documentation to Netty-based configs
Aaron Davidson created SPARK-4864: - Summary: Add documentation to Netty-based configs Key: SPARK-4864 URL: https://issues.apache.org/jira/browse/SPARK-4864 Project: Spark Issue Type: Bug Components: Documentation Reporter: Aaron Davidson Assignee: Aaron Davidson Currently there is no public documentation for the NettyBlockTransferService or various configuration options of the network package. We should add some. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4864) Add documentation to Netty-based configs
[ https://issues.apache.org/jira/browse/SPARK-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248820#comment-14248820 ] Apache Spark commented on SPARK-4864: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3713 Add documentation to Netty-based configs Key: SPARK-4864 URL: https://issues.apache.org/jira/browse/SPARK-4864 Project: Spark Issue Type: Bug Components: Documentation Reporter: Aaron Davidson Assignee: Aaron Davidson Currently there is no public documentation for the NettyBlockTransferService or various configuration options of the network package. We should add some. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3405) EC2 cluster creation on VPC
[ https://issues.apache.org/jira/browse/SPARK-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3405. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 2872 [https://github.com/apache/spark/pull/2872] EC2 cluster creation on VPC --- Key: SPARK-3405 URL: https://issues.apache.org/jira/browse/SPARK-3405 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.0.2 Environment: Ubuntu 12.04 Reporter: Dawson Reid Priority: Minor Fix For: 1.3.0 It would be very useful to be able to specify the EC2 VPC in which the Spark cluster should be created. When creating a Spark cluster on AWS via the spark-ec2 script there is no way to specify a VPC id of the VPC you would like the cluster to be created in. The script always creates the cluster in the default VPC. In my case I have deleted the default VPC and the spark-ec2 script errors out with the following : Setting up security groups... Creating security group test-master ERROR:boto:400 Bad Request ERROR:boto:?xml version=1.0 encoding=UTF-8? ResponseErrorsErrorCodeVPCIdNotSpecified/CodeMessageNo default VPC for this user/Message/Error/ErrorsRequestID312a2281-81a1-4d3c-ba10-0593a886779d/RequestID/Response Traceback (most recent call last): File ./spark_ec2.py, line 860, in module main() File ./spark_ec2.py, line 852, in main real_main() File ./spark_ec2.py, line 735, in real_main conn, opts, cluster_name) File ./spark_ec2.py, line 247, in launch_cluster master_group = get_or_make_group(conn, cluster_name + -master) File ./spark_ec2.py, line 143, in get_or_make_group return conn.create_security_group(name, Spark EC2 group) File /home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py, line 2011, in create_security_group File /home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py, line 925, in get_object boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request ?xml version=1.0 encoding=UTF-8? ResponseErrorsErrorCodeVPCIdNotSpecified/CodeMessageNo default VPC for this user/Message/Error/ErrorsRequestID312a2281-81a1-4d3c-ba10-0593a886779d/RequestID/Response -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2611) VPC Issue while creating an ec2 cluster
[ https://issues.apache.org/jira/browse/SPARK-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248836#comment-14248836 ] Josh Rosen commented on SPARK-2611: --- For the folks watching this issue: SPARK-3405 has now been merged into {{master}}! VPC Issue while creating an ec2 cluster --- Key: SPARK-2611 URL: https://issues.apache.org/jira/browse/SPARK-2611 Project: Spark Issue Type: Bug Components: EC2, PySpark Affects Versions: 1.0.0 Environment: Debian wheezy Reporter: Anass BENSRHIR I'm having a critical issue while creating an ec2 cluster, here is the input command and output I got:
./spark-ec2 -i ~/amazonhdp.pem -k amazonhdp -s 4 -t m1.small launch hadoopi
Setting up security groups...
Creating security group hadoopi-master
Creating security group hadoopi-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The security group 'sg-2ec1b84b' does not exist</Message></Error></Errors><RequestID>6554c7a8-f68a-4032-ad63-65106e2de9b3</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 856, in <module>
    main()
  File "./spark_ec2.py", line 848, in main
    real_main()
  File "./spark_ec2.py", line 731, in real_main
    conn, opts, cluster_name)
  File "./spark_ec2.py", line 252, in launch_cluster
    master_group.authorize('tcp', 50030, 50030, '0.0.0.0/0')
  File "/opt/spark-1.0.1-bin-hadoop2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py", line 184, in authorize
  File "/opt/spark-1.0.1-bin-hadoop2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py", line 2181, in authorize_security_group
  File "/opt/spark-1.0.1-bin-hadoop2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py", line 944, in get_status
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The security group 'sg-2ec1b84b' does not exist</Message></Error></Errors><RequestID>6554c7a8-f68a-4032-ad63-65106e2de9b3</RequestID></Response>
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4865) rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables
Misha Chernetsov created SPARK-4865: --- Summary: rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables Key: SPARK-4865 URL: https://issues.apache.org/jira/browse/SPARK-4865 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Misha Chernetsov -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
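A minimal sketch of the reported scenario, under assumptions (the exact way the Thrift server shares the context is not described in the ticket): a temp table registered in the driver's context is expected to show up in SHOW TABLES issued from a JDBC client, but reportedly does not.
{code}
import org.apache.spark.sql.hive.HiveContext

case class Person(name: String, age: Int)

val hiveContext = new HiveContext(sc) // sc: existing SparkContext, e.g. in spark-shell
import hiveContext._

// Register an RDD as a temporary table in this context.
sc.parallelize(Seq(Person("a", 1), Person("b", 2))).registerTempTable("people")

// Expectation in the ticket: a beeline/JDBC session against a Thrift server backed by
// the same context would list "people" when running SHOW TABLES, but it does not.
{code}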
[jira] [Resolved] (SPARK-4700) Add Http support to Spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4700. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3672 [https://github.com/apache/spark/pull/3672] Add Http support to Spark Thrift server --- Key: SPARK-4700 URL: https://issues.apache.org/jira/browse/SPARK-4700 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0, 1.2.1 Environment: Linux and Windows Reporter: Judy Nash Fix For: 1.3.0 Original Estimate: 48h Remaining Estimate: 48h Currently thrift only supports TCP connection. The JIRA is to add HTTP support to spark thrift server in addition to the TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more secure and used often in Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1148) Suggestions for exception handling (avoid potential bugs)
[ https://issues.apache.org/jira/browse/SPARK-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248941#comment-14248941 ] Sean Owen commented on SPARK-1148: -- [~d.yuan] I don't know that these have been addressed; some issues like it have been, if I recall correctly. Whatever the status, I think it's useful to focus on master and/or later branches. Would it be OK to close this in favor of SPARK-4863? Suggestions for exception handling (avoid potential bugs) - Key: SPARK-1148 URL: https://issues.apache.org/jira/browse/SPARK-1148 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Ding Yuan Hi Spark developers, We are a group of researchers on software reliability. Recently we did a study and found that majority of the most severe failures in data-analytic systems are caused by bugs in exception handlers – that it is hard to anticipate all the possible real-world error scenarios. Therefore we built a simple checking tool that automatically detects some bug patterns that have caused some very severe real-world failures. I am reporting a few cases here. Any feedback is much appreciated! Ding = Case 1: Line: 1249, File: org/apache/spark/SparkContext.scala {noformat} 1244: val scheduler = try { 1245: val clazz = Class.forName(org.apache.spark.scheduler.cluster.YarnClusterScheduler) 1246: val cons = clazz.getConstructor(classOf[SparkContext]) 1247: cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl] 1248: } catch { 1249: // TODO: Enumerate the exact reasons why it can fail 1250: // But irrespective of it, it means we cannot proceed ! 1251: case th: Throwable = { 1252: throw new SparkException(YARN mode not available ?, th) 1253: } {noformat} The comment suggests the specific exceptions should be enumerated here. The try block could throw the following exceptions: ClassNotFoundException NegativeArraySizeException NoSuchMethodException SecurityException InstantiationException IllegalAccessException IllegalArgumentException InvocationTargetException ClassCastException == = Case 2: Line: 282, File: org/apache/spark/executor/Executor.scala {noformat} 265: case t: Throwable = { 266: val serviceTime = (System.currentTimeMillis() - taskStart).toInt 267: val metrics = attemptedTask.flatMap(t = t.metrics) 268: for (m - metrics) { 269: m.executorRunTime = serviceTime 270: m.jvmGCTime = gcTime - startGCTime 271: } 272: val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) 273: execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) 274: 275: // TODO: Should we exit the whole executor here? On the one hand, the failed task may 276: // have left some weird state around depending on when the exception was thrown, but on 277: // the other hand, maybe we could detect that when future tasks fail and exit then. 278: logError(Exception in task ID + taskId, t) 279: //System.exit(1) 280: } 281: } finally { 282: // TODO: Unregister shuffle memory only for ResultTask 283: val shuffleMemoryMap = env.shuffleMemoryMap 284: shuffleMemoryMap.synchronized { 285: shuffleMemoryMap.remove(Thread.currentThread().getId) 286: } 287: runningTasks.remove(taskId) 288: } {noformat} From the comment in this Throwable exception handler it seems to suggest that the system should just exit? 
== = Case 3: Line: 70, File: org/apache/spark/network/netty/FileServerHandler.java {noformat} 66: try { 67: ctx.write(new DefaultFileRegion(new FileInputStream(file) 68: .getChannel(), fileSegment.offset(), fileSegment.length())); 69: } catch (Exception e) { 70: LOG.error(Exception: , e); 71: } {noformat} Exception is too general. The try block only throws FileNotFoundException. Although there is nothing wrong with it now, but later if code evolves this might cause some other exceptions to be swallowed. == -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
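For Case 1, the suggestion amounts to replacing the blanket {{Throwable}} handler with the specific reflection-related exceptions the block can throw. A hedged sketch of that shape (mirroring the quoted snippet, not the actual SparkContext code, and assuming it lives inside Spark where SparkContext, TaskSchedulerImpl, SparkException and sc are in scope):
{code}
import java.lang.reflect.InvocationTargetException

val scheduler =
  try {
    val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
    val cons = clazz.getConstructor(classOf[SparkContext])
    cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
  } catch {
    // Enumerate the reflective failures instead of swallowing every Throwable.
    case e @ (_: ClassNotFoundException |
              _: NoSuchMethodException |
              _: InstantiationException |
              _: IllegalAccessException |
              _: InvocationTargetException |
              _: ClassCastException) =>
      throw new SparkException("YARN mode not available ?", e)
  }
{code}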
[jira] [Created] (SPARK-4866) Support StructType as key in MapType
Davies Liu created SPARK-4866: - Summary: Support StructType as key in MapType Key: SPARK-4866 URL: https://issues.apache.org/jira/browse/SPARK-4866 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Davies Liu http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Applying-schema-to-a-dictionary-with-a-Tuple-as-key-td20716.html Hi Guys, Im running a spark cluster in AWS with Spark 1.1.0 in EC2 I am trying to convert a an RDD with tuple (u'string', int , {(int, int): int, (int, int): int}) to a schema rdd using the schema: {code} fields = [StructField('field1',StringType(),True), StructField('field2',IntegerType(),True), StructField('field3',MapType(StructType([StructField('field31',IntegerType(),True), StructField('field32',IntegerType(),True)]),IntegerType(),True),True) ] schema = StructType(fields) # generate the schemaRDD with the defined schema schemaRDD = sqc.applySchema(RDD, schema) {code} But when I add field3 to the schema, it throws an execption: {code} Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/rdd.py, line 1153, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /root/spark/python/pyspark/context.py, line 770, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 710, ip-172-31-29-120.ec2.internal): net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_map(Pickler.java:321) net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.dump(Pickler.java:95) net.razorvine.pickle.Pickler.dumps(Pickler.java:80) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:331) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at
[jira] [Resolved] (SPARK-1148) Suggestions for exception handling (avoid potential bugs)
[ https://issues.apache.org/jira/browse/SPARK-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1148. -- Resolution: Duplicate Suggestions for exception handling (avoid potential bugs) - Key: SPARK-1148 URL: https://issues.apache.org/jira/browse/SPARK-1148 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Ding Yuan Hi Spark developers, We are a group of researchers on software reliability. Recently we did a study and found that majority of the most severe failures in data-analytic systems are caused by bugs in exception handlers – that it is hard to anticipate all the possible real-world error scenarios. Therefore we built a simple checking tool that automatically detects some bug patterns that have caused some very severe real-world failures. I am reporting a few cases here. Any feedback is much appreciated! Ding = Case 1: Line: 1249, File: org/apache/spark/SparkContext.scala {noformat} 1244: val scheduler = try { 1245: val clazz = Class.forName(org.apache.spark.scheduler.cluster.YarnClusterScheduler) 1246: val cons = clazz.getConstructor(classOf[SparkContext]) 1247: cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl] 1248: } catch { 1249: // TODO: Enumerate the exact reasons why it can fail 1250: // But irrespective of it, it means we cannot proceed ! 1251: case th: Throwable = { 1252: throw new SparkException(YARN mode not available ?, th) 1253: } {noformat} The comment suggests the specific exceptions should be enumerated here. The try block could throw the following exceptions: ClassNotFoundException NegativeArraySizeException NoSuchMethodException SecurityException InstantiationException IllegalAccessException IllegalArgumentException InvocationTargetException ClassCastException == = Case 2: Line: 282, File: org/apache/spark/executor/Executor.scala {noformat} 265: case t: Throwable = { 266: val serviceTime = (System.currentTimeMillis() - taskStart).toInt 267: val metrics = attemptedTask.flatMap(t = t.metrics) 268: for (m - metrics) { 269: m.executorRunTime = serviceTime 270: m.jvmGCTime = gcTime - startGCTime 271: } 272: val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics) 273: execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason)) 274: 275: // TODO: Should we exit the whole executor here? On the one hand, the failed task may 276: // have left some weird state around depending on when the exception was thrown, but on 277: // the other hand, maybe we could detect that when future tasks fail and exit then. 278: logError(Exception in task ID + taskId, t) 279: //System.exit(1) 280: } 281: } finally { 282: // TODO: Unregister shuffle memory only for ResultTask 283: val shuffleMemoryMap = env.shuffleMemoryMap 284: shuffleMemoryMap.synchronized { 285: shuffleMemoryMap.remove(Thread.currentThread().getId) 286: } 287: runningTasks.remove(taskId) 288: } {noformat} From the comment in this Throwable exception handler it seems to suggest that the system should just exit? == = Case 3: Line: 70, File: org/apache/spark/network/netty/FileServerHandler.java {noformat} 66: try { 67: ctx.write(new DefaultFileRegion(new FileInputStream(file) 68: .getChannel(), fileSegment.offset(), fileSegment.length())); 69: } catch (Exception e) { 70: LOG.error(Exception: , e); 71: } {noformat} Exception is too general. The try block only throws FileNotFoundException. 
Although there is nothing wrong with it now, but later if code evolves this might cause some other exceptions to be swallowed. == -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4847) extraStrategies cannot take effect in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4847. - Resolution: Fixed Fix Version/s: 1.2.1 Issue resolved by pull request 3698 [https://github.com/apache/spark/pull/3698] extraStrategies cannot take effect in SQLContext Key: SPARK-4847 URL: https://issues.apache.org/jira/browse/SPARK-4847 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Saisai Shao Fix For: 1.2.1 Because strategies is initialized when SparkPlanner is created, extraStrategies added later cannot be added into strategies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
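The failure mode is an initialization-order problem: the planner captures the strategies sequence eagerly when it is constructed, so mutations to extraStrategies made afterwards are invisible. A simplified, self-contained illustration (assumed names, not the actual SQLContext/SparkPlanner code):
{code}
// Captures `extra` eagerly at construction time.
class Planner(extra: Seq[String]) {
  val strategies: Seq[String] = extra ++ Seq("default1", "default2")
}

class Context {
  var extraStrategies: Seq[String] = Nil
  val planner = new Planner(extraStrategies) // extraStrategies is still Nil here
}

val ctx = new Context
ctx.extraStrategies = Seq("MyStrategy")
println(ctx.planner.strategies) // List(default1, default2) -- "MyStrategy" never takes effect
{code}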
[jira] [Resolved] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value
[ https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4812. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3660 [https://github.com/apache/spark/pull/3660] SparkPlan.codegenEnabled may be initialized to a wrong value Key: SPARK-4812 URL: https://issues.apache.org/jira/browse/SPARK-4812 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.3.0 The problem is that `codegenEnabled` is a `val`, but it uses a `val` `sqlContext`, which can be overridden by subclasses. Here is a simple example to show this issue.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {
  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call the subclass's `sqlContext`, which has not yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar

scala>
{code}
We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
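A small sketch of the suggested direction: marking the member {{final}} rules out the override that triggers the surprise above (illustrative code, not the actual patch):
{code}
abstract class Foo2 {
  protected final val sqlContext = "Foo"

  // Always sees an initialized value now, since no subclass can override it.
  val codegenEnabled: Boolean = sqlContext != null
}

class Bar2 extends Foo2 {
  // override val sqlContext = "Bar"  // would no longer compile
}

println((new Bar2).codegenEnabled) // true
{code}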
[jira] [Resolved] (SPARK-4767) Add support for launching in a specified placement group to spark ec2 scripts.
[ https://issues.apache.org/jira/browse/SPARK-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4767. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3623 [https://github.com/apache/spark/pull/3623] Add support for launching in a specified placement group to spark ec2 scripts. -- Key: SPARK-4767 URL: https://issues.apache.org/jira/browse/SPARK-4767 Project: Spark Issue Type: Improvement Components: EC2 Reporter: holdenk Priority: Trivial Fix For: 1.3.0 Original Estimate: 0.5h Remaining Estimate: 0.5h The Spark EC2 scripts don't currently allow users to specify a placement group. We should add this. EC2 placement groups are described in http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html This should not require updating the mesos/spark repo since its self contained in spark_ec2.py. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4767) Add support for launching in a specified placement group to spark ec2 scripts.
[ https://issues.apache.org/jira/browse/SPARK-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4767: -- Assignee: Holden Karau Add support for launching in a specified placement group to spark ec2 scripts. -- Key: SPARK-4767 URL: https://issues.apache.org/jira/browse/SPARK-4767 Project: Spark Issue Type: Improvement Components: EC2 Reporter: holdenk Assignee: Holden Karau Priority: Trivial Fix For: 1.3.0 Original Estimate: 0.5h Remaining Estimate: 0.5h The Spark EC2 scripts don't currently allow users to specify a placement group. We should add this. EC2 placement groups are described in http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html This should not require updating the mesos/spark repo since its self contained in spark_ec2.py. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive
[ https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-4815. --- Resolution: Duplicate ThriftServer use only one SessionState to run sql using hive - Key: SPARK-4815 URL: https://issues.apache.org/jira/browse/SPARK-4815 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei The ThriftServer uses only one SessionState to run SQL through Hive, even though requests come from different Hive sessions. This causes mistakes: for example, when one user runs `use database` in one beeline client, the current database changes in the other beeline clients too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4527) Add BroadcastNestedLoopJoin operator selection testsuite
[ https://issues.apache.org/jira/browse/SPARK-4527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4527. - Resolution: Fixed Fix Version/s: (was: 1.1.0) 1.3.0 Issue resolved by pull request 3395 [https://github.com/apache/spark/pull/3395] Add BroadcastNestedLoopJoin operator selection testsuite Key: SPARK-4527 URL: https://issues.apache.org/jira/browse/SPARK-4527 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Priority: Minor Fix For: 1.3.0 Original Estimate: 0.05h Remaining Estimate: 0.05h In `JoinSuite` add `BroadcastNestedLoopJoin` operator selection -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option
[ https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4837: --- Target Version/s: 1.3.0, 1.2.1 (was: 1.2.1) NettyBlockTransferService does not abide by spark.blockManager.port config option - Key: SPARK-4837 URL: https://issues.apache.org/jira/browse/SPARK-4837 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4867) UDF clean up
Michael Armbrust created SPARK-4867: --- Summary: UDF clean up Key: SPARK-4867 URL: https://issues.apache.org/jira/browse/SPARK-4867 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker Right now our support for and internal implementation of many functions have a few issues. Specifically:
- UDFs don't know their input types and thus don't do type coercion.
- We hard-code a bunch of built-in functions into the parser. This is bad because in SQL it creates new reserved words for things that aren't actually keywords. It also means that for each function we need to add support to both SQLContext and HiveContext separately.
For this JIRA I propose we do the following:
- Change the interfaces for registerFunction and ScalaUdf to include types for the input arguments as well as the output type.
- Add a rule to analysis that does type coercion for UDFs.
- Add a parse rule for functions to SQLParser.
- Rewrite all the UDFs that are currently hacked into the various parsers using this new functionality.
Depending on how big this refactoring becomes, we could split parts 1 and 2 from part 3 above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
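As a rough illustration of the first proposed change, a typed registration could carry the argument and return types alongside the function so the analyzer can insert coercions. All names and signatures below are hypothetical sketches of the proposal, not Spark's actual API:
{code}
// Stand-in type tags; the real implementation would use Spark SQL's DataType hierarchy.
sealed trait SqlType
case object StringType extends SqlType
case object IntegerType extends SqlType

// A UDF that knows its input types (enabling coercion) and its output type.
case class TypedScalaUdf(name: String,
                         inputTypes: Seq[SqlType],
                         returnType: SqlType,
                         func: AnyRef)

def registerFunction(name: String,
                     inputTypes: Seq[SqlType],
                     returnType: SqlType,
                     func: AnyRef): TypedScalaUdf =
  TypedScalaUdf(name, inputTypes, returnType, func)

// e.g. a string-length UDF whose argument the analyzer could coerce to StringType
val strlen = registerFunction("strlen", Seq(StringType), IntegerType, (s: String) => s.length)
{code}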
[jira] [Resolved] (SPARK-4483) Optimization about reduce memory costs during the HashOuterJoin
[ https://issues.apache.org/jira/browse/SPARK-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4483. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3375 [https://github.com/apache/spark/pull/3375] Optimization about reduce memory costs during the HashOuterJoin --- Key: SPARK-4483 URL: https://issues.apache.org/jira/browse/SPARK-4483 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian Priority: Minor Fix For: 1.3.0 In {{HashOuterJoin.scala}}, Spark reads data from both sides of the join operation before zipping them together, which wastes memory. We are trying to read data from only one side, put it into a hash map, and then generate the {{JoinedRow}}s with data from the other side one row at a time. Currently, we can only do this optimization for {{left outer join}} and {{right outer join}}. For {{full outer join}}, we will do something in another issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
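Conceptually, for a left outer join the optimization hashes only the right (build) side and streams the left (probe) side, emitting joined rows one at a time instead of buffering both inputs. A hedged sketch using plain Scala collections in place of Spark's internal row iterators:
{code}
def leftOuterJoin[K, L, R](left: Iterator[(K, L)],
                           right: Iterator[(K, R)]): Iterator[(K, (L, Option[R]))] = {
  // Build phase: materialize only one side into a hash map.
  val buildSide: Map[K, Seq[R]] =
    right.toSeq.groupBy(_._1).mapValues(_.map(_._2)).toMap

  // Probe phase: stream the other side and generate joined rows one by one.
  left.flatMap { case (k, l) =>
    buildSide.get(k) match {
      case Some(rs) => rs.iterator.map(r => (k, (l, Option(r))))
      case None     => Iterator((k, (l, Option.empty[R])))
    }
  }
}

// leftOuterJoin(Iterator(1 -> "a", 2 -> "b"), Iterator(1 -> "x")).foreach(println)
// (1,(a,Some(x)))
// (2,(b,None))
{code}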
[jira] [Resolved] (SPARK-4827) Max iterations (100) reached for batch Resolution with deeply nested projects and project *s
[ https://issues.apache.org/jira/browse/SPARK-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4827. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3674 [https://github.com/apache/spark/pull/3674] Max iterations (100) reached for batch Resolution with deeply nested projects and project *s Key: SPARK-4827 URL: https://issues.apache.org/jira/browse/SPARK-4827 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4269) Make wait time in BroadcastHashJoin configurable
[ https://issues.apache.org/jira/browse/SPARK-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4269. - Resolution: Fixed Fix Version/s: (was: 1.2.0) 1.3.0 Issue resolved by pull request 3133 [https://github.com/apache/spark/pull/3133] Make wait time in BroadcastHashJoin configurable Key: SPARK-4269 URL: https://issues.apache.org/jira/browse/SPARK-4269 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Fix For: 1.3.0 In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
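The change would amount to reading the wait time from configuration instead of hard-coding five minutes. A hedged sketch of the idea; the key name and the surrounding names (sqlContext, broadcastFuture) are assumptions about the operator's internals, not the merged code:
{code}
import scala.concurrent.Await
import scala.concurrent.duration._

// Default stays at 5 minutes, but users can override it, e.g. on congested networks.
val timeoutSeconds = sqlContext.sparkContext.getConf
  .getInt("spark.sql.broadcastTimeout", 5 * 60) // assumed key name

// broadcastFuture: the existing future that broadcasts the small table.
val broadcastRelation = Await.result(broadcastFuture, timeoutSeconds.seconds)
{code}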
[jira] [Commented] (SPARK-4777) Some block memory after unrollSafely not count into used memory(memoryStore.entrys or unrollMemory)
[ https://issues.apache.org/jira/browse/SPARK-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249330#comment-14249330 ] SuYan commented on SPARK-4777: -- Sean Owen, hi. I intended to close that patch, but after discussing with the SPARK-3000 author, he says the current SPARK-3000 work has not been merged into Spark yet, so the problem is still present in the current code; let's keep my patch open for now. If there is no need to keep it open, tell me and I will close it, thanks. Some block memory after unrollSafely not count into used memory(memoryStore.entrys or unrollMemory) --- Key: SPARK-4777 URL: https://issues.apache.org/jira/browse/SPARK-4777 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: SuYan Priority: Minor Some memory is not counted into the memory used by memoryStore or unrollMemory. After thread A unrolls a block safely, it releases 40MB of unrollMemory (which can then be used by other threads). Thread A then waits to acquire accountingLock in order to tryToPut blockA (30MB). Before thread A acquires accountingLock, blockA's memory size is not counted into unrollMemory or memoryStore.currentMemory. IIUC, freeMemory should subtract that block's memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl
[ https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249369#comment-14249369 ] Apache Spark commented on SPARK-3955: - User 'jongyoul' has created a pull request for this issue: https://github.com/apache/spark/pull/3716 Different versions between jackson-mapper-asl and jackson-core-asl -- Key: SPARK-3955 URL: https://issues.apache.org/jira/browse/SPARK-3955 Project: Spark Issue Type: Bug Components: Build, Spark Core, SQL Affects Versions: 1.1.0 Reporter: Jongyoul Lee The parent pom.xml specifies a version of jackson-mapper-asl, which is used by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl version does not match the jackson-core-asl version. This is because other libraries use several versions of jackson, so a different version of jackson-core-asl ends up in the assembly. The problem can simply be fixed by giving pom.xml specific version information for jackson-core-asl as well. If it's not set, version 1.9.11 is merged into assembly.jar and we cannot use the jackson library properly.
{code}
[INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the shaded jar.
[INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the shaded jar.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249381#comment-14249381 ] Victor Tso commented on SPARK-4105: --- Similarly, I hit this in 1.1.1: Job aborted due to stage failure: Task 0 in stage 3919.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3919.0 (TID 2467, preprod-e1-sw22s.paxata.com): java.io.IOException: failed to uncompress the chunk: FAILED_TO_UNCOMPRESS(5) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:384) java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293) java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586) java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1171) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:48) org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. 
Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at
[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249381#comment-14249381 ] Victor Tso edited comment on SPARK-4105 at 12/17/14 3:11 AM: - Similarly, I hit this in 1.1.1: {code} Job aborted due to stage failure: Task 0 in stage 3919.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3919.0 (TID 2467, ...): java.io.IOException: failed to uncompress the chunk: FAILED_TO_UNCOMPRESS(5) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:384) java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293) java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586) java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1171) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:48) org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {code} was (Author: silvermast): Similarly, I hit this in 1.1.1: Job aborted due to stage failure: Task 0 in stage 3919.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3919.0 (TID 2467, preprod-e1-sw22s.paxata.com): java.io.IOException: failed to uncompress the chunk: FAILED_TO_UNCOMPRESS(5) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:384) java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293) java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586) java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
[jira] [Created] (SPARK-4868) Twitter DStream.map() throws Task not serializable
Nicholas Chammas created SPARK-4868: --- Summary: Twitter DStream.map() throws Task not serializable Key: SPARK-4868 URL: https://issues.apache.org/jira/browse/SPARK-4868 Project: Spark Issue Type: Bug Components: Spark Shell, Streaming Affects Versions: 1.1.1 Environment: * Spark 1.1.1 * EC2 cluster with 1 slave spun up using {{spark-ec2}} * twitter4j 3.0.3 * {{spark-shell}} called with {{--jars}} argument to load {{spark-streaming-twitter_2.10-1.0.0.jar}} as well as all the twitter4j jars. Reporter: Nicholas Chammas Priority: Minor _(Continuing the discussion [started here on the Spark user list|http://apache-spark-user-list.1001560.n3.nabble.com/NotSerializableException-in-Spark-Streaming-td5725.html].)_ The following Spark Streaming code throws a serialization exception I do not understand. {code} import twitter4j.auth.{Authorization, OAuthAuthorization} import twitter4j.conf.ConfigurationBuilder import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext} import org.apache.spark.streaming.twitter.TwitterUtils def getAuth(): Option[Authorization] = { System.setProperty(twitter4j.oauth.consumerKey, consumerKey) System.setProperty(twitter4j.oauth.consumerSecret, consumerSecret) System.setProperty(twitter4j.oauth.accessToken, accessToken) System.setProperty(twitter4j.oauth.accessTokenSecret, accessTokenSecret) Some(new OAuthAuthorization(new ConfigurationBuilder().build())) } def noop(a: Any): Any = { a } val ssc = new StreamingContext(sc, Seconds(5)) val liveTweetObjects = TwitterUtils.createStream(ssc, getAuth()) val liveTweets = liveTweetObjects.map(_.getText) liveTweets.map(t = noop(t)).print() // exception here ssc.start() {code} So before I even start the StreamingContext, I get the following stack trace: {code} scala liveTweets.map(t = noop(t)).print() org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1264) at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438) at $iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC.init(console:32) at $iwC$$iwC.init(console:34) at $iwC.init(console:36) at init(console:38) at .init(console:42) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329) at
[jira] [Commented] (SPARK-4868) Twitter DStream.map() throws Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249387#comment-14249387 ] Nicholas Chammas commented on SPARK-4868: - cc [~tdas], [~adav] Twitter DStream.map() throws Task not serializable Key: SPARK-4868 URL: https://issues.apache.org/jira/browse/SPARK-4868 Project: Spark Issue Type: Bug Components: Spark Shell, Streaming Affects Versions: 1.1.1 Environment: * Spark 1.1.1 * EC2 cluster with 1 slave spun up using {{spark-ec2}} * twitter4j 3.0.3 * {{spark-shell}} called with {{--jars}} argument to load {{spark-streaming-twitter_2.10-1.0.0.jar}} as well as all the twitter4j jars. Reporter: Nicholas Chammas Priority: Minor _(Continuing the discussion [started here on the Spark user list|http://apache-spark-user-list.1001560.n3.nabble.com/NotSerializableException-in-Spark-Streaming-td5725.html].)_ The following Spark Streaming code throws a serialization exception I do not understand. {code} import twitter4j.auth.{Authorization, OAuthAuthorization} import twitter4j.conf.ConfigurationBuilder import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext} import org.apache.spark.streaming.twitter.TwitterUtils def getAuth(): Option[Authorization] = { System.setProperty(twitter4j.oauth.consumerKey, consumerKey) System.setProperty(twitter4j.oauth.consumerSecret, consumerSecret) System.setProperty(twitter4j.oauth.accessToken, accessToken) System.setProperty(twitter4j.oauth.accessTokenSecret, accessTokenSecret) Some(new OAuthAuthorization(new ConfigurationBuilder().build())) } def noop(a: Any): Any = { a } val ssc = new StreamingContext(sc, Seconds(5)) val liveTweetObjects = TwitterUtils.createStream(ssc, getAuth()) val liveTweets = liveTweetObjects.map(_.getText) liveTweets.map(t = noop(t)).print() // exception here ssc.start() {code} So before I even start the StreamingContext, I get the following stack trace: {code} scala liveTweets.map(t = noop(t)).print() org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1264) at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438) at $iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC.init(console:32) at $iwC$$iwC.init(console:34) at $iwC.init(console:36) at init(console:38) at .init(console:42) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
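A common workaround for REPL closure-capture failures of this shape (offered here as a hedged suggestion, not necessarily the eventual resolution of this ticket) is to define the helper in a top-level serializable object, so the closure does not drag in the enclosing interpreter line object:
{code}
object TweetFunctions extends Serializable {
  def noop(a: Any): Any = a
}

val liveTweets = liveTweetObjects.map(_.getText)
liveTweets.map(TweetFunctions.noop).print() // no REPL wrapper classes captured in the closure
{code}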
[jira] [Created] (SPARK-4869) The variable names in IF statement of SQL doesn't resolve to its value.
Ajay created SPARK-4869: --- Summary: The variable names in IF statement of SQL doesn't resolve to its value. Key: SPARK-4869 URL: https://issues.apache.org/jira/browse/SPARK-4869 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Ajay Priority: Blocker We got stuck with “IF-THEN” statement in Spark SQL. As per our usecase, we have to have nested “if” statements. But, spark sql is not able to resolve the variable names in final evaluation but the literal values are working. Please fix this bug. This works: sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,1) as ROLL_BACKWARD FROM OUTER_RDD) This doesn’t : sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.
[ https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay updated SPARK-4869: Summary: The variable names in IF statement of Spark SQL doesn't resolve to its value. (was: The variable names in IF statement of SQL doesn't resolve to its value. ) The variable names in IF statement of Spark SQL doesn't resolve to its value. -- Key: SPARK-4869 URL: https://issues.apache.org/jira/browse/SPARK-4869 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Ajay Priority: Blocker We got stuck with “IF-THEN” statement in Spark SQL. As per our usecase, we have to have nested “if” statements. But, spark sql is not able to resolve the variable names in final evaluation but the literal values are working. Please fix this bug. This works: sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,1) as ROLL_BACKWARD FROM OUTER_RDD) This doesn’t : sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.
[ https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay updated SPARK-4869: Description: We got stuck with “IF-THEN” statement in Spark SQL. As per our usecase, we have to have nested “if” statements. But, spark sql is not able to resolve the variable names in final evaluation but the literal values are working. An Unresolved Attributes error is being thrown. Please fix this bug. This works: sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,1) as ROLL_BACKWARD FROM OUTER_RDD) This doesn’t : sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD) was: We got stuck with “IF-THEN” statement in Spark SQL. As per our usecase, we have to have nested “if” statements. But, spark sql is not able to resolve the variable names in final evaluation but the literal values are working. Please fix this bug. This works: sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,1) as ROLL_BACKWARD FROM OUTER_RDD) This doesn’t : sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD) The variable names in IF statement of Spark SQL doesn't resolve to its value. -- Key: SPARK-4869 URL: https://issues.apache.org/jira/browse/SPARK-4869 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1 Reporter: Ajay Priority: Blocker We got stuck with “IF-THEN” statement in Spark SQL. As per our usecase, we have to have nested “if” statements. But, spark sql is not able to resolve the variable names in final evaluation but the literal values are working. An Unresolved Attributes error is being thrown. Please fix this bug. This works: sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,1) as ROLL_BACKWARD FROM OUTER_RDD) This doesn’t : sqlSC.sql(SELECT DISTINCT UNIT, PAST_DUE ,IF( PAST_DUE = 'CURRENT_MONTH', 0,DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4744) Short Circuit evaluation for AND OR in code gen
[ https://issues.apache.org/jira/browse/SPARK-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4744. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3606 [https://github.com/apache/spark/pull/3606] Short Circuit evaluation for AND OR in code gen - Key: SPARK-4744 URL: https://issues.apache.org/jira/browse/SPARK-4744 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4798) Refactor Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4798. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3644 [https://github.com/apache/spark/pull/3644] Refactor Parquet test suites Key: SPARK-4798 URL: https://issues.apache.org/jira/browse/SPARK-4798 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: Cheng Lian Fix For: 1.3.0 Current {{ParquetQuerySuite}} implementation is too verbose and is hard to add new test cases. Would be good to refactor it to enable faster Parquet support iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4720) Remainder should also return null if the divider is 0.
[ https://issues.apache.org/jira/browse/SPARK-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4720. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3581 [https://github.com/apache/spark/pull/3581] Remainder should also return null if the divider is 0. -- Key: SPARK-4720 URL: https://issues.apache.org/jira/browse/SPARK-4720 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Fix For: 1.3.0 This is a follow-up of SPARK-4593. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
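A quick illustration of the intended semantics (assuming a registered table {{t}} with an integer column {{x}}): the remainder expression should evaluate to NULL when the divisor is 0, mirroring the divide-by-zero handling from SPARK-4593, rather than failing at runtime.
{code}
sqlContext.sql("SELECT x % 0 FROM t").collect().foreach(println)
// each row prints as [null] instead of raising an arithmetic error
{code}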
[jira] [Resolved] (SPARK-4866) Support StructType as key in MapType
[ https://issues.apache.org/jira/browse/SPARK-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4866. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3714 [https://github.com/apache/spark/pull/3714] Support StructType as key in MapType Key: SPARK-4866 URL: https://issues.apache.org/jira/browse/SPARK-4866 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Davies Liu Fix For: 1.3.0 http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Applying-schema-to-a-dictionary-with-a-Tuple-as-key-td20716.html Hi Guys, Im running a spark cluster in AWS with Spark 1.1.0 in EC2 I am trying to convert a an RDD with tuple (u'string', int , {(int, int): int, (int, int): int}) to a schema rdd using the schema: {code} fields = [StructField('field1',StringType(),True), StructField('field2',IntegerType(),True), StructField('field3',MapType(StructType([StructField('field31',IntegerType(),True), StructField('field32',IntegerType(),True)]),IntegerType(),True),True) ] schema = StructType(fields) # generate the schemaRDD with the defined schema schemaRDD = sqc.applySchema(RDD, schema) {code} But when I add field3 to the schema, it throws an execption: {code} Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/rdd.py, line 1153, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /root/spark/python/pyspark/context.py, line 770, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 710, ip-172-31-29-120.ec2.internal): net.razorvine.pickle.PickleException: couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number of arguments net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_map(Pickler.java:321) net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) net.razorvine.pickle.Pickler.save(Pickler.java:125) net.razorvine.pickle.Pickler.dump(Pickler.java:95) net.razorvine.pickle.Pickler.dumps(Pickler.java:80) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417) org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:331) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
[jira] [Commented] (SPARK-4140) Document the dynamic allocation feature
[ https://issues.apache.org/jira/browse/SPARK-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249499#comment-14249499 ] Tsuyoshi OZAWA commented on SPARK-4140: --- How about converting this issue into a subtask of SPARK-3174? That would make it easier to track. Document the dynamic allocation feature --- Key: SPARK-4140 URL: https://issues.apache.org/jira/browse/SPARK-4140 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or This blocks on SPARK-3795 and SPARK-3822. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
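Independent of where the ticket ends up, here is a hedged sketch of the kind of settings such documentation would need to cover. The property names below are assumed from the SPARK-3795/SPARK-3822 work on YARN dynamic allocation; treat them as an assumption, not as the finished docs.
{code}
// Hedged configuration sketch for dynamic executor allocation on YARN.
// Property names are assumptions; verify against the documentation once it is written.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  // the external shuffle service keeps shuffle files available when executors are removed
  .set("spark.shuffle.service.enabled", "true")
{code}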
[jira] [Resolved] (SPARK-4618) Make foreign DDL commands options case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-4618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4618. - Resolution: Fixed Make foreign DDL commands options case-insensitive -- Key: SPARK-4618 URL: https://issues.apache.org/jira/browse/SPARK-4618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.3.0 Make foreign DDL command options case-insensitive, so that the following command works:
```
create temporary table normal_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
  PATH '/xxx/data'
)
```
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
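A minimal sketch of the idea behind case-insensitive options, assuming (hypothetically) that the option map is normalized before lookup; this is illustrative only and not claimed to be the actual implementation in the linked change.
{code}
// Hedged sketch: lower-case DDL option keys once, so PATH, Path and path all resolve the same.
def normalizeOptions(options: Map[String, String]): Map[String, String] =
  options.map { case (key, value) => key.toLowerCase -> value }

val opts = normalizeOptions(Map("PATH" -> "/xxx/data"))
assert(opts.get("path") == Some("/xxx/data"))   // mixed-case OPTIONS now behave identically
{code}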
[jira] [Updated] (SPARK-4618) Make foreign DDL commands options case-insensitive
[ https://issues.apache.org/jira/browse/SPARK-4618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4618: Assignee: wangfei Make foreign DDL commands options case-insensitive -- Key: SPARK-4618 URL: https://issues.apache.org/jira/browse/SPARK-4618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Fix For: 1.3.0 Make foreign DDL command options case-insensitive, so that the following command works:
```
create temporary table normal_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
  PATH '/xxx/data'
)
```
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4870) Add version information to driver log
Zhang, Liye created SPARK-4870: -- Summary: Add version information to driver log Key: SPARK-4870 URL: https://issues.apache.org/jira/browse/SPARK-4870 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye The driver log doesn't include the Spark version; version information is important when testing different Spark versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4870) Add version information to driver log
[ https://issues.apache.org/jira/browse/SPARK-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249519#comment-14249519 ] Apache Spark commented on SPARK-4870: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/3717 Add version information to driver log - Key: SPARK-4870 URL: https://issues.apache.org/jira/browse/SPARK-4870 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Zhang, Liye The driver log doesn't include the Spark version; version information is important when testing different Spark versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
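A hedged sketch of what logging the version at driver startup could look like, assuming a version constant such as {{org.apache.spark.SPARK_VERSION}} is available on the classpath; the actual change is in the pull request linked above, and this snippet makes no claim to match it.
{code}
// Illustrative only: emit the Spark version once when the driver starts.
// SPARK_VERSION is assumed to be exposed by the Spark build; adjust if the constant differs.
import org.apache.spark.SPARK_VERSION
import org.slf4j.LoggerFactory

object VersionBanner {
  private val log = LoggerFactory.getLogger(getClass)

  def logVersion(): Unit = log.info(s"Running Spark version $SPARK_VERSION")
}
{code}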
[jira] [Commented] (SPARK-4868) Twitter DStream.map() throws Task not serializable
[ https://issues.apache.org/jira/browse/SPARK-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249582#comment-14249582 ] Nicholas Chammas commented on SPARK-4868: - Replacing {{noop()}} with this
{code}
object Test {
  def noop(a: Any): Any = { a }
}
{code}
and calling {{liveTweets.map(t => Test.noop(t)).print()}} yields a similar stack trace. Creating the streaming context as follows
{code}
@transient val ssc = new StreamingContext(sc, Seconds(5))
{code}
and trying either the original {{noop()}} or {{Test.noop()}} (with REPL restarts in between) yields a slightly more interesting trace:
{code}
scala> liveTweets.map(t => Test.noop(t)).print() // exception here
org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1264) at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27) at $iwC$$iwC$$iwC.<init>(<console>:32) at $iwC$$iwC.<init>(<console>:34) at $iwC.<init>(<console>:36) at <init>(<console>:38) at .<init>(<console>:42) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.twitter.TwitterInputDStream 
is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects. at org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:416) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927) at org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:403) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at
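For readers hitting the same error, here is a hedged sketch of the rewrite the exception message asks for; the names follow the snippets in the comment above and are illustrative, not a confirmed fix. The idea is to keep the function in a top-level object, so the closure does not drag in the REPL line object that holds the DStream, and to do the per-batch work on the RDDs handed out by {{foreachRDD}}.
{code}
import org.apache.spark.streaming.dstream.DStream
import twitter4j.Status

object TweetFunctions extends Serializable {
  def noop(a: Any): Any = a
}

// Operate on each batch RDD instead of mapping the DStream from the REPL line object.
def printTweets(liveTweets: DStream[Status]): Unit =
  liveTweets.foreachRDD { rdd =>
    rdd.map(TweetFunctions.noop).take(10).foreach(println)
  }
{code}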
[jira] [Created] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide
zhang jun wei created SPARK-4872: Summary: Provide sample format of training/test data in MLlib programming guide Key: SPARK-4872 URL: https://issues.apache.org/jira/browse/SPARK-4872 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.1 Reporter: zhang jun wei I suggest that the examples in the online MLlib programming guide use real-life data and list the translated data format for the model to consume. The problem blocking me is how to translate real-life data into a format which MLlib can understand correctly. Here is one sample: I want to use NaiveBayes to train and predict a tennis-play decision, and the original data is:

Weather | Temperature | Humidity | Wind => Decision to play tennis
Sunny   | Hot         | High     | No   => No
Sunny   | Hot         | High     | Yes  => No
Cloudy  | Normal      | Normal   | No   => Yes
Rainy   | Cold        | Normal   | Yes  => No

Now, from my understanding, one potential translation is:
1) put every feature value word into a line: Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
2) map them to numbers: 1 2 3 4 5 6 7 8 9 10
3) map decision labels to numbers: 0 - No, 1 - Yes
4) set the value to 1 if it appears, or 0 if not; for the above example, here is the data format for MLUtils.loadLibSVMFile to use:
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
Is this a correct understanding?

And another way I can imagine is:
1) put every feature name into a line: Weather Temperature Humidity Wind
2) map them to numbers: 1 2 3 4
3) map decision labels to numbers: 0 - No, 1 - Yes
4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 1, No to 2); for the above example, here is the data format for MLUtils.loadLibSVMFile to use:
0 1:1 2:1 3:1 4:2
0 1:1 2:1 3:1 4:1
1 1:2 2:2 3:2 4:2
0 1:3 2:3 3:2 4:1
But when I read the source code in NaiveBayes.scala, this does not seem correct, though I am not sure... So which data format translation is correct? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
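For what it is worth, here is a hedged sketch of the first (one-of-k) encoding described above, written directly against the MLlib API instead of a LIBSVM text file. It assumes a SparkContext named {{sc}} as in spark-shell, uses 0-based indices (Vectors.sparse is 0-based, while the LIBSVM text format shown above is 1-based), and is only an illustration, not an answer from the MLlib maintainers.
{code}
// Feature slots (0-based): 0=Sunny 1=Cloudy 2=Rainy 3=Hot 4=Normal(temp) 5=Cold
//                          6=High 7=Normal(humidity) 8=Wind:Yes 9=Wind:No
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.sparse(10, Array(0, 3, 6, 9), Array(1.0, 1.0, 1.0, 1.0))), // Sunny, Hot, High, no wind => No
  LabeledPoint(0.0, Vectors.sparse(10, Array(0, 3, 6, 8), Array(1.0, 1.0, 1.0, 1.0))), // Sunny, Hot, High, wind    => No
  LabeledPoint(1.0, Vectors.sparse(10, Array(1, 4, 7, 9), Array(1.0, 1.0, 1.0, 1.0))), // Cloudy, Normal, Normal    => Yes
  LabeledPoint(0.0, Vectors.sparse(10, Array(2, 5, 7, 8), Array(1.0, 1.0, 1.0, 1.0)))  // Rainy, Cold, Normal, wind => No
))

val model = NaiveBayes.train(training)
// Predict for: Sunny, Hot, High, no wind
println(model.predict(Vectors.sparse(10, Array(0, 3, 6, 9), Array(1.0, 1.0, 1.0, 1.0))))
{code}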
[jira] [Commented] (SPARK-3541) Improve ALS internal storage
[ https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249578#comment-14249578 ] Apache Spark commented on SPARK-3541: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3720 Improve ALS internal storage Key: SPARK-3541 URL: https://issues.apache.org/jira/browse/SPARK-3541 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Original Estimate: 96h Remaining Estimate: 96h The internal storage of ALS uses many small objects, which increases GC pressure and makes ALS difficult to scale to very large datasets, e.g., 50 billion ratings. In such cases, a full GC may take more than 10 minutes to finish. That is longer than the default heartbeat timeout, and hence executors will be removed under the default settings. We can use primitive arrays to reduce the number of objects significantly. This requires a big change to the ALS implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
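To make the object-count argument concrete, here is a hedged sketch of the storage idea (illustrative only, not the actual ALS code): replacing one small object per rating with parallel primitive arrays per block removes almost all of the per-entry objects the garbage collector has to trace.
{code}
// Object-per-entry layout: one JVM object (plus header and boxing) for each of billions of ratings.
case class RatingEntry(user: Int, product: Int, rating: Double)

// Columnar, primitive layout: three flat arrays per block, no per-entry objects.
class RatingBlock(val users: Array[Int],
                  val products: Array[Int],
                  val ratings: Array[Double]) {
  require(users.length == products.length && users.length == ratings.length)
  def size: Int = users.length
}

def toBlock(entries: Seq[RatingEntry]): RatingBlock =
  new RatingBlock(entries.map(_.user).toArray,
                  entries.map(_.product).toArray,
                  entries.map(_.rating).toArray)
{code}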