[jira] [Commented] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194376#comment-14194376 ] Cheng Lian commented on SPARK-4199: --- Hi [~huangjs], which version/commit are you using? Could you please provide, for example, a {{spark-shell}} session snippet that helps reproduce this issue? I just tried both 1.1.0 and the most recent master (https://github.com/apache/spark/tree/76386e1a23c55a58c0aeea67820aab2bac71b24b) with this under {{spark-shell}}: {code} import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.catalyst.types._ import java.sql.Date val sparkContext = sc import sparkContext._ val hiveContext = new HiveContext(sparkContext) import hiveContext._ sql("DROP TABLE IF EXISTS xxx") {code} The only ERROR log (which is expected) I found is: {code} 14/11/03 17:12:56 ERROR metadata.Hive: NoSuchObjectException(message:default.xxx table not found) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy16.get_table(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ... {code} And the DROP statement itself completed successfully.
Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql("DROP TABLE IF EXISTS some_table") The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: </PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0> 14/11/02 19:55:29 INFO Driver: <PERFLOG method=semanticAnalyze> 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-1442: -- Attachment: (was: Window Function.pdf) Add Window function support --- Key: SPARK-1442 URL: https://issues.apache.org/jira/browse/SPARK-1442 Project: Spark Issue Type: New Feature Components: SQL Reporter: Chengxiang Li Attachments: Window Function.pdf Similar to Hive, add window function support for Catalyst. https://issues.apache.org/jira/browse/HIVE-4197 https://issues.apache.org/jira/browse/HIVE-896 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-1442: -- Attachment: Window Function.pdf Add Window function support --- Key: SPARK-1442 URL: https://issues.apache.org/jira/browse/SPARK-1442 Project: Spark Issue Type: New Feature Components: SQL Reporter: Chengxiang Li Attachments: Window Function.pdf Similar to Hive, add window function support for Catalyst. https://issues.apache.org/jira/browse/HIVE-4197 https://issues.apache.org/jira/browse/HIVE-896
[jira] [Created] (SPARK-4203) Partition directories in random order when inserting into hive table
Matthew Taylor created SPARK-4203: - Summary: Partition directories in random order when inserting into hive table Key: SPARK-4203 URL: https://issues.apache.org/jira/browse/SPARK-4203 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Matthew Taylor When doing an insert into a Hive table with partitions, the folders written to the file system are in a random order instead of the order defined in table creation. It seems that the loadPartition method in Hive.java has a Map<String, String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. I have a patch and will open a PR for it.
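The ordering point above can be illustrated with a minimal sketch (illustrative only, not the patch itself): a LinkedHashMap iterates in insertion order, so a partition path built from it is deterministic, whereas a plain hash map gives no such guarantee.

```scala
import scala.collection.mutable

// A LinkedHashMap preserves the order the partition columns were inserted in
// (e.g. year, month, day), which is what an ordered partition spec needs.
val ordered = mutable.LinkedHashMap("year" -> "2014", "month" -> "11", "day" -> "03")

// Build the partition directory path from the map's iteration order.
val path = ordered.map { case (k, v) => s"$k=$v" }.mkString("/")
// With a LinkedHashMap this is always "year=2014/month=11/day=03";
// with a plain HashMap the segment order would be unspecified.
```

This is why passing an unordered `Map<String, String>` into an API that implicitly assumes insertion order produces randomly ordered partition directories.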
[jira] [Created] (SPARK-4204) Utils.exceptionString only return the information for the outermost exception
Shixiong Zhu created SPARK-4204: --- Summary: Utils.exceptionString only return the information for the outermost exception Key: SPARK-4204 URL: https://issues.apache.org/jira/browse/SPARK-4204 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor An exception may contain some inner exceptions, but Utils.exceptionString only returns the information for the outermost exception.
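The idea behind the fix can be sketched as follows (a hypothetical helper, not Spark's actual `Utils` code): walk the `getCause` chain and render every throwable, not just the outermost one.

```scala
// Hypothetical sketch: render an exception together with its full cause chain.
def fullExceptionString(t: Throwable): String = {
  val sb = new StringBuilder
  var cur: Throwable = t
  while (cur != null) {
    sb.append(cur.toString).append("\n")
    // Append the stack trace of this throwable in the usual "\tat ..." format.
    cur.getStackTrace.foreach(e => sb.append("\tat ").append(e).append("\n"))
    cur = cur.getCause
    if (cur != null) sb.append("Caused by: ")
  }
  sb.toString
}

val inner = new IllegalStateException("disk failure")
val outer = new RuntimeException("task failed", inner)
// fullExceptionString(outer) now mentions the inner "disk failure" cause as well.
```

Rendering only `outer` would lose the "disk failure" root cause, which is often the only actionable part of the log.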
[jira] [Commented] (SPARK-4204) Utils.exceptionString only return the information for the outermost exception
[ https://issues.apache.org/jira/browse/SPARK-4204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194507#comment-14194507 ] Apache Spark commented on SPARK-4204: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3073 Utils.exceptionString only return the information for the outermost exception - Key: SPARK-4204 URL: https://issues.apache.org/jira/browse/SPARK-4204 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor An exception may contain some inner exceptions, but Utils.exceptionString only returns the information for the outermost exception.
[jira] [Created] (SPARK-4205) Timestamp and Date objects with comparison operators
Marc Culler created SPARK-4205: -- Summary: Timestamp and Date objects with comparison operators Key: SPARK-4205 URL: https://issues.apache.org/jira/browse/SPARK-4205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Marc Culler Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Heller updated SPARK-2691: Attachment: spark-docker.patch Here is the patch for the changes needed to support Docker images in the fine-grained backend. The approach taken here was to just populate the DockerInfo of the container info if some properties were set in the properties file. This has no support for versions of Mesos which do not support Docker, so it is very incomplete. Additionally there is only support for image name and volumes. For volumes, you just provide a string, which takes a value similar in form to the argument to 'docker run -v', i.e. it is a comma-separated list of [host:]container[:mode] options. I think this is sufficient; it parallels the command line, and so should be familiar. Allow Spark on Mesos to be launched with Docker --- Key: SPARK-2691 URL: https://issues.apache.org/jira/browse/SPARK-2691 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesos Attachments: spark-docker.patch Currently, to launch Spark with Mesos one must upload a tarball and specify the executor URI to be passed in, which is to be downloaded on each slave or even on each execution, depending on whether coarse mode is used. We want to make Spark able to support launching executors via a Docker image, utilizing the recent Docker and Mesos integration work. With that integration, Spark can simply specify a Docker image and the options that are needed, and it should continue to work as-is.
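The `[host:]container[:mode]` volume string described above can be parsed with a small sketch like the following (the `Volume` case class and `parseVolumes` name are illustrative, not the patch's actual code; two-part entries are treated as host:container, as `docker run -v` does):

```scala
// Hypothetical parser for a comma-separated list of [host:]container[:mode]
// volume specs, mirroring the 'docker run -v' syntax.
case class Volume(hostPath: Option[String], containerPath: String, mode: String)

def parseVolumes(spec: String): Seq[Volume] =
  spec.split(",").toSeq.map { entry =>
    entry.split(":") match {
      case Array(c)       => Volume(None, c, "rw")          // container only
      case Array(h, c)    => Volume(Some(h), c, "rw")       // host:container
      case Array(h, c, m) => Volume(Some(h), c, m)          // host:container:mode
      case _              => sys.error(s"bad volume spec: $entry")
    }
  }
```

For example, `parseVolumes("/data,/host/logs:/logs,/host/cfg:/cfg:ro")` yields three volumes, the last mounted read-only, which matches what the same string would mean on the docker command line.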
[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194513#comment-14194513 ] Chris Heller edited comment on SPARK-2691 at 11/3/14 1:08 PM: -- Here is the patch for the changes needed to support Docker images in the fine-grained backend. The approach taken here was to just populate the DockerInfo of the container info if some properties were set in the properties file. This has no support for versions of Mesos which do not support Docker, so it is very incomplete. Additionally there is only support for image name and volumes. For volumes, you just provide a string, which takes a value similar in form to the argument to 'docker run -v', i.e. it is a comma-separated list of [host:]container[:mode] options. I think this is sufficient; it parallels the command line, and so should be familiar. I would suggest, for all options of the DockerInfo exposed, to mirror how those options are set on the docker command line. was (Author: chrisheller): Here is the patch for the changes needed to support Docker images in the fine-grained backend. The approach taken here was to just populate the DockerInfo of the container info if some properties were set in the properties file. This has no support for versions of Mesos which do not support Docker, so it is very incomplete. Additionally there is only support for image name and volumes. For volumes, you just provide a string, which takes a value similar in form to the argument to 'docker run -v', i.e. it is a comma-separated list of [host:]container[:mode] options. I think this is sufficient; it parallels the command line, and so should be familiar.
[jira] [Created] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it
Imran Rashid created SPARK-4206: --- Summary: BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it Key: SPARK-4206 URL: https://issues.apache.org/jira/browse/SPARK-4206 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194550#comment-14194550 ] Sean Owen commented on SPARK-4206: -- I think there was a discussion about this, and the consensus was that these aren't anything to worry about and can be info-level messages?
[jira] [Updated] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-4206: Description: When running in local mode, you often get log warning messages like: WARN storage.BlockManager: Block input-0-1415022975000 already exists on this machine; not re-adding it (e.g., try running the TwitterPopularTags example in local mode) I think these warning messages are pretty unsettling for a new user, and should be removed. If they are truly innocuous, they should be changed to logInfo, or maybe even logDebug. Or, if they might actually indicate a problem, we should find the root cause and fix it. I *think* the problem is caused by a replication level > 1 when running in local mode. In BlockManager.doPut, first the block is put locally: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692 and then, if the replication level is > 1, a request is sent out to replicate the block: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827 However, in local mode, there isn't anywhere else to replicate the block; the request comes back to the same node, which then issues the warning that the block has already been added. If that analysis is right, the easy fix would be to make sure replicationLevel = 1 in local mode. But it's a little disturbing that a replication request could result in an attempt to replicate on the same node -- and that if something is wrong, we only issue a warning and keep going. If this is really the culprit, then it might be worth taking a closer look at the logic of replication.
Environment: local mode, branch-1.1 master BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it - Key: SPARK-4206 URL: https://issues.apache.org/jira/browse/SPARK-4206 Project: Spark Issue Type: Bug Environment: local mode, branch-1.1 master Reporter: Imran Rashid Priority: Minor
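The "easy fix" suggested in the description can be sketched as a tiny clamp (names hypothetical, not Spark's actual BlockManager code): never request more copies than there are block managers to hold them, so local mode never replicates to itself.

```scala
// Hypothetical sketch: clamp the requested replication level to the number of
// available block managers (peers plus the local one). In local mode there are
// no peers, so any requested level collapses to 1 and no self-replication
// request is ever issued.
def effectiveReplication(requested: Int, peerCount: Int): Int =
  math.min(requested, peerCount + 1) // the local copy counts as one replica
```

With no peers, `effectiveReplication(2, 0)` is 1, which would suppress the warning path entirely; with three peers, a request for 2 copies is honored unchanged.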
[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194574#comment-14194574 ] Imran Rashid commented on SPARK-4206: - Thanks Sean -- sorry, I accidentally created the issue before fleshing it out; I've added more info now. Apologies if I missed the previous discussion -- I didn't find anything in JIRA or on spark-dev. Do you mind pointing me at it if you do find something? I had always assumed they were no big deal as well, but after taking a closer look, I'm not so sure. If my explanation is correct, at the very least we ought to be able to eliminate the warning entirely by setting the replication level = 1 in local mode.
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194576#comment-14194576 ] Andre Schumacher commented on SPARK-2620: - I also bumped into this issue (on Spark 1.1.0), and it is extremely annoying, even though it only affects the REPL. Is anybody actively working on resolving this? Given it's already a few months old: are there any blockers for making this work? Matei mentioned the way code is wrapped inside the REPL. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0 A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, which should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior.
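For contrast with the REPL behavior above: when the case class is defined in compiled code (rather than in the REPL, where each line is wrapped in a fresh outer class), structural equality and hashCode behave as expected and duplicate keys are merged. A minimal non-Spark sketch:

```scala
// In compiled code, case-class equality is structural, so grouping by a
// case-class key merges the two P("bob") entries -- exactly what reduceByKey
// fails to do in the REPL repro above.
case class P(name: String)

val ps = Seq(P("alice"), P("bob"), P("charly"), P("bob"))
val counts = ps.groupBy(identity).map { case (k, v) => (k, v.size) }
// counts has three keys, with P("bob") mapped to 2.
```

This suggests the bug is not in the case class machinery itself but in how the REPL's class wrapping interacts with serialization across partitions.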
[jira] [Created] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL
Ravindra Pesala created SPARK-4207: -- Summary: Query which has syntax like 'not like' is not working in Spark SQL Key: SPARK-4207 URL: https://issues.apache.org/jira/browse/SPARK-4207 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Queries that use 'not like' do not work in Spark SQL; the same query works in Spark HiveQL. {code} sql("SELECT * FROM records where value not like 'val%'") {code} The above query fails with the exception below: {code} Exception in thread "main" java.lang.RuntimeException: [1.39] failure: ``IN'' expected but `like' found SELECT * FROM records where value not like 'val%' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186) {code}
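The semantics the parser fails to accept can be shown with a small standalone helper (hypothetical, not Spark's parser code): SQL LIKE's `%` matches any run of characters and `_` matches exactly one, so `value not like 'val%'` should keep only rows whose value does not start with "val".

```scala
import java.util.regex.Pattern

// Hypothetical sketch of SQL LIKE matching: translate the LIKE pattern to a
// regex ('%' -> ".*", '_' -> ".", everything else quoted literally).
def sqlLike(value: String, pattern: String): Boolean = {
  val regex = pattern.flatMap {
    case '%' => ".*"
    case '_' => "."
    case c   => Pattern.quote(c.toString)
  }
  value.matches(regex)
}
```

Under these semantics, `sqlLike("val1", "val%")` holds while `sqlLike("other", "val%")` does not, so `NOT LIKE 'val%'` filters out the "val1" row; the bug is purely that the Spark SQL parser rejects the `not like` syntax, not that the semantics are ambiguous.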
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194625#comment-14194625 ] Eduardo Jimenez commented on SPARK-2691: Thanks! docker-cli format it is then, as I agree it's better. There might be some fields that are required by Mesos.proto but not required by Docker, and in those cases I'll stick to the Mesos requirements.
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194632#comment-14194632 ] Chris Heller commented on SPARK-2691: - That seems reasonable. In fact, the volumes field of a ContainerInfo is not part of the DockerInfo structure, but since there is only a DOCKER type of ContainerInfo at the moment, and since the volumes field is described perfectly by the 'docker run -v' syntax, it seems OK to repurpose it here.
[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194634#comment-14194634 ] Tom Arnfeld edited comment on SPARK-2691 at 11/3/14 3:42 PM: - Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat down to implement this. Is there a github pull request with this patch? It would also be really great if it were possible to specify extra environment variables to be given to the executor container. was (Author: tarnfeld): Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat down to implement this. Is there a github pull request with this patch?
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194634#comment-14194634 ] Tom Arnfeld commented on SPARK-2691: Thanks so much for the patches here [~ChrisHeller]! We'd literally just sat down to implement this. Is there a github pull request with this patch?
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194637#comment-14194637 ] Chris Heller commented on SPARK-2691: - +1 on passing in the environment -- great idea. There isn't a pull request at the moment; I didn't feel the patch was complete enough for that (the lack of support for coarse mode and the total disregard for pre-0.20 Mesos make the patch a little fragile) -- but I'll happily create one if you'd like. What is there has been in use on our cluster for a while now, and I would really love to have this be part of upstream.
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194646#comment-14194646 ] Tom Arnfeld commented on SPARK-2691: Awesome. We're keen to get Spark up and running on our cluster to share with Hadoop, so we are going to be working on this now. Would you mind if we took this patch, made it a bit more fully featured (env variables, pre-0.20 support, coarse mode), and opened a pull request to Spark? Just wondering how we can make this as frictionless as possible and not overlap with any work you're doing on the patch. We're very keen to get this ready and merged into the Spark master branch.
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194649#comment-14194649 ] Tom Arnfeld commented on SPARK-2691: Also [~ChrisHeller], if you're not shipping Spark as an executor URI, it'd be awesome if you were able to share the {{Dockerfile}} (or the rough outline of it) you're using to build the Spark Docker image. That'd help us (and others, I'm sure) greatly.
[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194634#comment-14194634 ] Tom Arnfeld edited comment on SPARK-2691 at 11/3/14 3:57 PM: - Thanks so much for the patches here [~ChrisHeller]/[~yoeduardoj]! We'd literally just sat down to implement this. Is there a GitHub pull request with this patch? It would also be really great if it were possible to specify extra environment variables to be given to the executor container.
[jira] [Comment Edited] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194646#comment-14194646 ] Tom Arnfeld edited comment on SPARK-2691 at 11/3/14 3:56 PM: - Awesome. We're keen to get Spark up and running on our cluster to share with Hadoop, so we are going to be working on this now. Would anyone ([~yoeduardoj]?) mind if we took this patch, made it a bit more fully featured (env variables, pre-0.20 support, coarse mode), and opened a pull request to Spark? Just wondering how we can make this as frictionless as possible and not overlap with any work you're doing on the patch. We're very keen to get this ready and merged into the Spark master branch.
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194655#comment-14194655 ] Eduardo Jimenez commented on SPARK-2691: Go for it. It would be helpful if most of the possible Mesos arguments are supported (see ContainerInfo in https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto). One thing I was trying to do is create a set of common Mesos primitives, as the code to create ContainerInfo is going to be very similar for both fine-grained and coarse-grained mode. Just my $.02.
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194669#comment-14194669 ] Eduardo Jimenez commented on SPARK-2691: I would also say don't only support Docker. I was doing this in a way that makes it possible to mount volumes using a Mesos container as well (cgroups, I think), but I haven't actually tried it yet (I want to use the Mesos sandbox for Spark files, but at the same time I want to use multiple work directories to span several disks).
[jira] [Created] (SPARK-4208) stack over flow error while using sqlContext.sql
milq created SPARK-4208: --- Summary: stack over flow error while using sqlContext.sql Key: SPARK-4208 URL: https://issues.apache.org/jira/browse/SPARK-4208 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Environment: windows 7 , prebuilt spark-1.1.0-bin-hadoop2.3 Reporter: milq error happens when using sqlContext.sql 14/11/03 18:54:43 INFO BlockManager: Removing block broadcast_1 14/11/03 18:54:43 INFO MemoryStore: Block broadcast_1 of size 2976 dropped from memory (free 28010260 14/11/03 18:54:43 INFO ContextCleaner: Cleaned broadcast 1 root |-- firstName : string (nullable = true) |-- lastNameX: string (nullable = true) Exception in thread main java.lang.StackOverflowError at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
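The repeating frames in the trace above (Parsers$Parser$$anonfun$append$1.apply, over and over) are the classic signature of deep recursion in Scala's parser combinators exhausting the JVM call stack. As a generic illustration of that failure mode, unrelated to Spark's parser, here is deliberately unbounded-depth recursion in plain Java; the class and method names are purely illustrative:

```java
// Minimal demonstration of a JVM StackOverflowError from deep recursion,
// the same failure mode as the parser-combinator trace above. Java performs
// no tail-call optimization, so a sufficiently deep call chain always
// overflows the thread's stack.
public class DeepRecursion {

    /** Recurses n times; a very large n exhausts the default JVM stack. */
    static int depth(int n) {
        if (n == 0) return 0;
        return 1 + depth(n - 1);
    }

    public static void main(String[] args) {
        try {
            depth(100_000_000); // far deeper than the default stack allows
        } catch (StackOverflowError e) {
            System.out.println("caught StackOverflowError");
        }
    }
}
```

Increasing the driver thread's stack size (e.g. via -Xss) can mask such errors, but the usual real fix is to bound the recursion in the parser itself.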
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194679#comment-14194679 ] Chris Heller commented on SPARK-2691: - OK, here is the patch as a PR: https://github.com/apache/spark/pull/3074 [~tarnfeld] feel free to expand on this patch. I was looking at the code today and realized the coarse mode support should be trivial (just setting a ContainerInfo inside the TaskInfo created) -- it just cannot reuse the fine-grained code path in its current form, since that assumes passing an ExecutorInfo, but it could easily be generalized over a ContainerInfo instead. We are not shipping the Spark image as an executor URI; instead, Spark is bundled in the image. Just a stock Spark is needed in the image; a simple Dockerfile would look like the following (assuming you have a Spark tarball and libmesos in your directory with the Dockerfile): {noformat} FROM ubuntu RUN apt-get -y update RUN apt-get -y install default-jre-headless RUN apt-get -y install python2.7 ADD spark-1.1.0-bin-hadoop1.tgz / RUN mv /spark-1.1.0-bin-hadoop1 /spark COPY libmesos-0.20.1.so /usr/lib/libmesos.so ENV SPARK_HOME /spark ENV MESOS_JAVA_NATIVE_LIBRARY /usr/lib/libmesos.so CMD ps -ef {noformat} [~yoeduardoj] one awesome thing, which is actually beyond the scope of Docker support but still related to Mesos, would be the ability to configure which role and attributes in a Mesos offer are filtered by Spark -- but this is not directly relevant; I just wanted to bring it up while folks are digging into the Mesos backend code.
[jira] [Commented] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194714#comment-14194714 ] Apache Spark commented on SPARK-4207: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/3075 Query which has syntax like 'not like' is not working in Spark SQL -- Key: SPARK-4207 URL: https://issues.apache.org/jira/browse/SPARK-4207 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Assignee: Ravindra Pesala Queries which have 'not like' are not working in Spark SQL. The same works in Spark HiveQL. {code} sql(SELECT * FROM records where value not like 'val%') {code} The above query fails with the exception below: {code} Exception in thread main java.lang.RuntimeException: [1.39] failure: ``IN'' expected but `like' found SELECT * FROM records where value not like 'val%' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186) {code}
[jira] [Commented] (SPARK-4205) Timestamp and Date objects with comparison operators
[ https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194777#comment-14194777 ] Apache Spark commented on SPARK-4205: - User 'culler' has created a pull request for this issue: https://github.com/apache/spark/pull/3066 Timestamp and Date objects with comparison operators Key: SPARK-4205 URL: https://issues.apache.org/jira/browse/SPARK-4205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Marc Culler Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4209) Support UDT in UDF
Xiangrui Meng created SPARK-4209: Summary: Support UDT in UDF Key: SPARK-4209 URL: https://issues.apache.org/jira/browse/SPARK-4209 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng UDF doesn't recognize functions defined with UDTs. Before execution, an SQL internal datum should be converted to Scala types, and after execution, the result should be converted back to internal format (maybe this part is already done). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4206) BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it
[ https://issues.apache.org/jira/browse/SPARK-4206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194793#comment-14194793 ] Imran Rashid commented on SPARK-4206: - a little more confirmation: If I change this line https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala#L56 to use a storage level without replication, the warnings go away. I'd like to change it so that 1) when a storage level is initially requested, the user gets a warning if they request some impossible amount of replication b/c there aren't that many nodes in the cluster, and the storage level is auto-downgraded 2) the warning turns into an exception. (2) is a little scary / ambitious ... but if there is *another* cause for this, I'd like to find out rather than just have it get ignored again. BlockManager warnings in local mode: Block $blockId already exists on this machine; not re-adding it - Key: SPARK-4206 URL: https://issues.apache.org/jira/browse/SPARK-4206 Project: Spark Issue Type: Bug Environment: local mode, branch-1.1 master Reporter: Imran Rashid Priority: Minor When running in local mode, you often get log warning messages like: WARN storage.BlockManager: Block input-0-1415022975000 already exists on this machine; not re-adding it (e.g., try running the TwitterPopularTags example in local mode) I think these warning messages are pretty unsettling for a new user, and should be removed. If they are truly innocuous, they should be changed to logInfo, or maybe even logDebug. Or if they might actually indicate a problem, we should find the root cause and fix it. I *think* the problem is caused by a replication level > 1 when running in local mode.
In BlockManager.doPut, first the block is put locally: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L692 and then, if the replication level is > 1, a request is sent out to replicate the block: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L827 However, in local mode, there isn't anywhere else to replicate the block; the request comes back to the same node, which then issues the warning that the block has already been added. If that analysis is right, the easy fix would be to make sure the replication level = 1 in local mode. But, it's a little disturbing that a replication request could result in an attempt to replicate on the same node -- and that if something is wrong, we only issue a warning and keep going. If this really is the culprit, then it might be worth taking a closer look at the logic of replication.
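The auto-downgrade proposed in point (1) can be sketched in isolation. This is a hypothetical illustration, not Spark's actual BlockManager code; the method name, signature, and warning text are all made up:

```java
// Sketch of capping a requested replication level at what the cluster can
// actually provide, as suggested in point (1) above. Purely illustrative;
// not Spark's real API.
public class ReplicationCap {

    /**
     * Returns the effective replication level: the requested level, capped
     * at 1 + number of peer nodes (the local copy always counts for one).
     */
    static int effectiveReplication(int requested, int numPeers) {
        int max = 1 + numPeers; // local copy plus at most one per peer
        if (requested > max) {
            System.err.println("WARN: requested replication " + requested
                    + " exceeds cluster capacity; downgrading to " + max);
            return max;
        }
        return requested;
    }

    public static void main(String[] args) {
        // In local mode there are no peers, so replication 2 degrades to 1
        // instead of replicating back onto the same node.
        System.out.println(effectiveReplication(2, 0));
        System.out.println(effectiveReplication(2, 4));
    }
}
```

With such a cap in place, the replicate request would never be issued in local mode, and the "already exists on this machine" warning would disappear at the source.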
[jira] [Commented] (SPARK-4203) Partition directories in random order when inserting into hive table
[ https://issues.apache.org/jira/browse/SPARK-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194843#comment-14194843 ] Apache Spark commented on SPARK-4203: - User 'tbfenet' has created a pull request for this issue: https://github.com/apache/spark/pull/3076 Partition directories in random order when inserting into hive table Key: SPARK-4203 URL: https://issues.apache.org/jira/browse/SPARK-4203 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Matthew Taylor When doing an insert into a hive table with partitions, the folders written to the file system are in a random order instead of the order defined in table creation. It seems that the loadPartition method in Hive.java has a Map<String, String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. I have a patch which I will do a PR for.
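The ordering mismatch described above is easy to demonstrate with plain Java collections; the partition column names below are made up for illustration. A LinkedHashMap iterates keys in insertion order (matching the declared partition column order), whereas a plain HashMap makes no ordering guarantee:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionOrder {

    /** Returns the keys of the given partition spec in iteration order. */
    static List<String> keyOrder(Map<String, String> partitionSpec) {
        return new ArrayList<>(partitionSpec.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so the directory layout
        // year=.../month=.../day=... comes out in the declared column order.
        // A HashMap offers no such guarantee, hence the "random" folders.
        Map<String, String> ordered = new LinkedHashMap<>();
        ordered.put("year", "2014");
        ordered.put("month", "11");
        ordered.put("day", "03");
        System.out.println(keyOrder(ordered)); // insertion order preserved
    }
}
```

This is why passing a LinkedHashMap (or any insertion-ordered map) into an API typed as plain Map fixes the directory ordering without changing the method signature.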
[jira] [Commented] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)
[ https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194847#comment-14194847 ] Venkata Ramana G commented on SPARK-4201: - I found the same is working on the latest master, please confirm. Can't use concat() on partition column in where condition (Hive compatibility problem) -- Key: SPARK-4201 URL: https://issues.apache.org/jira/browse/SPARK-4201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.1.0 Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1 Reporter: dongxu Priority: Minor Labels: com The team used Hive to query; we are trying to move to Spark SQL. When I run a statement like select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929'; it doesn't work, but it works well in Hive. I have to rewrite the SQL as select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29. Here is the error log. 14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929'] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L] Exchange SinglePartition Aggregate true, [], [COUNT(1) AS PartialCount#1390L] HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929)) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415) at
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartition Aggregate true, [], [COUNT(1) AS PartialCount#1390L] HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929)) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128) at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46) ... 
16 more Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Aggregate true, [], [COUNT(1) AS PartialCount#1390L] HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929)) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
[jira] [Created] (SPARK-4210) Add Extra-Trees algorithm to MLlib
Vincent Botta created SPARK-4210: Summary: Add Extra-Trees algorithm to MLlib Key: SPARK-4210 URL: https://issues.apache.org/jira/browse/SPARK-4210 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Vincent Botta This task will add Extra-Trees support to Spark MLlib. The implementation could be inspired by the current Random Forest algorithm. This algorithm is expected to be particularly well suited, as sorting of attributes is not required, as opposed to the original Random Forest approach (with similar and/or better predictive power). The task involves: - Code implementation - Unit tests - Functional tests - Performance tests - Documentation
[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.0.0) 1.2.0 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.2.0) 1.3.0 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2938) Support SASL authentication in Netty network module
[ https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194944#comment-14194944 ] Reynold Xin commented on SPARK-2938: I think we are still going to add it to 1.2 (assuming the change is not too invasive - which it shouldn't be). It would be great if you could review it once the pull request is ready in the next couple of days. Support SASL authentication in Netty network module --- Key: SPARK-2938 URL: https://issues.apache.org/jira/browse/SPARK-2938 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin
[jira] [Commented] (SPARK-1070) Add check for JIRA ticket in the Github pull request title/summary with CI
[ https://issues.apache.org/jira/browse/SPARK-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194952#comment-14194952 ] Nicholas Chammas commented on SPARK-1070: - [~hsaputra] - The [Spark PR Board | https://spark-prs.appspot.com/] automatically parses JIRA ticket IDs in the PR titles. Does that address the need behind this request? cc [~joshrosen] Add check for JIRA ticket in the Github pull request title/summary with CI -- Key: SPARK-1070 URL: https://issues.apache.org/jira/browse/SPARK-1070 Project: Spark Issue Type: Task Components: Build Reporter: Henry Saputra Assignee: Mark Hamstra Priority: Minor As part of discussion in the dev@ list to add audit trail of Spark's Github pull requests (PR) to JIRA, need to add check maybe in the Jenkins CI to verify that the PRs contain JIRA ticket number in the title/ summary. There are maybe some PRs that may not need ticket so probably add support for some magic keyword to bypass the check. But this should be done in rare cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1070) Add check for JIRA ticket in the Github pull request title/summary with CI
[ https://issues.apache.org/jira/browse/SPARK-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194961#comment-14194961 ] Henry Saputra commented on SPARK-1070: -- [~nchammas], way back when Patrick proposed the right way to send PRs, there was a discussion about forcing PRs to have a JIRA ticket prefix in the summary. This ticket was filed to address that issue/idea. Add check for JIRA ticket in the Github pull request title/summary with CI -- Key: SPARK-1070 URL: https://issues.apache.org/jira/browse/SPARK-1070 Project: Spark Issue Type: Task Components: Build Reporter: Henry Saputra Assignee: Mark Hamstra Priority: Minor As part of discussion in the dev@ list to add audit trail of Spark's Github pull requests (PR) to JIRA, need to add check maybe in the Jenkins CI to verify that the PRs contain JIRA ticket number in the title/ summary. There are maybe some PRs that may not need ticket so probably add support for some magic keyword to bypass the check. But this should be done in rare cases.
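The check SPARK-1070 asks for is simple to prototype. A minimal sketch in Python (illustrative only; the `[NOJIRA]` bypass keyword is a hypothetical stand-in, since the ticket does not name one):

```python
import re

# SPARK-style JIRA id anywhere in the PR title, e.g. "[SPARK-1070] ...".
TICKET_RE = re.compile(r"\bSPARK-\d+\b")
BYPASS_KEYWORD = "[NOJIRA]"  # hypothetical magic keyword for the rare no-ticket case

def pr_title_ok(title: str) -> bool:
    """CI-style gate: pass when the title carries a JIRA id,
    or when the bypass keyword is present."""
    return bool(TICKET_RE.search(title)) or BYPASS_KEYWORD in title

pr_title_ok("[SPARK-1070] Add CI check")  # True
pr_title_ok("Fix typo")                   # False
pr_title_ok("[NOJIRA] Update readme")     # True
```

A Jenkins job could run this against the PR title and fail the build on `False`.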
[jira] [Created] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property
Fi created SPARK-4211: - Summary: Spark POM hive-0.13.1 profile sets incorrect hive version property Key: SPARK-4211 URL: https://issues.apache.org/jira/browse/SPARK-4211 Project: Spark Issue Type: Improvement Components: Build Reporter: Fi The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'. By default, it sets the maven property to `hive.version=0.13.1a`. This special hive version resolves dependency issues with Hive 0.13+. However, when explicitly specifying the hive-0.13.1 maven profile, the 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a', e.g. mvn -Phive -Phive-0.13.1 Also see: https://github.com/apache/spark/pull/2685
[jira] [Updated] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property
[ https://issues.apache.org/jira/browse/SPARK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fi updated SPARK-4211: -- Fix Version/s: 1.2.0 Spark POM hive-0.13.1 profile sets incorrect hive version property -- Key: SPARK-4211 URL: https://issues.apache.org/jira/browse/SPARK-4211 Project: Spark Issue Type: Improvement Components: Build Reporter: Fi Fix For: 1.2.0 The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'. By default, it sets the maven property to `hive.version=0.13.1a`. This special hive version resolves dependency issues with Hive 0.13+ However, when explicitly specifying the hive-0.13.1 maven profile, the 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a' e.g. mvn -Phive -Phive=0.13.1 Also see: https://github.com/apache/spark/pull/2685 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property
[ https://issues.apache.org/jira/browse/SPARK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194973#comment-14194973 ] Apache Spark commented on SPARK-4211: - User 'coderfi' has created a pull request for this issue: https://github.com/apache/spark/pull/3072 Spark POM hive-0.13.1 profile sets incorrect hive version property -- Key: SPARK-4211 URL: https://issues.apache.org/jira/browse/SPARK-4211 Project: Spark Issue Type: Improvement Components: Build Reporter: Fi Fix For: 1.2.0 The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'. By default, it sets the maven property to `hive.version=0.13.1a`. This special hive version resolves dependency issues with Hive 0.13+ However, when explicitly specifying the hive-0.13.1 maven profile, the 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a' e.g. mvn -Phive -Phive=0.13.1 Also see: https://github.com/apache/spark/pull/2685 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module
[ https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2938: --- Priority: Blocker (was: Major) Support SASL authentication in Netty network module --- Key: SPARK-2938 URL: https://issues.apache.org/jira/browse/SPARK-2938 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Aaron Davidson Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang closed SPARK-4199. Resolution: Invalid Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195045#comment-14195045 ] Jianshi Huang commented on SPARK-4199: -- Turned out it was caused by wrong version of datanucleus jars in my spark build directory. Somehow I have two versions of datanucleus... After removing the wrong version, now all works. Thanks! Jianshi Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at 
org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4212) Actor not found
Davies Liu created SPARK-4212: - Summary: Actor not found Key: SPARK-4212 URL: https://issues.apache.org/jira/browse/SPARK-4212 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Davies Liu tried to run a PySpark test, but it hanged: NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 14/11/03 12:32:58 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@dm:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: dm/192.168.1.11:7077 14/11/03 12:32:58 ERROR OneForOneStrategy: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@dm:7077/), Path(/user/HeartbeatReceiver)] akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@dm:7077/), Path(/user/HeartbeatReceiver)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508) at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541) at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531) at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87) at akka.remote.EndpointWriter.postStop(Endpoint.scala:561) at akka.actor.Actor$class.aroundPostStop(Actor.scala:475) at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415) at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210) at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172) at akka.actor.ActorCell.terminate(ActorCell.scala:369) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462) ... 
8 more ^CTraceback (most recent call last): File python/pyspark/tests.py, line 1627, in module unittest.main() File //anaconda/lib/python2.7/unittest/main.py, line 95, in __init__ self.runTests() File //anaconda/lib/python2.7/unittest/main.py, line 232, in runTests self.result = testRunner.run(self.test) File //anaconda/lib/python2.7/unittest/runner.py, line 151, in run test(result) File //anaconda/lib/python2.7/unittest/suite.py, line 70, in __call__ return self.run(*args, **kwds) File //anaconda/lib/python2.7/unittest/suite.py, line 108, in run test(result) File //anaconda/lib/python2.7/unittest/suite.py, line 70, in __call__ return self.run(*args, **kwds) File //anaconda/lib/python2.7/unittest/suite.py, line
[jira] [Commented] (SPARK-4186) Support binaryFiles and binaryRecords API in Python
[ https://issues.apache.org/jira/browse/SPARK-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195054#comment-14195054 ] Apache Spark commented on SPARK-4186: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3078 Support binaryFiles and binaryRecords API in Python --- Key: SPARK-4186 URL: https://issues.apache.org/jira/browse/SPARK-4186 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Reporter: Matei Zaharia Assignee: Davies Liu After SPARK-2759, we should expose these methods in Python. Shouldn't be too hard to add. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4211) Spark POM hive-0.13.1 profile sets incorrect hive version property
[ https://issues.apache.org/jira/browse/SPARK-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4211. - Resolution: Fixed Issue resolved by pull request 3072 [https://github.com/apache/spark/pull/3072] Spark POM hive-0.13.1 profile sets incorrect hive version property -- Key: SPARK-4211 URL: https://issues.apache.org/jira/browse/SPARK-4211 Project: Spark Issue Type: Improvement Components: Build Reporter: Fi Fix For: 1.2.0 The fix in SPARK-3826 added a new maven profile 'hive-0.13.1'. By default, it sets the maven property to `hive.version=0.13.1a`. This special hive version resolves dependency issues with Hive 0.13+ However, when explicitly specifying the hive-0.13.1 maven profile, the 'hive.version=0.13.1' property would be set instead of 'hive.version=0.13.1a' e.g. mvn -Phive -Phive=0.13.1 Also see: https://github.com/apache/spark/pull/2685 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3541) Improve ALS internal storage
[ https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3541: - Target Version/s: 1.3.0 (was: 1.2.0) Improve ALS internal storage Key: SPARK-3541 URL: https://issues.apache.org/jira/browse/SPARK-3541 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Original Estimate: 96h Remaining Estimate: 96h The internal storage of ALS uses many small objects, which increases the GC pressure and makes ALS difficult to scale to very large datasets, e.g., 50 billion ratings. In such cases, a full GC may take more than 10 minutes to finish. That is longer than the default heartbeat timeout, and hence executors will be removed under default settings. We can use primitive arrays to reduce the number of objects significantly. This requires a big change to the ALS implementation.
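The object-count problem SPARK-3541 describes can be illustrated outside Spark. A hedged Python sketch (not the actual ALS code; the layout and names are invented) contrasting one-object-per-rating storage with a single flat primitive buffer:

```python
from array import array

RANK = 4          # length of each factor vector (illustrative)
N_RATINGS = 6     # pretend dataset

# Object-heavy layout: one small Python object per rating, so the garbage
# collector has millions of tiny objects to trace at scale.
ratings_as_objects = [
    {"user": r % 3, "item": r // 3, "factors": [0.0] * RANK}
    for r in range(N_RATINGS)
]

# Primitive layout: one contiguous buffer of doubles; the collector sees
# O(1) tracked objects no matter how many ratings there are.
factors_flat = array("d", [0.0] * (N_RATINGS * RANK))

def factor_slice(row):
    """Factor vector of the row-th rating, taken from the flat buffer."""
    return factors_flat[row * RANK:(row + 1) * RANK]
```

The same trade-off holds on the JVM: `Array[Double]` is one heap object, while per-rating case classes each carry object headers the GC must trace.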
[jira] [Resolved] (SPARK-4207) Query which has syntax like 'not like' is not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4207. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3075 [https://github.com/apache/spark/pull/3075] Query which has syntax like 'not like' is not working in Spark SQL -- Key: SPARK-4207 URL: https://issues.apache.org/jira/browse/SPARK-4207 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Ravindra Pesala Assignee: Ravindra Pesala Fix For: 1.2.0 Queries which have 'not like' are not working in Spark SQL. The same works in Spark HiveQL. {code} sql("SELECT * FROM records where value not like 'val%'") {code} The above query fails with the below exception {code} Exception in thread "main" java.lang.RuntimeException: [1.39] failure: ``IN'' expected but `like' found SELECT * FROM records where value not like 'val%' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:186) {code}
[jira] [Resolved] (SPARK-3594) try more rows during inferSchema
[ https://issues.apache.org/jira/browse/SPARK-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3594. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2716 [https://github.com/apache/spark/pull/2716] try more rows during inferSchema Key: SPARK-3594 URL: https://issues.apache.org/jira/browse/SPARK-3594 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 If there are some empty values in the first row of an RDD of Row, inferSchema will fail. It's better to try more rows and combine them together.
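The merging idea behind SPARK-3594 can be sketched in a few lines of Python (illustrative only, not the actual PySpark `inferSchema` code): infer a partial type list per row, treat empty values as unknown, and merge across a small sample so later rows fill the gaps the first row leaves.

```python
from functools import reduce

def infer_row(row):
    # Per-row partial schema: None marks a column whose type is unknown.
    return [None if v is None else type(v).__name__ for v in row]

def merge(a, b):
    # Keep whichever sample actually carried a value for each column.
    return [x if x is not None else y for x, y in zip(a, b)]

def infer_schema(rows, sample=10):
    # Combining several rows fills gaps that the first row alone leaves.
    return reduce(merge, (infer_row(r) for r in rows[:sample]))

rows = [(1, None), (2, "a"), (3, "b")]
infer_row(rows[0])   # ['int', None]  -- the first row alone is not enough
infer_schema(rows)   # ['int', 'str']
```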
[jira] [Commented] (SPARK-4210) Add Extra-Trees algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195108#comment-14195108 ] Manish Amde commented on SPARK-4210: [~0asa] Thanks for creating the JIRA. From the scikit-learn documentation: "As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias." This might lead to interesting implementation tradeoffs. Could you please discuss how you plan to implement the findBestSplit method for this? Also, please note down the related literature (it's a relatively new algorithm) so that people not familiar with this algorithm can understand its suitability for MLlib. [~mengxr] Could you please assign the ticket to [~0asa]? Add Extra-Trees algorithm to MLlib -- Key: SPARK-4210 URL: https://issues.apache.org/jira/browse/SPARK-4210 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Vincent Botta This task will add Extra-Trees support to Spark MLlib. The implementation could be inspired by the current Random Forest algorithm. This algorithm is expected to be particularly suited as sorting of attributes is not required, as opposed to the original Random Forest approach (with similar and/or better predictive power). The task involves: - Code implementation - Unit tests - Functional tests - Performance tests - Documentation
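A minimal Python sketch of the split rule quoted above (illustrative; `extra_trees_split` and the Gini scoring are my own framing, not a proposed MLlib API): draw one random threshold per candidate feature, with no sorting of attribute values, and keep the best-scoring random cut.

```python
import random

def gini(labels):
    # Binary-class Gini impurity; an empty side contributes nothing.
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def extra_trees_split(X, y, features, rng):
    """One random threshold per candidate feature (no sorting);
    the best-scoring random cut becomes the split."""
    best = None
    for f in features:
        col = [row[f] for row in X]
        t = rng.uniform(min(col), max(col))
        left = [lab for v, lab in zip(col, y) if v < t]
        right = [lab for v, lab in zip(col, y) if v >= t]
        # Weighted impurity after the split; lower is better.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, t)
    return best  # (weighted impurity, feature index, threshold)

rng = random.Random(0)
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
score, feature, threshold = extra_trees_split(X, y, [0], rng)
```

Because no per-feature sort is needed, the inner loop is a single pass over the values, which is the property that makes the approach attractive for a distributed implementation.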
[jira] [Commented] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM
[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195146#comment-14195146 ] Cody Koeninger commented on SPARK-3146: --- I think this PR is an elegant way to solve SPARK-2388, which is an otherwise blocking bug for our usage of kafka. Absent a concrete design for doing something equivalent for all InputDStreams, I'd encourage merging it. Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM Key: SPARK-3146 URL: https://issues.apache.org/jira/browse/SPARK-3146 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Saisai Shao Currently the Spark Streaming Kafka API stores the key and value of each message into BM for processing, which potentially loses flexibility for different requirements: 1. Currently topic/partition/offset information for each message is discarded by KafkaInputDStream. In some scenarios people may need this information to better filter the messages, as SPARK-2388 describes. 2. People may need to add a timestamp to each message when feeding it into Spark Streaming, which can better measure the system latency. 3. Checkpointing the partition/offsets, or others... So here we add a messageHandler to the interface to give people the flexibility to preprocess the message before storing it into BM. In the meantime, this improvement keeps compatibility with the current API.
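The proposed hook is easy to picture. An illustrative Python sketch (names such as `MessageAndMetadata` and `receive` are stand-ins, not Spark's actual API): the user-supplied handler sees topic/partition/offset alongside key and value, and only its result is handed to storage.

```python
import time
from typing import Any, Callable, NamedTuple

class MessageAndMetadata(NamedTuple):
    # Hypothetical record of what a handler would see, instead of just (key, value).
    topic: str
    partition: int
    offset: int
    key: Any
    value: Any

def receive(messages, message_handler: Callable[[MessageAndMetadata], Any]):
    """Run the user handler on each message (metadata included) and store
    only the handler's result -- the preprocessing hook the ticket proposes."""
    block_manager = []          # stands in for Spark's BlockManager
    for m in messages:
        block_manager.append(message_handler(m))
    return block_manager

# Use cases 1 and 2 from the ticket: keep the offset, stamp arrival time.
def handler(m):
    return {"offset": m.offset, "value": m.value, "received_at": time.time()}

msgs = [MessageAndMetadata("logs", 0, i, None, "v%d" % i) for i in range(3)]
stored = receive(msgs, handler)
```

A handler that returns the plain `(key, value)` pair recovers the old behavior, which is how the proposal stays compatible with the existing API.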
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195168#comment-14195168 ] Apache Spark commented on SPARK-1021: - User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/3079 sortByKey() launches a cluster job when it shouldn't Key: SPARK-1021 URL: https://issues.apache.org/jira/browse/SPARK-1021 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0 Reporter: Andrew Ash Assignee: Erik Erlandson Labels: starter Fix For: 1.2.0 The sortByKey() method is listed as a transformation, not an action, in the documentation. But it launches a cluster job regardless. http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method. https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 Josh Rosen suggests that rangeBounds should be made into a lazy variable: {quote} I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD.
It breaks support for the case where two range partitioners created on different RDDs happened to have the same rangeBounds(), but it seems unlikely that this would really harm performance since it's probably unlikely that the range partitioners are equal by chance. {quote} Can we please make this happen? I'll send a PR on GitHub to start the discussion and testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
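The suggestion quoted above can be modeled outside Spark. In this illustrative Python sketch (not Spark's Scala code), the sampling job is simulated by a counter, bounds are computed lazily on first access, and equality deliberately compares only cheap identity fields so it can never force the job:

```python
class RangePartitioner:
    """Model of the proposed fix: bounds are computed on first use, so
    constructing or comparing partitioners launches no job."""

    def __init__(self, num_partitions, rdd_id, sample_fn):
        self.num_partitions = num_partitions
        self.rdd_id = rdd_id
        self._sample_fn = sample_fn   # stands in for the rdd.count()/sampling job
        self._bounds = None
        self.jobs_run = 0             # counts simulated cluster jobs

    @property
    def range_bounds(self):
        if self._bounds is None:
            self.jobs_run += 1        # in real Spark, a cluster job runs here
            self._bounds = self._sample_fn()
        return self._bounds

    def __eq__(self, other):
        # Compare cheap identity fields only, never range_bounds, so that
        # equals() cannot trigger the sampling job.
        return (self.num_partitions, self.rdd_id) == \
               (other.num_partitions, other.rdd_id)

p1 = RangePartitioner(4, rdd_id=7, sample_fn=lambda: [10, 20, 30])
p2 = RangePartitioner(4, rdd_id=7, sample_fn=lambda: [10, 20, 30])
p1 == p2                   # comparison launches no job
bounds = p1.range_bounds   # first access runs the simulated job exactly once
```

As the comment notes, this makes two partitioners with coincidentally equal bounds compare unequal, which is the accepted trade-off.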
[jira] [Resolved] (SPARK-4152) Avoid data change in CTAS while table already existed
[ https://issues.apache.org/jira/browse/SPARK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4152. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3013 [https://github.com/apache/spark/pull/3013] Avoid data change in CTAS while table already existed - Key: SPARK-4152 URL: https://issues.apache.org/jira/browse/SPARK-4152 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Priority: Minor Fix For: 1.2.0 CREATE TABLE t1 (a String); CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception CREATE TABLE if not exists t1 AS SELECT key FROM src; -- expected to do nothing, but actually overwrites t1.
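The expected semantics from SPARK-4152 can be stated as a tiny Python model (illustrative, with an in-memory dict standing in for the catalog):

```python
def ctas(catalog, table, rows, if_not_exists=False):
    """Expected CTAS semantics from the ticket: plain CTAS on an existing
    table raises; CTAS IF NOT EXISTS is a no-op that must not touch the data."""
    if table in catalog:
        if if_not_exists:
            return False              # do nothing, keep the existing rows
        raise ValueError("table %r already exists" % table)
    catalog[table] = list(rows)
    return True

catalog = {"t1": ["old"]}
ctas(catalog, "t1", ["new"], if_not_exists=True)  # no-op
catalog["t1"]                                     # still ["old"]
```

The bug was that the IF NOT EXISTS branch behaved like an overwrite instead of the no-op shown here.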
[jira] [Created] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
Terry Siu created SPARK-4213: Summary: SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from some table limit 1; insert into table sparkbug select 2, '2012-01-01' from some table limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf("spark.sql.shuffle.partitions", "10") hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") import hc._ hc.hql("select * from db.sparkbug where event >= '2011-12-01'") A scala.MatchError will appear in the output.
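The shape of this bug is a type-dispatch table with a missing entry. A hedged Python analogue (not the Scala ParquetFilters code; all names are invented): the filter lookup fails for StringType until that case is registered.

```python
# Comparison-filter factories keyed by column type. StringType missing from
# the table mimics the scala.MatchError; registering it is the fix.
FILTER_BUILDERS = {
    "IntegerType": lambda col, v: ("lt", col, int(v)),
    "DoubleType": lambda col, v: ("lt", col, float(v)),
}

def make_lt_filter(col_type, col, value):
    builder = FILTER_BUILDERS.get(col_type)
    if builder is None:
        # Analogue of: scala.MatchError: StringType
        raise NotImplementedError("no LT filter for %s" % col_type)
    return builder(col, value)

try:
    make_lt_filter("StringType", "event", "2011-12-01")  # fails before the fix
except NotImplementedError:
    pass

# The fix: add the missing case.
FILTER_BUILDERS["StringType"] = lambda col, v: ("lt", col, str(v))
fixed = make_lt_filter("StringType", "event", "2011-12-01")
```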
[jira] [Updated] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
[ https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4213: Priority: Blocker (was: Major) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators - Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Priority: Blocker Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from some table limit 1; insert into table sparkbug select 2, '2012-01-01' from some table limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf("spark.sql.shuffle.partitions", "10") hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") import hc._ hc.hql("select * from db.sparkbug where event >= '2011-12-01'") A scala.MatchError will appear in the output.
[jira] [Resolved] (SPARK-3238) Commas/spaces/dashes are not escaped properly when transferring schema information to parquet readers
[ https://issues.apache.org/jira/browse/SPARK-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3238. - Resolution: Fixed Assignee: Cheng Lian I think this was fixed by the conversion to JSON for serializing schema Commas/spaces/dashes are not escaped properly when transferring schema information to parquet readers - Key: SPARK-3238 URL: https://issues.apache.org/jira/browse/SPARK-3238 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2883: Target Version/s: 1.3.0 (was: 1.2.0) Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3618) Store analyzed plans for temp tables
[ https://issues.apache.org/jira/browse/SPARK-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3618: --- Assignee: Michael Armbrust Store analyzed plans for temp tables Key: SPARK-3618 URL: https://issues.apache.org/jira/browse/SPARK-3618 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Right now we store unanalyzed logical plans for temporary tables. However this means that changes to session state (e.g., the current database) could result in tables becoming inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3618) Store analyzed plans for temp tables
[ https://issues.apache.org/jira/browse/SPARK-3618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3618. - Resolution: Fixed Fix Version/s: 1.2.0 This was done as part of the caching overhaul. Store analyzed plans for temp tables Key: SPARK-3618 URL: https://issues.apache.org/jira/browse/SPARK-3618 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.2.0 Right now we store unanalyzed logical plans for temporary tables. However this means that changes to session state (e.g., the current database) could result in tables becoming inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Target Version/s: 1.3.0 (was: 1.2.0) Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian This can cause problems when for example one of the columns is defined as TINYINT. A class cast exception will be thrown since the parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Priority: Critical (was: Major) Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical This can cause problems when for example one of the columns is defined as TINYINT. A class cast exception will be thrown since the parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Target Version/s: 1.2.0 (was: 1.3.0) Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian This can cause problems when for example one of the columns is defined as TINYINT. A class cast exception will be thrown since the parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3440) HiveServer2 and CLI should retrieve Hive result set schema
[ https://issues.apache.org/jira/browse/SPARK-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3440: Target Version/s: 1.3.0 (was: 1.2.0) HiveServer2 and CLI should retrieve Hive result set schema -- Key: SPARK-3440 URL: https://issues.apache.org/jira/browse/SPARK-3440 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: Cheng Lian When executing Hive native queries/commands with {{HiveContext.runHive}}, Spark SQL only calls {{Driver.getResults}} and returns a {{Seq\[String\]}}. The schema of the result set is not retrieved, so it is not possible to split the row string into proper columns or to assign column names to them. For example, currently every {{NativeCommand}} returns only a single column named {{result}}. This breaks compatibility for existing Hive applications that rely on result set schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195240#comment-14195240 ] Michael Armbrust commented on SPARK-2710: - Now that it's been merged, it would be great if this feature could be implemented using the DataSource API. Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class) --- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without being given a case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the MetaData from the PreparedStatement (reading metadata does not actually require executing the query). Then, in Spark SQL, SQLContext can create a SchemaRDD from the JdbcRDD and its MetaData. In the future, maybe we can add a feature to the sql-shell, so that users can use the spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like what Facebook's Presto engine does. Oh, and there is a small bug in JdbcRDD, in compute(), method close() {code} if (null != conn && ! stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && ! conn.isClosed()) conn.close() {code} just a small typo :) but such a close method will never be able to close conn... 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
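The JdbcRDD close() bug quoted above is worth seeing concretely. A plain-Python sketch (JdbcRDD itself is Scala): the connection guard mistakenly re-checks the statement's state, and since the statement has already been closed by that point, the condition is always false and the connection is leaked.

```python
# Sketch of the guard logic from JdbcRDD.compute()'s close() method, with
# stand-in resources instead of real JDBC objects. close_buggy mirrors the
# reported code; close_fixed mirrors the proposed correction.

class Resource:
    """Stand-in for a JDBC Statement or Connection."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

def close_buggy(stmt, conn):
    if stmt is not None and not stmt.closed:
        stmt.close()
    # Bug: tests the statement, which is closed by now, so this never runs.
    if conn is not None and not stmt.closed:
        conn.close()

def close_fixed(stmt, conn):
    if stmt is not None and not stmt.closed:
        stmt.close()
    # Fix: test the connection's own state.
    if conn is not None and not conn.closed:
        conn.close()

stmt, conn = Resource(), Resource()
close_buggy(stmt, conn)
assert conn.closed is False  # connection leaked, as the comment describes

stmt, conn = Resource(), Resource()
close_fixed(stmt, conn)
assert conn.closed is True
```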
[jira] [Updated] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2710: Target Version/s: 1.3.0 (was: 1.2.0) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class) --- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without being given a case class to define the schema); as a SQL component, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala so I want to make a small change in this file to allow SQLContext to read the MetaData from the PreparedStatement (reading metadata does not actually require executing the query). Then, in Spark SQL, SQLContext can create a SchemaRDD from the JdbcRDD and its MetaData. In the future, maybe we can add a feature to the sql-shell, so that users can use the spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like what Facebook's Presto engine does. Oh, and there is a small bug in JdbcRDD, in compute(), method close() {code} if (null != conn && ! stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && ! conn.isClosed()) conn.close() {code} just a small typo :) but such a close method will never be able to close conn... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2902) Change default options to be more agressive
[ https://issues.apache.org/jira/browse/SPARK-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2902. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Michael Armbrust (was: Cheng Lian) Change default options to be more agressive --- Key: SPARK-2902 URL: https://issues.apache.org/jira/browse/SPARK-2902 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Michael Armbrust Fix For: 1.2.0 Compression for in-memory columnar storage is disabled by default, it's time to enable it. Also, it help alleviating OOM mentioned in SPARK-2650 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2775) HiveContext does not support dots in column names.
[ https://issues.apache.org/jira/browse/SPARK-2775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2775: Target Version/s: 1.3.0 (was: 1.2.0) HiveContext does not support dots in column names. --- Key: SPARK-2775 URL: https://issues.apache.org/jira/browse/SPARK-2775 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai When you try the following snippet in hive/console. {code} val data = sc.parallelize(Seq("""{"key.number1": "value1", "key.number2": "value2"}""")) jsonRDD(data).registerAsTable("jt") hql("select `key.number1` from jt") {code} You will find that the name key.number1 cannot be resolved. {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'key.number1, tree: Project ['key.number1] LowerCaseSchema Subquery jt SparkLogicalPlan (ExistingRdd [key.number1#8,key.number2#9], MappedRDD[17] at map at JsonRDD.scala:37) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195287#comment-14195287 ] René X Parra commented on SPARK-3720: - [~zhazhan] should this JIRA ticket be closed (marked as a duplicate of SPARK-2883)? support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1 a single file as the output of each task, which reduces the NameNode's load 2 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union) 3 light-weight indexes stored within the file (skip row groups that don't pass predicate filtering, seek to a given row) 4 block-mode compression based on data type (run-length encoding for integer columns, dictionary encoding for string columns) 5 concurrent reads of the same file using separate RecordReaders 6 ability to split files without scanning for markers 7 bound the amount of memory needed for reading or writing 8 metadata stored using Protocol Buffers, which allows addition and removal of fields Spark SQL now supports Parquet; supporting ORC would give people more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2360. - Resolution: Won't Fix Hey Hossein, I'm going to close this since I think we have decided this feature would work best as a separate library using the new Data Source API. CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
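The design questions listed in SPARK-2360 (separator detection, header inference) have a stdlib precedent worth noting: Python's csv.Sniffer. A short sketch of what "infer options from the first rows" could look like for a CSV data source; the sample data here is made up for illustration.

```python
# csv.Sniffer can both detect the delimiter and guess whether the first row
# is a header -- the same two inferences the issue asks about.
import csv
import io

sample = "id;name;score\n1;alice;9.5\n2;bob;7.0\n"

dialect = csv.Sniffer().sniff(sample)        # detects ';' as the delimiter
has_header = csv.Sniffer().has_header(sample)

rows = list(csv.reader(io.StringIO(sample), dialect))
header, data = (rows[0], rows[1:]) if has_header else (None, rows)

assert dialect.delimiter == ";"
assert has_header
assert header == ["id", "name", "score"]
assert data == [["1", "alice", "9.5"], ["2", "bob", "7.0"]]
```

Note that has_header is a heuristic (it compares the types of the first row against the rest), which is exactly why the issue also asks for an explicit option to override inference.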
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2870: Target Version/s: 1.3.0 (was: 1.2.0) Thorough schema inference directly on RDDs of Python dictionaries - Key: SPARK-2870 URL: https://issues.apache.org/jira/browse/SPARK-2870 Project: Spark Issue Type: Improvement Components: PySpark, SQL Reporter: Nicholas Chammas h4. Background I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set. This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example: {code} {"a": 5} {"a": "cow"} {code} To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well. h4. Feature Request What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this: {code} SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) {code} As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable. h4. Example Use Case * You have some JSON text data that you want to analyze using Spark SQL. * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation. * You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. 
You now have an RDD of dictionaries. * From this RDD, you want a SchemaRDD that captures the schema for the whole data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
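SPARK-2870's core idea can be sketched in plain Python (this is not the PySpark API, just the folding logic the request describes): infer a schema by iterating over every record rather than only the first one, widening conflicting value types to a common type.

```python
# Whole-data-set schema inference, the behavior the issue asks inferSchema()
# to have. Type names here are illustrative, not Spark's catalyst types.

def infer_type(value):
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"

def merge(t1, t2):
    """Widen two field types; on conflict fall back to string."""
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    return t1 if t1 == t2 else "string"

def infer_schema(records):
    schema = {}
    for rec in records:            # fold over the WHOLE data set, not records[0]
        for key, value in rec.items():
            schema[key] = merge(schema.get(key), infer_type(value))
    return schema

# The example from the issue: "a" holds an int in one record, a string in another.
data = [{"a": 5}, {"a": "cow", "b": 1.0}]
assert infer_schema(data) == {"a": "string", "b": "double"}
```

The interim workaround quoted in the issue achieves the same effect by round-tripping through JSON text, i.e. `SQLContext.jsonRDD(rdd.map(lambda x: json.dumps(x)))`, at the cost of a serialize/deserialize pass.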
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195294#comment-14195294 ] Josh Rosen commented on SPARK-4133: --- Do you happen to be creating multiple running SparkContexts in the same JVM by any chance? PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 -- Key: SPARK-4133 URL: https://issues.apache.org/jira/browse/SPARK-4133 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Antonio Jesus Navarro Priority: Blocker Snappy related problems found when trying to upgrade existing Spark Streaming App from 1.0.2 to 1.1.0. We can not run an existing 1.0.2 spark app if upgraded to 1.1.0 IOException is thrown by snappy (parsing_error(2)) {code} Executor task launch worker-0 DEBUG storage.BlockManager - Getting local block broadcast_0 Executor task launch worker-0 DEBUG storage.BlockManager - Level for block broadcast_0 is StorageLevel(true, true, false, true, 1) Executor task launch worker-0 DEBUG storage.BlockManager - Getting block broadcast_0 from memory Executor task launch worker-0 DEBUG storage.BlockManager - Getting local block broadcast_0 Executor task launch worker-0 DEBUG executor.Executor - Task 0's epoch is 0 Executor task launch worker-0 DEBUG storage.BlockManager - Block broadcast_0 not registered locally Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Started reading broadcast variable 0 sparkDriver-akka.actor.default-dispatcher-4 INFO receiver.ReceiverSupervisorImpl - Registered receiver 0 Executor task launch worker-0 INFO util.RecurringTimer - Started timer for BlockGenerator at time 1414656492400 Executor task launch worker-0 INFO receiver.BlockGenerator - Started BlockGenerator Thread-87 INFO receiver.BlockGenerator - Started block pushing thread Executor task launch worker-0 INFO receiver.ReceiverSupervisorImpl - Starting receiver sparkDriver-akka.actor.default-dispatcher-5 INFO scheduler.ReceiverTracker 
- Registered receiver for stream 0 from akka://sparkDriver Executor task launch worker-0 INFO kafka.KafkaReceiver - Starting Kafka Consumer Stream with group: stratioStreaming Executor task launch worker-0 INFO kafka.KafkaReceiver - Connecting to Zookeeper: node.stratio.com:2181 sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters] sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters] sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters] sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] handled message (8.442354 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters] sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] handled message (8.412421 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters] sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] handled message (8.385471 ms) StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters] Executor task launch worker-0 INFO utils.VerifiableProperties - Verifying properties Executor task launch worker-0 INFO utils.VerifiableProperties - Property group.id is overridden to stratioStreaming Executor task launch worker-0 INFO utils.VerifiableProperties - Property zookeeper.connect is overridden to node.stratio.com:2181 Executor task launch worker-0 INFO utils.VerifiableProperties - Property zookeeper.connection.timeout.ms is overridden to 1 
Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Reading broadcast variable 0 took 0.033998997 s Executor task launch worker-0 INFO consumer.ZookeeperConsumerConnector - [stratioStreaming_ajn-stratio-1414656492293-8ecb3e3a], Connecting to zookeeper instance at node.stratio.com:2181 Executor task launch worker-0 DEBUG zkclient.ZkConnection - Creating new ZookKeeper instance to connect to node.stratio.com:2181. ZkClient-EventThread-169-node.stratio.com:2181 INFO zkclient.ZkEventThread - Starting ZkClient event thread. Executor task launch worker-0 INFO zookeeper.ZooKeeper - Initiating client connection, connectString=node.stratio.com:2181
[jira] [Updated] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions
[ https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2863: Target Version/s: 1.3.0 (was: 1.2.0) Emulate Hive type coercion in native reimplementations of Hive functions Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As [Michael Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195301#comment-14195301 ] René X Parra commented on SPARK-2883: - [~marmbrus] I see this was changed from Version 1.2.0 to Version 1.3.0 ... what is acceptance criteria to get this into 1.2.0 (or now, apparently 1.3.0) ? Perhaps [~zhazhan] you can provide some guidance on what needs to be done? Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195308#comment-14195308 ] Michael Armbrust commented on SPARK-2883: - The merge deadline for 1.2.0 was on Saturday so only critical bug fixes can go in at this point. If there is a bug and a fix for using the ORC SerDe I would still consider including that in 1.2.0. Regarding the native support similar to what is already done for parquet, the existing PR needs to be updated to use the Data Sources API that was added in 1.2.0. I'll have time to do a more thorough review of that PR after we release 1.2.0 Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4178) Hadoop input metrics ignore bytes read in RecordReader instantiation
[ https://issues.apache.org/jira/browse/SPARK-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4178. Resolution: Fixed Assignee: Sandy Ryza Hadoop input metrics ignore bytes read in RecordReader instantiation Key: SPARK-4178 URL: https://issues.apache.org/jira/browse/SPARK-4178 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4214) With dynamic allocation, avoid outstanding requests for more executors than pending tasks need
Sandy Ryza created SPARK-4214: - Summary: With dynamic allocation, avoid outstanding requests for more executors than pending tasks need Key: SPARK-4214 URL: https://issues.apache.org/jira/browse/SPARK-4214 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Dynamic allocation tries to allocate more executors while we have pending tasks remaining. Our current policy can end up with more outstanding executor requests than needed to fulfill all the pending tasks. Capping the executor requests to the number of cores needed to fulfill all pending tasks would make dynamic allocation behavior less sensitive to settings for maxExecutors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
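The capping policy proposed in SPARK-4214 can be written down as a small formula. This is a back-of-the-envelope sketch under assumed parameters (cores per executor, CPUs per task), not Spark's actual allocation code:

```python
# Cap outstanding executor requests at what pending tasks can actually use.
# All parameter names here are illustrative, not Spark config keys.
import math

def max_useful_executors(pending_tasks, cpus_per_task=1, cores_per_executor=4):
    """Executors needed so that every pending task could get its cores."""
    cores_needed = pending_tasks * cpus_per_task
    return math.ceil(cores_needed / cores_per_executor)

def capped_request(desired, pending_tasks, running_executors,
                   cpus_per_task=1, cores_per_executor=4, max_executors=100):
    """Bound a request by both maxExecutors and the pending-task cap."""
    useful = max_useful_executors(pending_tasks, cpus_per_task, cores_per_executor)
    headroom = max(useful - running_executors, 0)
    return min(desired, headroom, max_executors - running_executors)

# 10 pending single-core tasks on 4-core executors need at most 3 executors;
# with 1 already running, at most 2 additional requests are useful.
assert max_useful_executors(10) == 3
assert capped_request(desired=8, pending_tasks=10, running_executors=1) == 2
```

With such a cap in place, the outstanding-request count tracks the workload rather than ramping toward maxExecutors, which is exactly the reduced sensitivity the issue describes.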
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195402#comment-14195402 ] Zhan Zhang commented on SPARK-3720: --- [~neoword] wangfei and I are working together to solve this issue. The initial consolidation is done, but some features are not available yet, e.g., predicate pushdown. By the way, my original patch only works in local mode, and the predicate pushdown is not working as expected due to a bug. After the current hive-0.13.1 support upstream is stable, I will apply the diff to this PR so that it has full feature support. support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1 a single file as the output of each task, which reduces the NameNode's load 2 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union) 3 light-weight indexes stored within the file (skip row groups that don't pass predicate filtering, seek to a given row) 4 block-mode compression based on data type (run-length encoding for integer columns, dictionary encoding for string columns) 5 concurrent reads of the same file using separate RecordReaders 6 ability to split files without scanning for markers 7 bound the amount of memory needed for reading or writing 8 metadata stored using Protocol Buffers, which allows addition and removal of fields Spark SQL now supports Parquet; supporting ORC would give people more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195427#comment-14195427 ] Zhan Zhang commented on SPARK-2883: --- [~neoword], as [~marmbrus] mentioned, the PR needs to be restructured to fit the Data Sources API. Wangfei and I have consolidated our work into https://github.com/apache/spark/pull/2576. The major missing part in that patch is predicate pushdown, which had some problems in my old patch. After hive-0.13 support is stable, I will add predicate pushdown to the PR so it has full feature support. Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
[ https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195466#comment-14195466 ] Apache Spark commented on SPARK-3797: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3082 Run the shuffle service inside the YARN NodeManager as an AuxiliaryService -- Key: SPARK-3797 URL: https://issues.apache.org/jira/browse/SPARK-3797 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or It's also worth considering running the shuffle service in a YARN container beside the executor(s) on each node.
[jira] [Created] (SPARK-4215) Allow requesting executors only on Yarn (for now)
Andrew Or created SPARK-4215: Summary: Allow requesting executors only on Yarn (for now) Key: SPARK-4215 URL: https://issues.apache.org/jira/browse/SPARK-4215 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently if the user attempts to call `sc.requestExecutors` on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception.
[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4215: - Description: Currently if the user attempts to call `sc.requestExecutors` or enables dynamic allocation on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception. (was: Currently if the user attempts to call `sc.requestExecutors` on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception.) Allow requesting executors only on Yarn (for now) - Key: SPARK-4215 URL: https://issues.apache.org/jira/browse/SPARK-4215 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently if the user attempts to call `sc.requestExecutors` or enables dynamic allocation on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception.
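The proposed behavior — warn or throw instead of failing silently — could look roughly like the following plain-Scala sketch. None of these names exist in Spark; the helper and its signature are hypothetical:

```scala
// Hypothetical guard sketching SPARK-4215's proposal: when executor requests
// or dynamic allocation are used off-YARN, either throw or return a warning
// message instead of silently doing nothing. All names here are illustrative.
def checkDynamicAllocationSupported(master: String, failFast: Boolean): Option[String] =
  if (master.startsWith("yarn")) None // supported: nothing to report
  else if (failFast)
    throw new UnsupportedOperationException(
      s"Dynamic executor allocation is currently only supported on YARN (master=$master)")
  else
    Some(s"WARN: requestExecutors is a no-op with master=$master; only YARN is supported")
```

Whether to log or throw is exactly the open question in the issue; the `failFast` flag here just makes both options visible in one sketch.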
[jira] [Commented] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
[ https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195480#comment-14195480 ] Apache Spark commented on SPARK-4213: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/3083 SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators - Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Priority: Blocker Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of the LT, LTE, GT, or GTE operators, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from some table limit 1; insert into table sparkbug select 2, '2012-01-01' from some table limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf("spark.sql.shuffle.partitions", "10") hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") import hc._ hc.hql("select * from db.sparkbug where event = '2011-12-01'") A scala.MatchError will appear in the output.
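The scala.MatchError itself is easy to reproduce in miniature: a match expression with no case for its scrutinee throws it at runtime. A self-contained sketch with mock types (this is not Spark's actual ParquetFilters code, just an illustration of the failure mode):

```scala
// Mock type hierarchy standing in for Catalyst's DataType (illustrative only).
sealed trait MockDataType
case object MockIntegerType extends MockDataType
case object MockStringType  extends MockDataType

// Like the reported bug: the LT filter builder has no case for the string
// type, so passing it throws scala.MatchError instead of building a filter.
def makeLtFilter(dt: MockDataType): String = dt match {
  case MockIntegerType => "lt-int-filter"
  // MockStringType is deliberately missing
}
```

The compiler only warns about the non-exhaustive match, so the gap surfaces as a runtime MatchError — exactly the symptom in the report.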
[jira] [Commented] (SPARK-4216) Eliminate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195482#comment-14195482 ] Nicholas Chammas commented on SPARK-4216: - cc [~shaneknapp] [~joshrosen] Eliminate Jenkins GitHub posts from AMPLab -- Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Updated] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-4216: Summary: Eliminate duplicate Jenkins GitHub posts from AMPLab (was: Eliminate Jenkins GitHub posts from AMPLab) Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366]
[jira] [Commented] (SPARK-4166) Display the executor ID in the Web UI when ExecutorLostFailure happens
[ https://issues.apache.org/jira/browse/SPARK-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195542#comment-14195542 ] Apache Spark commented on SPARK-4166: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3085 Display the executor ID in the Web UI when ExecutorLostFailure happens -- Key: SPARK-4166 URL: https://issues.apache.org/jira/browse/SPARK-4166 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.2.0 Now when ExecutorLostFailure happens, it only displays "ExecutorLostFailure (executor lost)"
[jira] [Commented] (SPARK-4210) Add Extra-Trees algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195552#comment-14195552 ] Joseph K. Bradley commented on SPARK-4210: -- [~0asa] For the API, do you plan for this to be a new algorithm, or a set of new parameters for RandomForest? For the API, I could imagine a few options:
* RandomForest parameters: Provide ExtraTrees implicitly by allowing users to tweak parameters of RandomForest. This seems best if users would want to tweak several parameters on their own.
* RandomForest.trainExtraTrees() method: Provide a new train() method which calls RandomForest but constrains parameters to fit the ExtraTrees algorithm. This seems best if people would look for the ExtraTrees name and if we can simplify the API by constraining the set of parameters they can call ExtraTrees with. I vote for this choice, if possible.
* ExtraTrees class: Provide an entirely new class. This seems non-ideal to me.
For the implementation, I would hope it could be implemented by generalizing RandomForest with a new splitting option. I'm not sure how much change to the internals it would require; if it's too much, it might merit a new underlying implementation. Hopefully not though! What are your thoughts? Add Extra-Trees algorithm to MLlib -- Key: SPARK-4210 URL: https://issues.apache.org/jira/browse/SPARK-4210 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Vincent Botta This task will add Extra-Trees support to Spark MLlib. The implementation could be inspired by the current Random Forest algorithm. This algorithm is expected to be particularly well suited, as sorting of attributes is not required, as opposed to the original Random Forest approach (with similar and/or better predictive power).
The task involves: - Code implementation - Unit tests - Functional tests - Performance tests - Documentation
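The key implementation difference is in split selection: Random Forest searches candidate thresholds for the best split, while Extra-Trees draws each threshold uniformly at random between the feature's observed min and max. That selection step can be sketched in plain Scala (illustrative only, not MLlib code):

```scala
import scala.util.Random

// Extra-Trees split selection, sketched: pick a threshold uniformly at random
// in [min, max] of the feature values, instead of scanning for the best one.
def randomSplitThreshold(values: Seq[Double], rng: Random): Double = {
  require(values.nonEmpty, "need at least one feature value")
  val lo = values.min
  val hi = values.max
  lo + rng.nextDouble() * (hi - lo)
}
```

Because no per-feature sort or best-split scan is needed, attribute sorting can be skipped, which is the efficiency argument made in the issue description.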
[jira] [Commented] (SPARK-4163) When fetching blocks unsuccessfully, Web UI only displays Fetch failure
[ https://issues.apache.org/jira/browse/SPARK-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195575#comment-14195575 ] Apache Spark commented on SPARK-4163: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3086 When fetching blocks unsuccessfully, Web UI only displays Fetch failure - Key: SPARK-4163 URL: https://issues.apache.org/jira/browse/SPARK-4163 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.2.0 When fetching blocks unsuccessfully, the Web UI only displays "Fetch failure". It's hard to find out the cause of the fetch failure. The Web UI should display the stack trace for the fetch failure.
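Rendering a full stack trace for display is a one-liner with java.io; a sketch of the kind of helper such a fix needs (the name and placement are illustrative, not Spark's actual code):

```scala
import java.io.{PrintWriter, StringWriter}

// Render a Throwable's full stack trace as a String, so a UI can show
// the cause of a fetch failure rather than just "Fetch failure".
def stackTraceString(t: Throwable): String = {
  val sw = new StringWriter()
  t.printStackTrace(new PrintWriter(sw))
  sw.toString
}
```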
[jira] [Updated] (SPARK-611) Allow JStack to be run from web UI
[ https://issues.apache.org/jira/browse/SPARK-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-611: Affects Version/s: 1.0.0 Allow JStack to be run from web UI -- Key: SPARK-611 URL: https://issues.apache.org/jira/browse/SPARK-611 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Fix For: 1.2.0 Huge debugging improvement if the standalone mode dashboard can run jstack and show it on the web page for a slave.
[jira] [Closed] (SPARK-611) Allow JStack to be run from web UI
[ https://issues.apache.org/jira/browse/SPARK-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-611. --- Resolution: Fixed Fix Version/s: 1.2.0 Allow JStack to be run from web UI -- Key: SPARK-611 URL: https://issues.apache.org/jira/browse/SPARK-611 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Fix For: 1.2.0 Huge debugging improvement if the standalone mode dashboard can run jstack and show it on the web page for a slave.
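A JVM can produce a jstack-style dump of itself via Thread.getAllStackTraces, without shelling out to the jstack binary — a sketch of what such a UI endpoint could render (the formatting and function name are made up, not Spark's implementation):

```scala
import scala.jdk.CollectionConverters._

// jstack-like dump of all live threads in the current JVM: one block per
// thread with its name, daemon flag, state, and frames (format illustrative).
def threadDump(): String =
  Thread.getAllStackTraces.asScala.map { case (t, frames) =>
    val header = s""""${t.getName}" daemon=${t.isDaemon} state=${t.getState}"""
    (header +: frames.map(f => s"    at $f").toSeq).mkString("\n")
  }.mkString("\n\n")
```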
[jira] [Commented] (SPARK-2938) Support SASL authentication in Netty network module
[ https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195599#comment-14195599 ] Apache Spark commented on SPARK-2938: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3087 Support SASL authentication in Netty network module --- Key: SPARK-2938 URL: https://issues.apache.org/jira/browse/SPARK-2938 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Aaron Davidson Priority: Blocker
[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module
[ https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2938: - Target Version/s: 1.2.0 Support SASL authentication in Netty network module --- Key: SPARK-2938 URL: https://issues.apache.org/jira/browse/SPARK-2938 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Aaron Davidson Priority: Blocker
[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module
[ https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2938: - Affects Version/s: 1.1.0 Support SASL authentication in Netty network module --- Key: SPARK-2938 URL: https://issues.apache.org/jira/browse/SPARK-2938 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Aaron Davidson Priority: Blocker
[jira] [Updated] (SPARK-2938) Support SASL authentication in Netty network module
[ https://issues.apache.org/jira/browse/SPARK-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2938: - Affects Version/s: (was: 1.1.0) 1.2.0 Support SASL authentication in Netty network module --- Key: SPARK-2938 URL: https://issues.apache.org/jira/browse/SPARK-2938 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Aaron Davidson Priority: Blocker
[jira] [Created] (SPARK-4217) Result of SparkSQL is incorrect after a table join and group by operation
peter.zhang created SPARK-4217: -- Summary: Result of SparkSQL is incorrect after a table join and group by operation Key: SPARK-4217 URL: https://issues.apache.org/jira/browse/SPARK-4217 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: Hadoop 2.2.0, Spark 1.1 Reporter: peter.zhang Priority: Critical I ran a test using the same SQL script in the SparkSQL, Shark and Hive environments, as below --- select c.theyear, sum(b.amount) from tblstock a join tblStockDetail b on a.ordernumber = b.ordernumber join tbldate c on a.dateid = c.dateid group by c.theyear;
result of hive/shark:
theyear _c1
2004 1403018
2005 5557850
2006 7203061
2007 11300432
2008 12109328
2009 5365447
2010 188944
result of SparkSQL:
2010 210924
2004 3265696
2005 13247234
2006 13670416
2007 16711974
2008 14670698
2009 6322137
I'll attach test data soon
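For reference, the query's intended semantics can be modeled on toy data with plain Scala collections. The table and column names follow the report; the sample rows are made up:

```scala
// Model of: select c.theyear, sum(b.amount) from tblstock a
//   join tblStockDetail b on a.ordernumber = b.ordernumber
//   join tbldate c on a.dateid = c.dateid group by c.theyear
case class Stock(ordernumber: String, dateid: String)
case class StockDetail(ordernumber: String, amount: Long)
case class DateDim(dateid: String, theyear: Int)

def sumAmountByYear(stock: Seq[Stock],
                    details: Seq[StockDetail],
                    dates: Seq[DateDim]): Map[Int, Long] = {
  val yearOfDateid = dates.map(d => d.dateid -> d.theyear).toMap
  val joined = for {
    a <- stock
    b <- details if b.ordernumber == a.ordernumber  // join on ordernumber
    y <- yearOfDateid.get(a.dateid)                  // join on dateid
  } yield (y, b.amount)
  joined.groupBy(_._1).map { case (y, rows) => y -> rows.map(_._2).sum }
}
```

A correct engine must produce one summed amount per year from the joined rows; the report says Hive and Shark agree on that result while SparkSQL diverges.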
[jira] [Updated] (SPARK-4217) Result of SparkSQL is incorrect after a table join and group by operation
[ https://issues.apache.org/jira/browse/SPARK-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peter.zhang updated SPARK-4217: --- Description: I ran a test using the same SQL script in the SparkSQL, Shark and Hive environments, as below --- select c.theyear, sum(b.amount) from tblstock a join tblStockDetail b on a.ordernumber = b.ordernumber join tbldate c on a.dateid = c.dateid group by c.theyear; result of hive/shark: theyear _c1 2004 1403018 2005 5557850 2006 7203061 2007 11300432 2008 12109328 2009 5365447 2010 188944 result of SparkSQL: 2010 210924 2004 3265696 2005 13247234 2006 13670416 2007 16711974 2008 14670698 2009 6322137 was: I ran a test using the same SQL script in the SparkSQL, Shark and Hive environments, as below --- select c.theyear, sum(b.amount) from tblstock a join tblStockDetail b on a.ordernumber = b.ordernumber join tbldate c on a.dateid = c.dateid group by c.theyear; result of hive/shark: theyear _c1 2004 1403018 2005 5557850 2006 7203061 2007 11300432 2008 12109328 2009 5365447 2010 188944 result of SparkSQL: 2010 210924 2004 3265696 2005 13247234 2006 13670416 2007 16711974 2008 14670698 2009 6322137 I'll attach test data soon Result of SparkSQL is incorrect after a table join and group by operation - Key: SPARK-4217 URL: https://issues.apache.org/jira/browse/SPARK-4217 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: Hadoop 2.2.0, Spark 1.1 Reporter: peter.zhang Priority: Critical Attachments: TestScript.sql, saledata.zip I ran a test using the same SQL script in the SparkSQL, Shark and Hive environments, as below --- select c.theyear, sum(b.amount) from tblstock a join tblStockDetail b on a.ordernumber = b.ordernumber join tbldate c on a.dateid = c.dateid group by c.theyear; result of hive/shark: theyear _c1 2004 1403018 2005 5557850 2006 7203061 2007 11300432 2008 12109328 2009 5365447 2010 188944 result of SparkSQL: 2010 210924 2004 3265696 2005 13247234 2006 13670416 2007 16711974 2008 14670698 2009 6322137
[jira] [Resolved] (SPARK-4192) Support UDT in Python
[ https://issues.apache.org/jira/browse/SPARK-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4192. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3068 [https://github.com/apache/spark/pull/3068] Support UDT in Python - Key: SPARK-4192 URL: https://issues.apache.org/jira/browse/SPARK-4192 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Fix For: 1.2.0 This is a sub-task of SPARK-3572 for UDT support in Python.
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195711#comment-14195711 ] shane knapp commented on SPARK-4216: actually, you got the 'real' and 'impostor' jenkins bots backwards. amplab hosts not only spark, but many other projects as well. :) the github pull request builder (hereafter known as ghprb) only allows one bot per jenkins instance. spark works around this by using their own bot, with injected oauth tokens, which uses the ghprb/github api to post additional messages. the primary bot (amplab jenkins) also posts automagically. two solutions: 1) we could just use amplab jenkins, instead of spark qa. the other research projects do NOT want to use the sparkqa bot. 2) i'm sure that the ghprb folks wouldn't mind a PR to add multi-bot support. thoughts? personally, i'd lean towards (1). Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366]