[jira] [Commented] (SPARK-5387) parquet writer runs into OOM during writing when number of rows is large
[ https://issues.apache.org/jira/browse/SPARK-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366973#comment-14366973 ] Chaozhong Yang commented on SPARK-5387: --- I also encountered the same issue. parquet writer runs into OOM during writing when number of rows is large Key: SPARK-5387 URL: https://issues.apache.org/jira/browse/SPARK-5387 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.1.1 Reporter: Shirley Wu When the number of records in an RDD is large, saveAsParquetFile runs into an OOM. Here is the stack trace: 15/01/23 10:00:02 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, hdc2-s3.niara.com): java.lang.OutOfMemoryError: Java heap space parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65) parquet.bytes.CapacityByteArrayOutputStream.init(CapacityByteArrayOutputStream.java:57) parquet.column.values.rle.RunLengthBitPackingHybridEncoder.init(RunLengthBitPackingHybridEncoder.java:125) parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.init(RunLengthBitPackingHybridValuesWriter.java:36) parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61) parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:73) parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68) parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56) parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:124) parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:315) parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:106) parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:126) parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:117) parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:303) org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) It seems the writeShard() API needs to flush to disk periodically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
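Until the writer flushes more eagerly, one possible mitigation is to give each task less to buffer. The sketch below is only an illustration against the Spark 1.1/1.2-era SQLContext API, with hypothetical input/output paths and an illustrative partition count; smaller Parquet row groups and more output partitions reduce how much each ParquetRecordWriter holds in memory before a row group is flushed to disk.
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext
// Smaller row groups mean less data buffered per open column writer before a
// flush (Parquet's default block size is 128 MB; 64 MB here is illustrative).
sc.hadoopConfiguration.set("parquet.block.size", (64 * 1024 * 1024).toString)

val data = sqlContext.parquetFile("/path/to/input")    // hypothetical input path
// More partitions means fewer rows buffered per writer; 200 is illustrative.
val repartitioned = sqlContext.applySchema(data.repartition(200), data.schema)
repartitioned.saveAsParquetFile("/path/to/output")     // hypothetical output path
{code}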
[jira] [Commented] (SPARK-5818) unable to use add jar in hql
[ https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366811#comment-14366811 ] Venkata Ramana G commented on SPARK-5818: - TranslatingClassLoader is used for spark-shell, while Hive's current add jar works only with a URLClassLoader. So the jar has to be added directly to the Spark driver's class loader, or to its parent loader in the case of spark-shell. I am working on the same. unable to use add jar in hql -- Key: SPARK-5818 URL: https://issues.apache.org/jira/browse/SPARK-5818 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: pengxu In Spark 1.2.1 and 1.2.0, it's not possible to use the Hive command add jar in hql. It seems that the problem in SPARK-2219 still exists. The problem can be reproduced as described below. Suppose the jar file is named brickhouse-0.6.0.jar and is placed in the /tmp directory.
{code}
spark-shell> import org.apache.spark.sql.hive._
spark-shell> val sqlContext = new HiveContext(sc)
spark-shell> import sqlContext._
spark-shell> hql("add jar /tmp/brickhouse-0.6.0.jar")
{code}
The error message is shown below.
{code:title=Error Log}
15/02/15 01:36:31 ERROR SessionState: Unable to register /tmp/brickhouse-0.6.0.jar Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to java.net.URLClassLoader java.lang.ClassCastException: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to java.net.URLClassLoader at org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921) at org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599) at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658) at org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732) at org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717) at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74) at org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:37) at $line30.$read$$iwC$$iwC$$iwC$$iwC.init(console:39) at $line30.$read$$iwC$$iwC$$iwC.init(console:41) at $line30.$read$$iwC$$iwC.init(console:43) at
$line30.$read$$iwC.init(console:45) at $line30.$read.init(console:47) at $line30.$read$.init(console:51) at $line30.$read$.clinit(console) at $line30.$eval$.init(console:7) at $line30.$eval$.clinit(console) at $line30.$eval.$print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at
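Until the class-loader handling is fixed, a commonly used workaround (stated here as an assumption about the user's setup, not the eventual fix) is to hand the jar to the driver at launch time and register it with the SparkContext, rather than going through Hive's add jar:
{code}
// Launch the shell with the jar already on the driver classpath, e.g.
//   spark-shell --jars /tmp/brickhouse-0.6.0.jar
// and, from the running shell, also ship it to the executors:
sc.addJar("/tmp/brickhouse-0.6.0.jar") // sc is the shell's SparkContext
{code}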
[jira] [Created] (SPARK-6396) Add timeout control for broadcast
Jun Fang created SPARK-6396: --- Summary: Add timeout control for broadcast Key: SPARK-6396 URL: https://issues.apache.org/jira/browse/SPARK-6396 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Affects Versions: 1.3.0, 1.3.1 Reporter: Jun Fang Priority: Minor TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, which calls Await.result(result.future, Duration.Inf). In a production environment this may cause a hang when the driver and executor are in different data centers. A timeout here would be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
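A minimal sketch of the kind of change proposed, using plain scala.concurrent; the timeout value, and the idea that it would come from a configuration property, are assumptions rather than the actual patch:
{code}
import scala.concurrent.duration._
import scala.concurrent.{Await, Promise}

// `result` stands in for the Promise that the asynchronous block-fetch
// callback inside fetchBlockSync completes.
val result = Promise[Array[Byte]]()
result.success(Array[Byte](1, 2, 3)) // simulate the fetch completing

// Instead of Await.result(result.future, Duration.Inf), bound the wait so a
// stuck fetch surfaces as a java.util.concurrent.TimeoutException rather than
// hanging the caller forever.
val fetchTimeout = 120.seconds // assumed to come from a configuration property
val bytes = Await.result(result.future, fetchTimeout)
{code}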
[jira] [Commented] (SPARK-6388) Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40
[ https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366982#comment-14366982 ] Sean Owen commented on SPARK-6388: -- I am just using JDK 8 to compile, not targeting Java 8 in javac, and running on Java 8. I don't think there is a Java 8 problem in how Spark uses Hadoop or how Spark runs. You can see Spark + Hadoop even has Java 8-specific tests you can enable if you want. Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40 -- Key: SPARK-6388 URL: https://issues.apache.org/jira/browse/SPARK-6388 Project: Spark Issue Type: Bug Components: Block Manager, Spark Submit, YARN Affects Versions: 1.3.0 Environment: 1. Linux version 3.16.0-30-generic (buildd@komainu) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #40-Ubuntu SMP Mon Jan 12 22:06:37 UTC 2015 2. Oracle Java 8 update 40 for Linux X64 3. Scala 2.10.5 4. Hadoop 2.6 (pre-built version) Reporter: John Original Estimate: 24h Remaining Estimate: 24h I built Apache Spark 1.3 manually.
---
JAVA_HOME=PATH_TO_JAVA8 mvn clean package -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests
---
Something goes wrong; akka always tells me
---
15/03/17 21:28:10 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@Server2:42161] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
---
I built another version of Spark 1.3 + Hadoop 2.6 under Java 7. Everything goes well. Logs
---
15/03/17 21:27:06 INFO spark.SparkContext: Running Spark version 1.3.0 15/03/17 21:27:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/17 21:27:08 INFO spark.SecurityManager: Changing view Servers to: hduser 15/03/17 21:27:08 INFO spark.SecurityManager: Changing modify Servers to: hduser 15/03/17 21:27:08 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui Servers disabled; users with view permissions: Set(hduser); users with modify permissions: Set(hduser) 15/03/17 21:27:08 INFO slf4j.Slf4jLogger: Slf4jLogger started 15/03/17 21:27:08 INFO Remoting: Starting remoting 15/03/17 21:27:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@Server3:37951] 15/03/17 21:27:09 INFO util.Utils: Successfully started service 'sparkDriver' on port 37951. 15/03/17 21:27:09 INFO spark.SparkEnv: Registering MapOutputTracker 15/03/17 21:27:09 INFO spark.SparkEnv: Registering BlockManagerMaster 15/03/17 21:27:09 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-0db692bb-cd02-40c8-a8f0-3813c6da18e2/blockmgr-a1d0ad23-ab76-4177-80a0-a6f982a64d80 15/03/17 21:27:09 INFO storage.MemoryStore: MemoryStore started with capacity 265.1 MB 15/03/17 21:27:09 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-502ef3f8-b8cd-45cf-b1df-97df297cdb35/httpd-6303e24d-4b2b-4614-bb1d-74e8d331189b 15/03/17 21:27:09 INFO spark.HttpServer: Starting HTTP Server 15/03/17 21:27:09 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/17 21:27:10 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:48000 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'HTTP file server' on port 48000. 15/03/17 21:27:10 INFO spark.SparkEnv: Registering OutputCommitCoordinator 15/03/17 21:27:10 INFO server.Server: jetty-8.y.z-SNAPSHOT 15/03/17 21:27:10 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/03/17 21:27:10 INFO ui.SparkUI: Started SparkUI at http://Server3:4040 15/03/17 21:27:10 INFO spark.SparkContext: Added JAR file:/home/hduser/spark-java2.jar at http://192.168.11.42:48000/jars/spark-java2.jar with timestamp 1426598830307 15/03/17 21:27:10 INFO client.RMProxy: Connecting to ResourceManager at Server3/192.168.11.42:8050 15/03/17 21:27:11 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers 15/03/17 21:27:11 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container) 15/03/17 21:27:11 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 15/03/17 21:27:11 INFO yarn.Client: Setting up container launch context for our AM 15/03/17 21:27:11 INFO yarn.Client: Preparing resources for our AM container 15/03/17 21:27:12 INFO yarn.Client: Uploading resource file:/home/hduser/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar -
[jira] [Created] (SPARK-6397) Check the missingInput simply
Yadong Qi created SPARK-6397: Summary: Check the missingInput simply Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6396) Add timeout control for broadcast
[ https://issues.apache.org/jira/browse/SPARK-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366799#comment-14366799 ] Dale Richardson commented on SPARK-6396: If nobody else is looking at this one I'll have a look at it. Add timeout control for broadcast - Key: SPARK-6396 URL: https://issues.apache.org/jira/browse/SPARK-6396 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Affects Versions: 1.3.0, 1.3.1 Reporter: Jun Fang Priority: Minor TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, which calls Await.result(result.future, Duration.Inf). In a production environment this may cause a hang when the driver and executor are in different data centers. A timeout here would be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6397) Check the missingInput simply
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366918#comment-14366918 ] Apache Spark commented on SPARK-6397: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/5082 Check the missingInput simply - Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6195) Specialized in-memory column type for fixed-precision decimal
[ https://issues.apache.org/jira/browse/SPARK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6195: -- Summary: Specialized in-memory column type for fixed-precision decimal (was: Specialized in-memory column type for decimal) Specialized in-memory column type for fixed-precision decimal - Key: SPARK-6195 URL: https://issues.apache.org/jira/browse/SPARK-6195 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.3.0, 1.2.1 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.4.0 When building in-memory columnar representation, decimal values are currently serialized via a generic serializer, which is unnecessarily slow. Since some decimals (precision < 19) can be represented as annotated long values, we should add a specialized fixed-precision decimal column type to speed up in-memory decimal serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
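As a rough illustration of the "annotated long" representation the description refers to (plain java.math.BigDecimal, not Spark's actual column type):
{code}
import java.math.BigDecimal

// A decimal whose precision fits within a Long's roughly 19 digits can be
// stored as its unscaled long value plus a fixed scale, avoiding a generic
// serializer entirely.
val scale = 2
val original = new BigDecimal("12345.67")
val packed: Long = original.unscaledValue().longValue() // 1234567L; fits in a Long here
val restored = BigDecimal.valueOf(packed, scale)        // 12345.67 again
assert(restored == original)
{code}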
[jira] [Updated] (SPARK-6195) Specialized in-memory column type for decimal
[ https://issues.apache.org/jira/browse/SPARK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6195: -- Description: When building in-memory columnar representation, decimal values are currently serialized via a generic serializer, which is unnecessarily slow. Since some decimals (precision < 19) can be represented as annotated long values, we should add a specialized fixed-precision decimal column type to speed up in-memory decimal serialization. (was: When building in-memory columnar representation, decimal values are currently serialized via a generic serializer, which is unnecessarily slow. Since decimals are actually annotated long values, we should add a specialized decimal column type to speed up in-memory decimal serialization.) Specialized in-memory column type for decimal - Key: SPARK-6195 URL: https://issues.apache.org/jira/browse/SPARK-6195 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.3.0, 1.2.1 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.4.0 When building in-memory columnar representation, decimal values are currently serialized via a generic serializer, which is unnecessarily slow. Since some decimals (precision < 19) can be represented as annotated long values, we should add a specialized fixed-precision decimal column type to speed up in-memory decimal serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6396) Add timeout control for broadcast
[ https://issues.apache.org/jira/browse/SPARK-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366836#comment-14366836 ] Jun Fang commented on SPARK-6396: - I am working on it right now; I will send a pull request soon. Add timeout control for broadcast - Key: SPARK-6396 URL: https://issues.apache.org/jira/browse/SPARK-6396 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Affects Versions: 1.3.0, 1.3.1 Reporter: Jun Fang Priority: Minor TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, which calls Await.result(result.future, Duration.Inf). In a production environment this may cause a hang when the driver and executor are in different data centers. A timeout here would be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6397) Check the missingInput simply
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6397: - Priority: Minor (was: Major) There's no description and no real info in the title. JIRAs need to state the issue briefly but specifically. Check the missingInput simply - Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6396) Add timeout control for broadcast
[ https://issues.apache.org/jira/browse/SPARK-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366849#comment-14366849 ] Dale Richardson commented on SPARK-6396: No problems. Add timeout control for broadcast - Key: SPARK-6396 URL: https://issues.apache.org/jira/browse/SPARK-6396 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Affects Versions: 1.3.0, 1.3.1 Reporter: Jun Fang Priority: Minor TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, which calls Await.result(result.future, Duration.Inf). In a production environment this may cause a hang when the driver and executor are in different data centers. A timeout here would be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data
[ https://issues.apache.org/jira/browse/SPARK-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367028#comment-14367028 ] Travis Galoppo commented on SPARK-6398: --- Please assign to me. Improve utility of GaussianMixture for higher dimensional data - Key: SPARK-6398 URL: https://issues.apache.org/jira/browse/SPARK-6398 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Travis Galoppo The current EM implementation for GaussianMixture protects itself from numerical instability at the expense of utility in high dimensions. A few options exist for extending utility into higher dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6372) spark-submit --conf is not being propagated to child processes
[ https://issues.apache.org/jira/browse/SPARK-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6372: - Assignee: Marcelo Vanzin spark-submit --conf is not being propagated to child processes Key: SPARK-6372 URL: https://issues.apache.org/jira/browse/SPARK-6372 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.4.0 Thanks to [~irashid] for bringing this up. It seems that the new launcher library is incorrectly handling --conf and not passing it down to the child processes. Fix is simple, PR coming up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367089#comment-14367089 ] Abou Haydar Elias commented on SPARK-5874: -- The tokenizer, as of now, converts the input string to lowercase and then splits it by whitespace only. I suggest more flexibility for the Tokenizer pipeline stage, so we can eventually add stemming and text analysis directly into the Tokenizer. There are many post-tokenization steps that can be done, including (but not limited to):
- [Stemming|http://en.wikipedia.org/wiki/Stemming] – Replacing words with their stems. For instance with English stemming "bikes" is replaced with "bike"; now a query for "bike" can find both documents containing "bike" and those containing "bikes".
- Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some noise and actually improve search quality.
- [Text Normalization|http://en.wikipedia.org/wiki/Text_normalization] – Stripping accents and other character markings can make for better searching.
- Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
So what do you think? How to improve the current ML pipeline API? --- Key: SPARK-5874 URL: https://issues.apache.org/jira/browse/SPARK-5874 Project: Spark Issue Type: Brainstorming Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical I created this JIRA to collect feedbacks about the ML pipeline API we introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 with confidence, which requires valuable input from the community. I'll create sub-tasks for each major issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
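As a concrete, if toy, illustration of the post-tokenization steps listed in the comment above, here is a plain Scala sketch (not the Tokenizer/Transformer API itself); the stop-word set and the "stemmer" are placeholders:
{code}
// Toy versions of the steps listed above; not pipeline stages.
val stopWords = Set("the", "and", "a")                    // placeholder stop-word list
def stem(token: String): String = token.stripSuffix("s")  // placeholder stemmer: "bikes" -> "bike"

val text = "The bikes and the bike"
val tokens = text.toLowerCase.split("\\s+").toSeq         // roughly what Tokenizer does today
val processed = tokens
  .filterNot(stopWords.contains)                          // stop-word filtering
  .map(stem)                                              // stemming
// processed == Seq("bike", "bike")
{code}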
[jira] [Updated] (SPARK-6325) YarnAllocator crash with dynamic allocation on
[ https://issues.apache.org/jira/browse/SPARK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6325: - Assignee: Marcelo Vanzin YarnAllocator crash with dynamic allocation on -- Key: SPARK-6325 URL: https://issues.apache.org/jira/browse/SPARK-6325 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Critical Fix For: 1.4.0, 1.3.1 Run spark-shell like this:
{noformat}
spark-shell --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=10 \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=5 \
  --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5
{noformat}
Then run this simple test:
{code}
scala> val verySmallRdd = sc.parallelize(1 to 10, 10).map { i =>
     |   if (i % 2 == 0) { Thread.sleep(30 * 1000); i } else 0
     | }
verySmallRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:21
scala> verySmallRdd.collect()
{code}
When Spark starts ramping down the number of allocated executors, it will hit an assert in YarnAllocator.scala:
{code}
assert(targetNumExecutors >= 0, "Allocator killed more executors than are allocated!")
{code}
This assert will cause the akka backend to die, but not the AM itself. So the app will be in a zombie-like state, where the driver is alive but can't talk to the AM. Sadness ensues. I have a working fix, just need to add unit tests. Stay tuned. Thanks to [~wypoon] for finding the problem, and for the test case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6325) YarnAllocator crash with dynamic allocation on
[ https://issues.apache.org/jira/browse/SPARK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6325. -- Resolution: Fixed Fix Version/s: 1.3.1 1.4.0 Issue resolved by pull request 5018 [https://github.com/apache/spark/pull/5018] YarnAllocator crash with dynamic allocation on -- Key: SPARK-6325 URL: https://issues.apache.org/jira/browse/SPARK-6325 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Priority: Critical Fix For: 1.4.0, 1.3.1 Run spark-shell like this:
{noformat}
spark-shell --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=10 \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=5 \
  --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5
{noformat}
Then run this simple test:
{code}
scala> val verySmallRdd = sc.parallelize(1 to 10, 10).map { i =>
     |   if (i % 2 == 0) { Thread.sleep(30 * 1000); i } else 0
     | }
verySmallRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:21
scala> verySmallRdd.collect()
{code}
When Spark starts ramping down the number of allocated executors, it will hit an assert in YarnAllocator.scala:
{code}
assert(targetNumExecutors >= 0, "Allocator killed more executors than are allocated!")
{code}
This assert will cause the akka backend to die, but not the AM itself. So the app will be in a zombie-like state, where the driver is alive but can't talk to the AM. Sadness ensues. I have a working fix, just need to add unit tests. Stay tuned. Thanks to [~wypoon] for finding the problem, and for the test case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data
Travis Galoppo created SPARK-6398: - Summary: Improve utility of GaussianMixture for higher dimensional data Key: SPARK-6398 URL: https://issues.apache.org/jira/browse/SPARK-6398 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Travis Galoppo The current EM implementation for GaussianMixture protects itself from numerical instability at the expense of utility in high dimensions. A few options exist for extending utility into higher dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions
Marcelo Vanzin created SPARK-6399: - Summary: Code compiled against 1.3.0 may not run against older Spark versions Key: SPARK-6399 URL: https://issues.apache.org/jira/browse/SPARK-6399 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're easier to use. The problem is that scalac now generates code that will not run on older Spark versions if those conversions are used. Basically, even if you explicitly import {{SparkContext._}}, scalac will generate references to the new methods in the {{RDD}} object instead. So the compiled code will reference code that doesn't exist in older versions of Spark. You can work around this by explicitly calling the methods in the {{SparkContext}} object, although that's a little ugly. We should at least document this limitation (if there's no way to fix it), since I believe forwards compatibility in the API was also a goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
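A sketch of the workaround described above, assuming an existing SparkContext named sc and a pair RDD; explicitly invoking the conversion defined on the SparkContext companion object (which pre-1.3 releases also have) keeps the generated bytecode runnable against older versions, at the cost of some verbosity:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // still useful for other implicits

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Relying on the implicit lets 1.3's scalac bind to the new methods on the RDD
// object; calling the SparkContext object's conversion directly avoids that
// (it may be flagged as deprecated in 1.3, but it exists in older releases too).
val counts = SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
{code}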
[jira] [Commented] (SPARK-6096) Support model save/load in Python's naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367125#comment-14367125 ] Xusen Yin commented on SPARK-6096: -- [~mengxr] Pls assign it to me. Support model save/load in Python's naive Bayes --- Key: SPARK-6096 URL: https://issues.apache.org/jira/browse/SPARK-6096 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6286. -- Resolution: Fixed Fix Version/s: 1.3.1 1.4.0 Assignee: Iulian Dragos Resolved by https://github.com/apache/spark/pull/5000 Handle TASK_ERROR in TaskState -- Key: SPARK-6286 URL: https://issues.apache.org/jira/browse/SPARK-6286 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Iulian Dragos Assignee: Iulian Dragos Priority: Minor Labels: mesos Fix For: 1.4.0, 1.3.1 Scala warning: {code} match may not be exhaustive. It would fail on the following input: TASK_ERROR {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
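A generic illustration of the kind of change involved, using a self-contained sealed hierarchy rather than the real Mesos TaskState enum; the point is simply that giving the previously unmatched state an explicit case removes the exhaustiveness warning:
{code}
// Stand-ins for the Mesos task states; not Spark's or Mesos' actual types.
sealed trait MesosLikeState
case object TaskRunning extends MesosLikeState
case object TaskFinished extends MesosLikeState
case object TaskFailed extends MesosLikeState
case object TaskError extends MesosLikeState // the value the old match did not cover

def toSparkState(state: MesosLikeState): String = state match {
  case TaskRunning  => "RUNNING"
  case TaskFinished => "FINISHED"
  // Covering TaskError (here, treated like a failure) is what silences
  // "match may not be exhaustive".
  case TaskFailed | TaskError => "FAILED"
}
{code}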
[jira] [Resolved] (SPARK-6372) spark-submit --conf is not being propagated to child processes
[ https://issues.apache.org/jira/browse/SPARK-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6372. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5057 [https://github.com/apache/spark/pull/5057] spark-submit --conf is not being propagated to child processes Key: SPARK-6372 URL: https://issues.apache.org/jira/browse/SPARK-6372 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Blocker Fix For: 1.4.0 Thanks to [~irashid] for bringing this up. It seems that the new launcher library is incorrectly handling --conf and not passing it down to the child processes. Fix is simple, PR coming up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6389) YARN app diagnostics report doesn't report NPEs
[ https://issues.apache.org/jira/browse/SPARK-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6389. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Steve Loughran Resolved by https://github.com/apache/spark/pull/5070 YARN app diagnostics report doesn't report NPEs --- Key: SPARK-6389 URL: https://issues.apache.org/jira/browse/SPARK-6389 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Trivial Fix For: 1.4.0 {{ApplicationMaster.run()}} catches exceptions and calls {{toMessage()}} to get their message included in the YARN diagnostics report visible in the RM UI. Except, NPEs don't have a message —if one is raised their report becomes {{Uncaught exception: null}}, which isn't that useful. The full text stack trace is logged correctly in the AM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
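An illustrative helper only (the real change lives in ApplicationMaster): fall back to the exception's class name and stack trace when getMessage is null, so an NPE produces something more useful than "Uncaught exception: null" in the diagnostics report.
{code}
import java.io.{PrintWriter, StringWriter}

def diagnosticsMessage(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw))
  // getMessage is null for a bare NPE; the class name plus the trace never is.
  Option(e.getMessage).getOrElse(e.getClass.getName) + "\n" + sw.toString
}

// diagnosticsMessage(new NullPointerException()) starts with
// "java.lang.NullPointerException" followed by the full stack trace.
{code}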
[jira] [Resolved] (SPARK-4416) Support Mesos framework authentication
[ https://issues.apache.org/jira/browse/SPARK-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4416. -- Resolution: Duplicate Support Mesos framework authentication -- Key: SPARK-4416 URL: https://issues.apache.org/jira/browse/SPARK-4416 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support Mesos framework authentication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367309#comment-14367309 ] Xiangrui Meng commented on SPARK-5874: -- [~Elie A.] Thanks for your feedback! This JIRA is to discuss the pipeline API but not specific components. For text preprocessing, we will certainly add standard stemmer and stop words filter as transformers. There is also a RegexTokenizer in review: https://github.com/apache/spark/pull/4504 How to improve the current ML pipeline API? --- Key: SPARK-5874 URL: https://issues.apache.org/jira/browse/SPARK-5874 Project: Spark Issue Type: Brainstorming Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical I created this JIRA to collect feedbacks about the ML pipeline API we introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 with confidence, which requires valuable input from the community. I'll create sub-tasks for each major issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6400) It would be great if you could share your test jars in Maven central repository for the Spark SQL module
Óscar Puertas created SPARK-6400: Summary: It would be great if you could share your test jars in Maven central repository for the Spark SQL module Key: SPARK-6400 URL: https://issues.apache.org/jira/browse/SPARK-6400 Project: Spark Issue Type: Wish Components: SQL, Tests Affects Versions: 1.3.0 Reporter: Óscar Puertas It would be great if you could share your test jars in Maven central repository for the Spark SQL module -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
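For reference, consuming such a published test artifact from an sbt build would look roughly like the line below; the coordinates are only illustrative, and the relevant piece is the "tests" classifier that Maven-published test jars carry.
{code}
// Hypothetical build.sbt dependency on a published Spark SQL test-jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0" % "test" classifier "tests"
{code}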
[jira] [Updated] (SPARK-6096) Support model save/load in Python's naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6096: - Assignee: Xusen Yin Support model save/load in Python's naive Bayes --- Key: SPARK-6096 URL: https://issues.apache.org/jira/browse/SPARK-6096 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6096) Support model save/load in Python's naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367280#comment-14367280 ] Xiangrui Meng commented on SPARK-6096: -- Done. Support model save/load in Python's naive Bayes --- Key: SPARK-6096 URL: https://issues.apache.org/jira/browse/SPARK-6096 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6401) Unable to load an old API input format in Spark streaming
Rémy DUBOIS created SPARK-6401: -- Summary: Unable to load an old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Reporter: Rémy DUBOIS The fileStream method of the JavaStreamingContext class does not allow using an old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6402) EC2 script and job scheduling documentation still refer to Shark
Pierre Borckmans created SPARK-6402: --- Summary: EC2 script and job scheduling documentation still refer to Shark Key: SPARK-6402 URL: https://issues.apache.org/jira/browse/SPARK-6402 Project: Spark Issue Type: Documentation Reporter: Pierre Borckmans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6402) EC2 script and job scheduling documentation still refer to Shark
[ https://issues.apache.org/jira/browse/SPARK-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367440#comment-14367440 ] Apache Spark commented on SPARK-6402: - User 'pierre-borckmans' has created a pull request for this issue: https://github.com/apache/spark/pull/5083 EC2 script and job scheduling documentation still refer to Shark Key: SPARK-6402 URL: https://issues.apache.org/jira/browse/SPARK-6402 Project: Spark Issue Type: Documentation Reporter: Pierre Borckmans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6397) Check the missingInput simply
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367437#comment-14367437 ] Santiago M. Mola commented on SPARK-6397: - I think a proper title would be: "Override QueryPlan.missingInput when necessary and rely on it in CheckAnalysis". And description: "Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis." (I'm about to create a pull request that fixes this problem in some more places.) Check the missingInput simply - Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm
[ https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6374. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5058 [https://github.com/apache/spark/pull/5058] Add getter for GeneralizedLinearAlgorithm - Key: SPARK-6374 URL: https://issues.apache.org/jira/browse/SPARK-6374 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the values through debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
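A minimal sketch of what such accessors could look like; the field names mirror the JIRA description, and the real class in org.apache.spark.mllib.regression may keep them differently:
{code}
// Sketch only: expose read-only views of internal training settings so callers
// don't have to inspect them in a debugger.
abstract class GeneralizedLinearAlgorithmSketch {
  protected var numFeatures: Int = -1          // assumed internal field
  protected var addIntercept: Boolean = false  // assumed internal field

  /** Number of features the algorithm was (or will be) trained with. */
  def getNumFeatures: Int = numFeatures

  /** Whether an intercept term is added during training. */
  def isAddIntercept: Boolean = addIntercept
}
{code}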
[jira] [Commented] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on
[ https://issues.apache.org/jira/browse/SPARK-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367863#comment-14367863 ] Thomas Graves commented on SPARK-3632: -- [~andrewor14] At this point doesn't seem like we need to port back to branch1, you ok if we close this? ConnectionManager can run out of receive threads with authentication on --- Key: SPARK-3632 URL: https://issues.apache.org/jira/browse/SPARK-3632 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Labels: backport-needed Fix For: 1.2.0 If you turn authentication on and you are using a lot of executors. There is a chance that all the of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6403) Launch master as spot instance on EC2
Adam Vogel created SPARK-6403: - Summary: Launch master as spot instance on EC2 Key: SPARK-6403 URL: https://issues.apache.org/jira/browse/SPARK-6403 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 1.2.1 Reporter: Adam Vogel Priority: Minor Fix For: 1.2.1 Currently the spark_ec2.py script only supports requesting slaves as spot instances. Launching the master as a spot instance has potential cost savings, at the risk of losing the Spark cluster without warning. Unless users include logic for relaunching slaves when lost, it is usually the case that all slaves are lost simultaneously. Thus, for jobs which do not require resilience to losing spot instances, being able to launch the master as a spot instance saves money. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367701#comment-14367701 ] Manoj Kumar commented on SPARK-6192: Thanks for your feedback. I've fixed it up (same link) adding an Importance section, denoting the importance of the project. Let me know if there is anything else to be done. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5078) Allow setting Akka host name from env vars
[ https://issues.apache.org/jira/browse/SPARK-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367704#comment-14367704 ] Timothy St. Clair commented on SPARK-5078: -- Cross-listing details of the issue here, for posterity: https://groups.google.com/forum/#!topic/akka-user/9RQdf2NjciE plus mattf's example https://github.com/mattf/docker-spark with a fix for k8s. Allow setting Akka host name from env vars -- Key: SPARK-5078 URL: https://issues.apache.org/jira/browse/SPARK-5078 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.3.0, 1.2.1 Currently Spark lets you set the IP address using SPARK_LOCAL_IP, but then this is given to akka after doing a reverse DNS lookup. This makes it difficult to run Spark in Docker. You can already change the hostname that is used programmatically, but it would be nice to be able to do this with an environment variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6395) Rebuild the schema from a GenericRow
Chen Song created SPARK-6395: Summary: Rebuild the schema from a GenericRow Key: SPARK-6395 URL: https://issues.apache.org/jira/browse/SPARK-6395 Project: Spark Issue Type: Task Components: SQL Reporter: Chen Song Sometimes we need the schema of the row, but GenericRow doesn't contain schema information. So we need a method such as val schema = ScalaReflection.rebuildSchema(row) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.
Yifan Wang created SPARK-6404: - Summary: Call broadcast() in each interval for spark streaming programs. Key: SPARK-6404 URL: https://issues.apache.org/jira/browse/SPARK-6404 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Yifan Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6406) Launcher backward compatibility issues
[ https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368380#comment-14368380 ] Apache Spark commented on SPARK-6406: - User 'nishkamravi2' has created a pull request for this issue: https://github.com/apache/spark/pull/5085 Launcher backward compatibility issues -- Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi Priority: Blocker The new launcher library breaks backward compatibility. hadoop string in the spark assembly should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6152) Spark does not support Java 8 compiled Scala classes
[ https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368138#comment-14368138 ] Jonathan Neufeld edited comment on SPARK-6152 at 3/18/15 11:31 PM: --- The exception is raised in com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader. It looks like it's this line: if (this.readShort(6) > 51) { throw new IllegalArgumentException(); I checked the result of readShort and found it to be 52 in my environment. It bears the mark of a compiled class file version compatibility problem to me. was (Author: madmartian): The exception is raised in com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader. It looks like it's this line: if (this.readShort(6) > 51) { throw new IllegalArgumentException(); Which bears the mark of a compiled class file version compatibility problem to me. Spark does not support Java 8 compiled Scala classes Key: SPARK-6152 URL: https://issues.apache.org/jira/browse/SPARK-6152 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Java 8+ Scala 2.11 Reporter: Ronald Chen Priority: Minor Spark uses reflectasm to check Scala closures which fails if the *user defined Scala closures* are compiled to Java 8 class version The cause is reflectasm does not support Java 8 https://github.com/EsotericSoftware/reflectasm/issues/35 Workaround: Don't compile Scala classes to Java 8, Scala 2.11 does not support nor require any Java 8 features Stack trace: {code} java.lang.IllegalArgumentException at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41) at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107) at org.apache.spark.SparkContext.clean(SparkContext.scala:1478) at org.apache.spark.rdd.RDD.map(RDD.scala:288) at ...my Scala 2.11 compiled to Java 8 code calling into spark {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6405) Spark Kryo buffer should be forced to be max. 2GB
Matt Cheah created SPARK-6405: - Summary: Spark Kryo buffer should be forced to be max. 2GB Key: SPARK-6405 URL: https://issues.apache.org/jira/browse/SPARK-6405 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Matt Cheah Fix For: 1.4.0 Kryo buffers used in serialization are backed by Java byte arrays, which have a maximum size of 2GB. However, we blindly set the size without worrying about numeric overflow or regard for the maximum array size. We should enforce the maximum buffer size to be 2GB and warn the user when they have exceeded that amount. I'm open to the idea of flat-out failing the initialization of the Spark Context if the buffer size is over 2GB, but I'm afraid that could break backwards-compatibility... although one can argue that the user had incorrect buffer sizes in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
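A rough sketch of the kind of guard being proposed, written against the 1.3-era spark.kryoserializer.buffer.max.mb setting; where exactly the check would live, and whether it should throw or merely warn, is an open question in the issue itself:
{code}
import org.apache.spark.SparkConf

// Illustrative guard only. Java arrays top out at roughly 2 GB, so a Kryo max
// buffer of 2048 MB or more can never be honoured and should fail fast.
val conf = new SparkConf()
val maxBufferMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
if (maxBufferMb >= 2048) {
  throw new IllegalArgumentException(
    s"spark.kryoserializer.buffer.max.mb ($maxBufferMb) must be less than 2048 " +
      "because Kryo buffers are backed by a single Java byte array")
}
{code}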
[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes
[ https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368242#comment-14368242 ] Sean Owen commented on SPARK-6152: -- To your deleted comment -- yes indeed it looks like the library explicitly does not work with Java 8 since class file version 52 == Java 8 Spark does not support Java 8 compiled Scala classes Key: SPARK-6152 URL: https://issues.apache.org/jira/browse/SPARK-6152 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Java 8+ Scala 2.11 Reporter: Ronald Chen Priority: Minor Spark uses reflectasm to check Scala closures which fails if the *user defined Scala closures* are compiled to Java 8 class version The cause is reflectasm does not support Java 8 https://github.com/EsotericSoftware/reflectasm/issues/35 Workaround: Don't compile Scala classes to Java 8, Scala 2.11 does not support nor require any Java 8 features Stack trace: {code} java.lang.IllegalArgumentException at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41) at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107) at org.apache.spark.SparkContext.clean(SparkContext.scala:1478) at org.apache.spark.rdd.RDD.map(RDD.scala:288) at ...my Scala 2.11 compiled to Java 8 code calling into spark {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.
[ https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368283#comment-14368283 ] Saisai Shao commented on SPARK-6404: Hi [~heavens...@gmail.com], I think with the current broadcast you cannot update data that has already been broadcasted. If you want to update the data in each interval, I'm not sure if the below snippet works for you:
{code}
stream.foreachRDD { rdd =>
  val foo = sc.broadcast(Foo)
  rdd.foreach {
    foo.value.xxx
  }
}
{code}
Call broadcast() in each interval for spark streaming programs. --- Key: SPARK-6404 URL: https://issues.apache.org/jira/browse/SPARK-6404 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Yifan Wang If I understand it correctly, Spark’s broadcast() function will be called only once at the beginning of the batch. For streaming applications that need to run 24/7, it is often necessary to update variables that are shared by broadcast() dynamically. It would be ideal if broadcast() could be called at the beginning of each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6397: - Description: Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. Summary: Override QueryPlan.missingInput when necessary and rely on CheckAnalysis (was: Check the missingInput simply) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6401) Unable to load a old API input format in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368241#comment-14368241 ] Sean Owen commented on SPARK-6401: -- Yeah it would be more consistent. I suppose I'd be interested to hear whether others think it's worth continuing to add support for it or not though. It's pretty easy to port or carry a parallel version that uses the new InputFormat, right? I think you can even make an adapter. Unable to load a old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Reporter: Rémy DUBOIS Priority: Minor The fileStream method of the JavaStreamingContext class does not allow using a old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
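For comparison, this is roughly what the existing streaming API expects today: fileStream is parameterized on the new (org.apache.hadoop.mapreduce) InputFormat API, so an old-API (org.apache.hadoop.mapred) format would need a wrapper or a ported equivalent. The directory path below is a placeholder.
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))   // assumes an existing SparkContext `sc`

// only new-API InputFormats satisfy fileStream's type bound today
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///path/to/incoming")
  .map { case (_, text) => text.toString }
{code}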
[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes
[ https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368138#comment-14368138 ] Jonathan Neufeld commented on SPARK-6152: - The exception is raised in com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader. It looks like it's this line: if (this.readShort(6) > 51) { throw new IllegalArgumentException(); which bears the mark of a compiled class file version compatibility problem to me. Spark does not support Java 8 compiled Scala classes Key: SPARK-6152 URL: https://issues.apache.org/jira/browse/SPARK-6152 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Java 8+ Scala 2.11 Reporter: Ronald Chen Priority: Minor Spark uses reflectasm to check Scala closures which fails if the *user defined Scala closures* are compiled to Java 8 class version The cause is reflectasm does not support Java 8 https://github.com/EsotericSoftware/reflectasm/issues/35 Workaround: Don't compile Scala classes to Java 8, Scala 2.11 does not support nor require any Java 8 features Stack trace: {code} java.lang.IllegalArgumentException at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41) at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107) at org.apache.spark.SparkContext.clean(SparkContext.scala:1478) at org.apache.spark.rdd.RDD.map(RDD.scala:288) at ...my Scala 2.11 compiled to Java 8 code calling into spark {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
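To double-check which class file version is being rejected, the major version can be read directly from a .class file. This is a generic sketch, not part of Spark or reflectasm; major version 52 is Java 8, 51 is Java 7, 50 is Java 6.
{code}
import java.io.{DataInputStream, FileInputStream}

def classFileMajorVersion(path: String): Int = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    in.readInt()            // magic number 0xCAFEBABE
    in.readUnsignedShort()  // minor version
    in.readUnsignedShort()  // major version (the value readShort(6) sees)
  } finally {
    in.close()
  }
}
{code}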
[jira] [Commented] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.
[ https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368383#comment-14368383 ] Yifan Wang commented on SPARK-6404: --- I got an error. Is that expected? {code} Traceback (most recent call last): File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/streaming/util.py, line 90, in dumps return bytearray(self.serializer.dumps((func.func, func.deserializers))) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/serializers.py, line 405, in dumps return cloudpickle.dumps(obj, 2) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 816, in dumps cp.dump(obj) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 133, in dump return pickle.Pickler.dump(self, obj) File /usr/lib/python2.6/pickle.py, line 224, in dump self.save(obj) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 249, in save_function self.save_function_tuple(obj, modList) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 304, in save_function_tuple save((code, closure, base_globals)) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 600, in save_list self._batch_appends(iter(obj)) File /usr/lib/python2.6/pickle.py, line 636, in _batch_appends save(tmp[0]) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 249, in save_function self.save_function_tuple(obj, modList) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 309, in save_function_tuple save(f_globals) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 174, in save_dict pickle.Pickler.save_dict(self, obj) File /usr/lib/python2.6/pickle.py, line 649, in save_dict self._batch_setitems(obj.iteritems()) File /usr/lib/python2.6/pickle.py, line 681, in _batch_setitems save(v) File /usr/lib/python2.6/pickle.py, line 306, in save rv = reduce(self.proto) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/context.py, line 236, in __getnewargs__ It appears that you are attempting to reference SparkContext from a broadcast Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063. {code} Call broadcast() in each interval for spark streaming programs. 
--- Key: SPARK-6404 URL: https://issues.apache.org/jira/browse/SPARK-6404 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Yifan Wang If I understand it correctly, Spark’s broadcast() function will be called only once at the beginning of the batch. For streaming applications that need to run 24/7, it is often necessary to update variables that are shared by broadcast() dynamically. It would be ideal if broadcast() could be called at the beginning of each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
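The traceback above is the SPARK-5063 guard firing: the broadcast call (and with it the SparkContext) was captured inside a function that gets serialized and shipped to workers. A rough Scala sketch of the distinction, with loadFoo() and process() again standing in as hypothetical placeholders:
{code}
// Problematic: sc captured inside a transformation closure that is serialized for executors
// stream.map { record => val b = sc.broadcast(loadFoo()); process(b.value, record) }

// Works: broadcast on the driver, in foreachRDD's body, before the executor-side closure
stream.foreachRDD { rdd =>
  val shared = rdd.sparkContext.broadcast(loadFoo())
  rdd.foreach(record => process(shared.value, record))
}
{code}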
[jira] [Comment Edited] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.
[ https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368383#comment-14368383 ] Yifan Wang edited comment on SPARK-6404 at 3/19/15 2:28 AM: I got an error. Is that expected? It seems I can't call broadcast() in the worker node. {code} Traceback (most recent call last): File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/streaming/util.py, line 90, in dumps return bytearray(self.serializer.dumps((func.func, func.deserializers))) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/serializers.py, line 405, in dumps return cloudpickle.dumps(obj, 2) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 816, in dumps cp.dump(obj) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 133, in dump return pickle.Pickler.dump(self, obj) File /usr/lib/python2.6/pickle.py, line 224, in dump self.save(obj) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 249, in save_function self.save_function_tuple(obj, modList) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 304, in save_function_tuple save((code, closure, base_globals)) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 600, in save_list self._batch_appends(iter(obj)) File /usr/lib/python2.6/pickle.py, line 636, in _batch_appends save(tmp[0]) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 249, in save_function self.save_function_tuple(obj, modList) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 309, in save_function_tuple save(f_globals) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 174, in save_dict pickle.Pickler.save_dict(self, obj) File /usr/lib/python2.6/pickle.py, line 649, in save_dict self._batch_setitems(obj.iteritems()) File /usr/lib/python2.6/pickle.py, line 681, in _batch_setitems save(v) File /usr/lib/python2.6/pickle.py, line 306, in save rv = reduce(self.proto) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/context.py, line 236, in __getnewargs__ It appears that you are attempting to reference SparkContext from a broadcast Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063. 
{code} was (Author: heavens...@gmail.com): I got an error. Is that expected? {code} Traceback (most recent call last): File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/streaming/util.py, line 90, in dumps return bytearray(self.serializer.dumps((func.func, func.deserializers))) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/serializers.py, line 405, in dumps return cloudpickle.dumps(obj, 2) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 816, in dumps cp.dump(obj) File /nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py, line 133, in dump return pickle.Pickler.dump(self, obj) File /usr/lib/python2.6/pickle.py, line 224, in dump self.save(obj) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File /usr/lib/python2.6/pickle.py, line 548, in save_tuple save(element) File /usr/lib/python2.6/pickle.py, line 286, in save f(self, obj) # Call unbound method with explicit self File
[jira] [Created] (SPARK-6406) Launcher backward compatibility issues
Nishkam Ravi created SPARK-6406: --- Summary: Launcher backward compatibility issues Key: SPARK-6406 URL: https://issues.apache.org/jira/browse/SPARK-6406 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi Priority: Blocker The new launcher library breaks backward compatibility: the "hadoop" string in the Spark assembly jar's name should not be mandatory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6394. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5043 [https://github.com/apache/spark/pull/5043] cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler - Key: SPARK-6394 URL: https://issues.apache.org/jira/browse/SPARK-6394 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Wenchen Fan Priority: Minor Fix For: 1.4.0 The current implementation of getCacheLocs includes searching a HashMap many times; we can avoid this. BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove it. Also, we can combine BlockManager.blockIdsToHosts and blockIdsToBlockManagers into a single method in order to remove some unnecessary layers of indirection / collection creation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
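A minimal sketch of the HashMap point, written outside the real DAGScheduler: compute and cache the entry in a single lookup instead of probing the map repeatedly. The location-lookup body is a placeholder.
{code}
import scala.collection.mutable

val cacheLocs = new mutable.HashMap[Int, IndexedSeq[Seq[String]]]

def getCacheLocs(rddId: Int, numPartitions: Int): IndexedSeq[Seq[String]] =
  cacheLocs.getOrElseUpdate(rddId, {
    // placeholder: ask the block manager master for locations of rdd_<rddId>_<partition>
    IndexedSeq.fill(numPartitions)(Seq.empty[String])
  })
{code}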
[jira] [Updated] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6394: -- Assignee: Wenchen Fan cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler - Key: SPARK-6394 URL: https://issues.apache.org/jira/browse/SPARK-6394 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Wenchen Fan Assignee: Wenchen Fan Priority: Minor Fix For: 1.4.0 The current implementation of getCacheLocs includes searching a HashMap many times; we can avoid this. BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove it. Also, we can combine BlockManager.blockIdsToHosts and blockIdsToBlockManagers into a single method in order to remove some unnecessary layers of indirection / collection creation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6152) Spark does not support Java 8 compiled Scala classes
[ https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Neufeld updated SPARK-6152: Comment: was deleted (was: The exception is raised in com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader It looks like it's this line: if(this.readShort(6) 51) { throw new IllegalArgumentException(); I checked the result of readShort and found it to be 52 in my environment. It bears the mark of a compiled class file version compatibility problem to me.) Spark does not support Java 8 compiled Scala classes Key: SPARK-6152 URL: https://issues.apache.org/jira/browse/SPARK-6152 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Java 8+ Scala 2.11 Reporter: Ronald Chen Priority: Minor Spark uses reflectasm to check Scala closures which fails if the *user defined Scala closures* are compiled to Java 8 class version The cause is reflectasm does not support Java 8 https://github.com/EsotericSoftware/reflectasm/issues/35 Workaround: Don't compile Scala classes to Java 8, Scala 2.11 does not support nor require any Java 8 features Stack trace: {code} java.lang.IllegalArgumentException at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown Source) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41) at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107) at org.apache.spark.SparkContext.clean(SparkContext.scala:1478) at org.apache.spark.rdd.RDD.map(RDD.scala:288) at ...my Scala 2.11 compiled to Java 8 code calling into spark {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6146) Support more datatype in SqlParser
[ https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6146: Target Version/s: 1.3.1 (was: 1.4.0) Support more datatype in SqlParser -- Key: SPARK-6146 URL: https://issues.apache.org/jira/browse/SPARK-6146 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, I cannot do {code} df.selectExpr("cast(a as bigint)") {code} because only the following data types are supported in SqlParser {code} protected lazy val dataType: Parser[DataType] = ( "STRING" ^^^ StringType | "TIMESTAMP" ^^^ TimestampType | "DOUBLE" ^^^ DoubleType | fixedDecimalType | "DECIMAL" ^^^ DecimalType.Unlimited | "DATE" ^^^ DateType | "INT" ^^^ IntegerType ) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
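A hypothetical extension of the combinator quoted above (not necessarily what the eventual patch does) that would let expressions like cast(a as bigint) parse; the added alternatives reuse existing Catalyst types and assume the surrounding SqlParser machinery.
{code}
protected lazy val dataType: Parser[DataType] =
  ( "STRING" ^^^ StringType
  | "TIMESTAMP" ^^^ TimestampType
  | "DOUBLE" ^^^ DoubleType
  | fixedDecimalType
  | "DECIMAL" ^^^ DecimalType.Unlimited
  | "DATE" ^^^ DateType
  | "INT" ^^^ IntegerType
  | "BIGINT" ^^^ LongType       // added
  | "SMALLINT" ^^^ ShortType    // added
  | "TINYINT" ^^^ ByteType      // added
  | "FLOAT" ^^^ FloatType       // added
  | "BOOLEAN" ^^^ BooleanType   // added
  )
{code}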
[jira] [Updated] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
[ https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5911: Target Version/s: 1.3.1 (was: 1.4.0) Make Column.cast(to: String) support fixed precision and scale decimal type --- Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
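Assuming the change lands, usage would look roughly like this; the column names and data are made up for illustration.
{code}
val df = sqlContext.createDataFrame(Seq((1, "12.345"), (2, "67.891"))).toDF("id", "amount")

// cast via a type string carrying fixed precision and scale
val withDecimal = df.select(df("id"), df("amount").cast("decimal(10,2)").as("amount_dec"))
{code}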
[jira] [Commented] (SPARK-6146) Support more datatype in SqlParser
[ https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368204#comment-14368204 ] Apache Spark commented on SPARK-6146: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5078 Support more datatype in SqlParser -- Key: SPARK-6146 URL: https://issues.apache.org/jira/browse/SPARK-6146 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, I cannot do {code} df.selectExpr("cast(a as bigint)") {code} because only the following data types are supported in SqlParser {code} protected lazy val dataType: Parser[DataType] = ( "STRING" ^^^ StringType | "TIMESTAMP" ^^^ TimestampType | "DOUBLE" ^^^ DoubleType | fixedDecimalType | "DECIMAL" ^^^ DecimalType.Unlimited | "DATE" ^^^ DateType | "INT" ^^^ IntegerType ) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
[ https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368205#comment-14368205 ] Apache Spark commented on SPARK-5911: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5078 Make Column.cast(to: String) support fixed precision and scale decimal type --- Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368505#comment-14368505 ] Yin Huai edited comment on SPARK-5508 at 3/19/15 5:00 AM: -- Seems the root cause of this problem is the array element schema name. For Hive's Parquet Serde, array element's name is array_element ([see here |https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L104]). While, for us, we use array ([see here| https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L68] and we are following [parquet avro| https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L131]). was (Author: yhuai): Seems the root cause of this problem is the array element schema name. For Hive's Parquet Serde, array element's name is array_element (see https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L104). While, for us, we use array (see https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L68 and we are following parquet avro https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L131). Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API --- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Labels: hivecontext, parquet When the table is saved as parquet, we cannot query a field which is an array of struct after an INSERT statement, like show bellow: {noformat} scala val data1={ | timestamp: 1422435598, | data_array: [ | { | field1: 1, | field2: 2 | } | ] | } scala val data2={ | timestamp: 1422435598, | data_array: [ | { | field1: 3, | field2: 4 | } | ] scala val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) scala val rdd = hiveContext.jsonRDD(jsonRDD) scala rdd.printSchema root |-- data_array: array (nullable = true) ||-- element: struct (containsNull = false) |||-- field1: integer (nullable = true) |||-- field2: integer (nullable = true) |-- timestamp: integer (nullable = true) scala rdd.registerTempTable(tmp_table) scala hiveContext.sql(select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data).collect res3: Array[org.apache.spark.sql.Row] = Array([1], [3]) scala hiveContext.sql(SET hive.exec.dynamic.partition = true) scala hiveContext.sql(SET hive.exec.dynamic.partition.mode = nonstrict) scala hiveContext.sql(set parquet.compression=GZIP) scala hiveContext.setConf(spark.sql.parquet.binaryAsString, true) scala hiveContext.sql(create external table if not exists persisted_table(data_array ARRAY STRUCTfield1: INT, field2: INT, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table') scala hiveContext.sql(insert into table persisted_table select * from tmp_table).collect scala hiveContext.sql(select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data).collect parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file 
hdfs://*/test_table/part-1 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
[jira] [Commented] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368493#comment-14368493 ] Yin Huai commented on SPARK-5508: - I tried the following snippet in sparkShell {code} import sqlContext._ jsonRDD(sc.parallelize({a:[1,2,3,4]} :: Nil)).registerTempTable(jt) sql(set spark.sql.parquet.useDataSourceApi=false) sql(create table jt_hive_parquet stored as parquet as select * from jt) sql(set spark.sql.parquet.useDataSourceApi=true) table(jt_hive_parquet).show {code} Seems the array value is null. A workaround of this is set the following and then use HiveTableScan to read the data. {code} sql(set spark.sql.parquet.useDataSourceApi=false) sql(set spark.sql.hive.convertMetastoreParquet=false) {code} The array will be correctly read. The metadata dump and file dump of the file is {code} creator: parquet-mr version 1.6.0rc3 (build d4d5a07ec9bd262ca1e93c309f1d7d4a74ebda4c) file schema: hive_schema a: OPTIONAL F:1 .bag:REPEATED F:1 ..array_element: OPTIONAL INT64 R:1 D:3 row group 1: RC:1 TS:86 OFFSET:4 a: .bag: ..array_element: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:RLE,PLAIN {code} {code} row group 0 a: .bag: ..array_element: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:RLE,PLAIN a.bag.array_element TV=4 RL=1 DL=3 page 0: DLE:RLE RLE:RLE VLE:PLAIN SZ:45 VC:4 INT64 a.bag.array_element *** row group 1 of 1, values 1 to 4 *** value 1: R:0 D:3 V:1 value 2: R:1 D:3 V:2 value 3: R:1 D:3 V:3 value 4: R:1 D:3 V:4 {code} The metadata dump and file dump of a data source parquet file (generated through Parquet2Relation2) is {code} creator: parquet-mr version 1.6.0rc3 (build d4d5a07ec9bd262ca1e93c309f1d7d4a74ebda4c) extra: org.apache.spark.sql.parquet.row.metadata = {type:struct,fields:[{name:a,type:{type:array,elementType:long,containsNull:true},nullable:true,metadata:{}}]} file schema: root a: OPTIONAL F:1 .bag:REPEATED F:1 ..array: OPTIONAL INT64 R:1 D:3 row group 1: RC:1 TS:86 OFFSET:4 a: .bag: ..array: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:PLAIN,RLE {code} {code} row group 0 a: .bag: ..array: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:PLAIN,RLE a.bag.array TV=4 RL=1 DL=3 page 0: DLE:RLE RLE:RLE VLE:PLAIN SZ:45 VC:4 INT64 a.bag.array *** row group 1 of 1, values 1 to 4 *** value 1: R:0 D:3 V:1 value 2: R:1 D:3 V:2 value 3: R:1 D:3 V:3 value 4: R:1 D:3 V:4 {code} Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API --- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Labels: hivecontext, parquet When the table is saved as parquet, we cannot query a field which is an array of struct after an INSERT statement, like show bellow: {noformat} scala val data1={ | timestamp: 1422435598, | data_array: [ | { | field1: 1, | field2: 2 | } | ] | } scala val data2={ | timestamp: 1422435598, | data_array: [ | { | field1: 3, | field2: 4 | } | ] scala val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) scala val rdd = hiveContext.jsonRDD(jsonRDD) scala rdd.printSchema root |-- data_array: array (nullable = true) ||-- element: struct (containsNull = false) |||-- field1: integer (nullable = true) |||-- field2: integer (nullable = true) |-- timestamp: integer (nullable = true) scala rdd.registerTempTable(tmp_table)
[jira] [Updated] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-6407: Description: Like MLLib's ALS implementation for recommendation, and applying to streaming. Similar to streaming linear regression, logistic regression, could we apply gradient updates to batches of data and reuse existing MLLib implementation? was:Like MLLib's ALS implementation for recommendation, and applying to streaming Streaming ALS for Collaborative Filtering - Key: SPARK-6407 URL: https://issues.apache.org/jira/browse/SPARK-6407 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Felix Cheung Priority: Minor Like MLLib's ALS implementation for recommendation, and applying to streaming. Similar to streaming linear regression, logistic regression, could we apply gradient updates to batches of data and reuse existing MLLib implementation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368505#comment-14368505 ] Yin Huai commented on SPARK-5508: - Seems the root cause of this problem is the array element schema name. For Hive's Parquet Serde, array element's name is array_element (see https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L104). While, for us, we use array (see https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L68 and we are following parquet avro https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L131). Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API --- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Labels: hivecontext, parquet When the table is saved as parquet, we cannot query a field which is an array of struct after an INSERT statement, like show bellow: {noformat} scala val data1={ | timestamp: 1422435598, | data_array: [ | { | field1: 1, | field2: 2 | } | ] | } scala val data2={ | timestamp: 1422435598, | data_array: [ | { | field1: 3, | field2: 4 | } | ] scala val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) scala val rdd = hiveContext.jsonRDD(jsonRDD) scala rdd.printSchema root |-- data_array: array (nullable = true) ||-- element: struct (containsNull = false) |||-- field1: integer (nullable = true) |||-- field2: integer (nullable = true) |-- timestamp: integer (nullable = true) scala rdd.registerTempTable(tmp_table) scala hiveContext.sql(select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data).collect res3: Array[org.apache.spark.sql.Row] = Array([1], [3]) scala hiveContext.sql(SET hive.exec.dynamic.partition = true) scala hiveContext.sql(SET hive.exec.dynamic.partition.mode = nonstrict) scala hiveContext.sql(set parquet.compression=GZIP) scala hiveContext.setConf(spark.sql.parquet.binaryAsString, true) scala hiveContext.sql(create external table if not exists persisted_table(data_array ARRAY STRUCTfield1: INT, field2: INT, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table') scala hiveContext.sql(insert into table persisted_table select * from tmp_table).collect scala hiveContext.sql(select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data).collect parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://*/test_table/part-1 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797) at
[jira] [Commented] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.
[ https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368409#comment-14368409 ] Saisai Shao commented on SPARK-6404: Would you please paste your code snippet so we can identify the problem? Call broadcast() in each interval for spark streaming programs. --- Key: SPARK-6404 URL: https://issues.apache.org/jira/browse/SPARK-6404 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Yifan Wang If I understand it correctly, Spark’s broadcast() function will be called only once at the beginning of the batch. For streaming applications that need to run 24/7, it is often necessary to update variables that are shared by broadcast() dynamically. It would be ideal if broadcast() could be called at the beginning of each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368486#comment-14368486 ] haiyang commented on SPARK-6200: You are right! In fact, I haven't added the code that makes DialectManager configurable, because I thought adding a lot of public API would make it even more difficult for you to accept this PR :). I think the implementation you linked to below is a good one; it's simpler than what I provided. But sometimes I may want to give my own dialect a short name like sql or hiveql, or load some default dialects with short names when Spark SQL starts. I don't see the corresponding code for that in the linked implementation; if I misunderstood, please forgive me :). What do you think of this idea? Maybe I can leave some comments to make that implementation better :). Thank you for your attention to my PR. Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager; supports a dialect command and adding new dialects via SQL statements, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
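For context, the switch that already exists in 1.3 is a plain configuration key rather than the short-name registration discussed here; a sketch of how it is used today:
{code}
// built-in SQL parser
sqlContext.setConf("spark.sql.dialect", "sql")
sqlContext.sql("SELECT 1")

// with a HiveContext, the HiveQL parser can be selected the same way
// hiveContext.setConf("spark.sql.dialect", "hiveql")
{code}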
[jira] [Created] (SPARK-6407) Streaming ALS for Collaborative Filtering
Felix Cheung created SPARK-6407: --- Summary: Streaming ALS for Collaborative Filtering Key: SPARK-6407 URL: https://issues.apache.org/jira/browse/SPARK-6407 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Felix Cheung Priority: Minor Like MLLib's ALS implementation for recommendation, and applying to streaming -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368528#comment-14368528 ] Yin Huai commented on SPARK-4258: - [~liancheng] Does the version that we are currently using have the fix? NPE with new Parquet Filters Key: SPARK-4258 URL: https://issues.apache.org/jira/browse/SPARK-4258 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): java.lang.NullPointerException: parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Or.accept(Operators.java:302) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$And.accept(Operators.java:290) parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) {code} This occurs when reading parquet data encoded with the older version of the library for TPC-DS query 34. Will work on coming up with a smaller reproduction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4485) Add broadcast outer join to optimize left outer join and right outer join
[ https://issues.apache.org/jira/browse/SPARK-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4485: Target Version/s: 1.4.0 (was: 1.1.0) Add broadcast outer join to optimize left outer join and right outer join -- Key: SPARK-4485 URL: https://issues.apache.org/jira/browse/SPARK-4485 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Priority: Minor Original Estimate: 0.05h Remaining Estimate: 0.05h For now, Spark uses a broadcast join instead of a hash join to optimize {{inner join}} when the size of one side's data does not reach {{AUTO_BROADCASTJOIN_THRESHOLD}}. However, Spark SQL will perform shuffle operations on each child relation while executing {{left outer join}} and {{right outer join}}. {{outer join}} is also well suited to optimization with a broadcast join. We are planning to create a {{BroadcastHashOuterJoin}} to implement the broadcast join for {{left outer join}} and {{right outer join}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm
[ https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6374: - Assignee: yuhao yang Add getter for GeneralizedLinearAlgorithm - Key: SPARK-6374 URL: https://issues.apache.org/jira/browse/SPARK-6374 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Assignee: yuhao yang Priority: Minor Fix For: 1.4.0 Original Estimate: 1h Remaining Estimate: 1h I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the values through the debugger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
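The kind of getters being asked for, sketched on a toy class rather than MLlib's actual GeneralizedLinearAlgorithm; the member and method names are illustrative only.
{code}
class ToyLinearAlgorithm {
  protected var numFeatures: Int = -1
  protected var addIntercept: Boolean = false

  // read-only accessors of the kind requested
  def getNumFeatures: Int = numFeatures
  def isAddIntercept: Boolean = addIntercept
}
{code}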
[jira] [Updated] (SPARK-6401) Unable to load a old API input format in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6401: - Priority: Minor (was: Major) You mean the .mapred. MapReduce API right? I think it's not unreasonable to add, but, the new APIs have been around for so long (Hadoop 1.x) that I don't know if people would be using the old API at all? That is I'd almost say these should be deprecated. Unable to load a old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Reporter: Rémy DUBOIS Priority: Minor The fileStream method of the JavaStreamingContext class does not allow using a old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6401) Unable to load a old API input format in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367789#comment-14367789 ] Rémy DUBOIS edited comment on SPARK-6401 at 3/18/15 8:18 PM: - Yes I mean the mapred API. All our input formats are developed in mapred API so it would allow us to avoid rewriting them in mapreduce API. Since the batch API can read from a mapred InputFormat, don't you think it would be more consistent to have the same possibility in the streaming API? was (Author: rdubois): Yes I mean the mapred API. All our input formats are developed in mapred API so it would allow us to avoid rewriting them in mapreduce API. Or at least, it would allow us to do it gradually. Since the batch API can read from a mapred InputFormat, don't you think it would be more consistent to have the same possibility in the streaming API? Unable to load a old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Reporter: Rémy DUBOIS Priority: Minor The fileStream method of the JavaStreamingContext class does not allow using a old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6168) Expose some of the collection classes as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367787#comment-14367787 ] Apache Spark commented on SPARK-6168: - User 'mridulm' has created a pull request for this issue: https://github.com/apache/spark/pull/5084 Expose some of the collection classes as DeveloperApi - Key: SPARK-6168 URL: https://issues.apache.org/jira/browse/SPARK-6168 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Mridul Muralidharan Priority: Minor It would be very useful to mark some of the collection and util classes we have as @DeveloperApi or even @Experimental. For example : - org.apache.spark.util.BoundedPriorityQueue - org.apache.spark.util.collection.Utils - org.apache.spark.util.collection.CompactBuffer Usecases for these are for example to do a treeAggregate and/or treeReduce impl for common api's within RDD which fail due to limitations of aggregate/reduce at scale (ref: SPARK-6165). Ofcourse, they are useful for other things too :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6401) Unable to load a old API input format in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367789#comment-14367789 ] Rémy DUBOIS commented on SPARK-6401: Yes I mean the mapred API. All our input formats are developed in mapred API so it would allow us to avoid rewriting them in mapreduce API. Or at least, it would allow us to do it gradually. Since the batch API can read from a mapred InputFormat, don't you think it would be more consistent to have the same possibility in the streaming API? Unable to load a old API input format in Spark streaming Key: SPARK-6401 URL: https://issues.apache.org/jira/browse/SPARK-6401 Project: Spark Issue Type: Improvement Reporter: Rémy DUBOIS Priority: Minor The fileStream method of the JavaStreamingContext class does not allow using a old API InputFormat. This feature exists in Spark batch but not in streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.
[ https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yifan Wang updated SPARK-6404: -- Description: If I understand it correctly, Spark’s broadcast() function will be called only once at the beginning of the batch. For streaming applications that need to run 24/7, it is often necessary to update variables that are shared by broadcast() dynamically. It would be ideal if broadcast() could be called at the beginning of each interval. Call broadcast() in each interval for spark streaming programs. --- Key: SPARK-6404 URL: https://issues.apache.org/jira/browse/SPARK-6404 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Yifan Wang If I understand it correctly, Spark’s broadcast() function will be called only once at the beginning of the batch. For streaming applications that need to run 24/7, it is often necessary to update variables that are shared by broadcast() dynamically. It would be ideal if broadcast() could be called at the beginning of each interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367891#comment-14367891 ] Ilya Ganelin commented on SPARK-5945: - Hi Imran - I'd be happy to tackle this. Could you please assign it to me? Thank you. Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5932) Use consistent naming for byte properties
[ https://issues.apache.org/jira/browse/SPARK-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367919#comment-14367919 ] Ilya Ganelin commented on SPARK-5932: - [~andrewor14] - I can take this out. Thanks. Use consistent naming for byte properties - Key: SPARK-5932 URL: https://issues.apache.org/jira/browse/SPARK-5932 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or This is SPARK-5931's sister issue. The naming of existing byte configs is inconsistent. We currently have the following throughout the code base: {code} spark.reducer.maxMbInFlight // megabytes spark.kryoserializer.buffer.mb // megabytes spark.shuffle.file.buffer.kb // kilobytes spark.broadcast.blockSize // kilobytes spark.executor.logs.rolling.size.maxBytes // bytes spark.io.compression.snappy.block.size // bytes {code} Instead, my proposal is to simplify the config name itself and make everything accept time using the following format: 500b, 2k, 100m, 46g, similar to what we currently use for our memory settings. For instance: {code} spark.reducer.maxSizeInFlight = 10m spark.kryoserializer.buffer = 2m spark.shuffle.file.buffer = 10k spark.broadcast.blockSize = 20k spark.executor.logs.rolling.maxSize = 500b spark.io.compression.snappy.blockSize = 200b {code} All existing configs that are relevant will be deprecated in favor of the new ones. We should do this soon before we keep introducing more time configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
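A minimal sketch of parsing the proposed suffixed sizes ("500b", "10k", "100m", "46g") into bytes; this is illustrative only, not the utility Spark eventually shipped, and the function name is made up.
{code}
def parseByteString(s: String): Long = {
  val trimmed = s.trim.toLowerCase
  trimmed.last match {
    case 'b' => trimmed.dropRight(1).toLong
    case 'k' => trimmed.dropRight(1).toLong << 10
    case 'm' => trimmed.dropRight(1).toLong << 20
    case 'g' => trimmed.dropRight(1).toLong << 30
    case _   => trimmed.toLong   // bare number: assume bytes
  }
}

parseByteString("10m")   // 10485760
{code}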
[jira] [Comment Edited] (SPARK-5931) Use consistent naming for time properties
[ https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367915#comment-14367915 ] Ilya Ganelin edited comment on SPARK-5931 at 3/18/15 9:09 PM: -- [~andrewor14] - I can take this out. Thanks. was (Author: ilganeli): @andrewor - I can take this out. Thanks. Use consistent naming for time properties - Key: SPARK-5931 URL: https://issues.apache.org/jira/browse/SPARK-5931 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or This is SPARK-5932's sister issue. The naming of existing time configs is inconsistent. We currently have the following throughout the code base: {code} spark.network.timeout // seconds spark.executor.heartbeatInterval // milliseconds spark.storage.blockManagerSlaveTimeoutMs // milliseconds spark.yarn.scheduler.heartbeat.interval-ms // milliseconds {code} Instead, my proposal is to simplify the config name itself and make everything accept time using the following format: 5s, 2ms, 100us. For instance: {code} spark.network.timeout = 5s spark.executor.heartbeatInterval = 500ms spark.storage.blockManagerSlaveTimeout = 100ms spark.yarn.scheduler.heartbeatInterval = 400ms {code} All existing configs that are relevant will be deprecated in favor of the new ones. We should do this soon before we keep introducing more time configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5931) Use consistent naming for time properties
[ https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367915#comment-14367915 ] Ilya Ganelin commented on SPARK-5931: - @andrewor - I can take this out. Thanks. Use consistent naming for time properties - Key: SPARK-5931 URL: https://issues.apache.org/jira/browse/SPARK-5931 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or This is SPARK-5932's sister issue. The naming of existing time configs is inconsistent. We currently have the following throughout the code base: {code} spark.network.timeout // seconds spark.executor.heartbeatInterval // milliseconds spark.storage.blockManagerSlaveTimeoutMs // milliseconds spark.yarn.scheduler.heartbeat.interval-ms // milliseconds {code} Instead, my proposal is to simplify the config name itself and make everything accept time using the following format: 5s, 2ms, 100us. For instance: {code} spark.network.timeout = 5s spark.executor.heartbeatInterval = 500ms spark.storage.blockManagerSlaveTimeout = 100ms spark.yarn.scheduler.heartbeatInterval = 400ms {code} All existing configs that are relevant will be deprecated in favor of the new ones. We should do this soon before we keep introducing more time configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6364) hashCode and equals for Matrices
[ https://issues.apache.org/jira/browse/SPARK-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366720#comment-14366720 ] Apache Spark commented on SPARK-6364: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/5081 hashCode and equals for Matrices Key: SPARK-6364 URL: https://issues.apache.org/jira/browse/SPARK-6364 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Manoj Kumar hashCode implementation should be similar to Vector's. But we may want to reduce the complexity by scanning only a few nonzeros instead of all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
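One way to read the "scan only a few nonzeros" idea, sketched on raw CSC-style arrays rather than MLlib's Matrix classes; the names and the cutoff are made up for illustration.
{code}
def sparseMatrixHash(
    numRows: Int,
    numCols: Int,
    rowIndices: Array[Int],
    values: Array[Double],
    maxNonzeros: Int = 16): Int = {
  var h = 31 * numRows + numCols
  var k = 0
  // mix in at most maxNonzeros entries so hashCode stays cheap for large matrices
  while (k < values.length && k < maxNonzeros) {
    h = 31 * h + rowIndices(k)
    h = 31 * h + java.lang.Double.valueOf(values(k)).hashCode()
    k += 1
  }
  h
}
{code}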