[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704418#comment-14704418 ]

Apache Spark commented on SPARK-10100:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8332


AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
--------------------------------------------------------------------------

Key: SPARK-10100
URL: https://issues.apache.org/jira/browse/SPARK-10100
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Herman van Hovell
Attachments: SPARK-10100.perf.test.scala

Looks like Max (and probably Min) implemented based on AggregateFunction2 is slower than the old MaxFunction.
[jira] [Created] (SPARK-10133) loadLibSVMFile fails to detect zero-based lines
Xusen Yin created SPARK-10133:
------------------------------

Summary: loadLibSVMFile fails to detect zero-based lines
Key: SPARK-10133
URL: https://issues.apache.org/jira/browse/SPARK-10133
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.4.1
Reporter: Xusen Yin
Priority: Minor

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L88

The code wants to assert that the indices on each line are one-based and in ascending order, but it fails since previous = -1 in the beginning. In this condition, a libSVM format file that begins with a 0-based index could be read in normally, but the size of the resulting SparseVector is wrong, i.e. numFeatures - 1.
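For reference, a minimal sketch of the validation being discussed, paraphrased rather than copied from the Spark source (the helper name parseIndices and the parsing shape are assumptions for illustration). Note the conversion line just above the loop, which the later comment on this ticket refers to: raw 1-based indices are shifted to 0-based, so a raw index of 0 becomes -1 and is in fact rejected by the strict inequality against previous = -1.

{code}
// Hedged sketch, not the actual MLUtils code. Items look like "3:5.5".
def parseIndices(items: Array[String]): Array[Int] = {
  // Convert 1-based libSVM indices to 0-based (the "above line").
  val indices = items.map(_.split(':')(0).toInt - 1)
  var previous = -1
  for (current <- indices) {
    // A raw index of 0 arrives here as -1, and require(-1 > -1) throws,
    // so a zero-based line does not pass silently.
    require(current > previous,
      "indices should be one-based and in ascending order")
    previous = current
  }
  indices
}
{code}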
[jira] [Assigned] (SPARK-9107) Include memory usage for each job stage
[ https://issues.apache.org/jira/browse/SPARK-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9107:
-----------------------------------

Assignee: Apache Spark


Include memory usage for each job stage
---------------------------------------

Key: SPARK-9107
URL: https://issues.apache.org/jira/browse/SPARK-9107
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Web UI
Reporter: Zhang, Liye
Assignee: Apache Spark

In [spark-9104|https://issues.apache.org/jira/browse/SPARK-9104], the memory usage is shown as run-time memory usage, which means we can only see the current status of memory consumption, in the same way the Storage tab works; it does not store the previous status, and the same applies to the history server WebUI. The target of this issue is to show, for the different finished stages or jobs, what the memory usage status was when/just before the job/stage completed. We can also give the maximum and minimum memory size used during each job/stage.
[jira] [Commented] (SPARK-9106) Log the memory usage info into history server
[ https://issues.apache.org/jira/browse/SPARK-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704511#comment-14704511 ]

Apache Spark commented on SPARK-9106:
--------------------------------------

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753


Log the memory usage info into history server
---------------------------------------------

Key: SPARK-9106
URL: https://issues.apache.org/jira/browse/SPARK-9106
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Web UI
Reporter: Zhang, Liye

Save the memory usage info to the event log so that it can be traced from the history server, allowing users to do offline analysis.
[jira] [Commented] (SPARK-9105) Add an additional WebUI Tab for Memory Usage
[ https://issues.apache.org/jira/browse/SPARK-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704510#comment-14704510 ]

Apache Spark commented on SPARK-9105:
--------------------------------------

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753


Add an additional WebUI Tab for Memory Usage
--------------------------------------------

Key: SPARK-9105
URL: https://issues.apache.org/jira/browse/SPARK-9105
Project: Spark
Issue Type: Sub-task
Components: Web UI
Reporter: Zhang, Liye

Add a WebUI tab for memory usage; the tab should expose the memory usage status of the different Spark components. It should show a summary for each executor and possibly also details for each task. This tab may duplicate some information from the Storage tab, but in a different presentation: taking the RDD cache as an example, the cached RDD size shown on the Storage tab is indexed by RDD name, while on the Memory Usage tab the RDD can be indexed by executor or task. The two tabs can also share some of the same web pages.
[jira] [Commented] (SPARK-9107) Include memory usage for each job stage
[ https://issues.apache.org/jira/browse/SPARK-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704512#comment-14704512 ]

Apache Spark commented on SPARK-9107:
--------------------------------------

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753


Include memory usage for each job stage
---------------------------------------

Key: SPARK-9107
URL: https://issues.apache.org/jira/browse/SPARK-9107
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Web UI
Reporter: Zhang, Liye

In [spark-9104|https://issues.apache.org/jira/browse/SPARK-9104], the memory usage is shown as run-time memory usage, which means we can only see the current status of memory consumption, in the same way the Storage tab works; it does not store the previous status, and the same applies to the history server WebUI. The target of this issue is to show, for the different finished stages or jobs, what the memory usage status was when/just before the job/stage completed. We can also give the maximum and minimum memory size used during each job/stage.
[jira] [Assigned] (SPARK-9107) Include memory usage for each job stage
[ https://issues.apache.org/jira/browse/SPARK-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9107:
-----------------------------------

Assignee: (was: Apache Spark)


Include memory usage for each job stage
---------------------------------------

Key: SPARK-9107
URL: https://issues.apache.org/jira/browse/SPARK-9107
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Web UI
Reporter: Zhang, Liye

In [spark-9104|https://issues.apache.org/jira/browse/SPARK-9104], the memory usage is shown as run-time memory usage, which means we can only see the current status of memory consumption, in the same way the Storage tab works; it does not store the previous status, and the same applies to the history server WebUI. The target of this issue is to show, for the different finished stages or jobs, what the memory usage status was when/just before the job/stage completed. We can also give the maximum and minimum memory size used during each job/stage.
[jira] [Assigned] (SPARK-9106) Log the memory usage info into history server
[ https://issues.apache.org/jira/browse/SPARK-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9106:
-----------------------------------

Assignee: Apache Spark


Log the memory usage info into history server
---------------------------------------------

Key: SPARK-9106
URL: https://issues.apache.org/jira/browse/SPARK-9106
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Web UI
Reporter: Zhang, Liye
Assignee: Apache Spark

Save the memory usage info to the event log so that it can be traced from the history server, allowing users to do offline analysis.
[jira] [Assigned] (SPARK-9105) Add an additional WebUI Tab for Memory Usage
[ https://issues.apache.org/jira/browse/SPARK-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9105:
-----------------------------------

Assignee: (was: Apache Spark)


Add an additional WebUI Tab for Memory Usage
--------------------------------------------

Key: SPARK-9105
URL: https://issues.apache.org/jira/browse/SPARK-9105
Project: Spark
Issue Type: Sub-task
Components: Web UI
Reporter: Zhang, Liye

Add a WebUI tab for memory usage; the tab should expose the memory usage status of the different Spark components. It should show a summary for each executor and possibly also details for each task. This tab may duplicate some information from the Storage tab, but in a different presentation: taking the RDD cache as an example, the cached RDD size shown on the Storage tab is indexed by RDD name, while on the Memory Usage tab the RDD can be indexed by executor or task. The two tabs can also share some of the same web pages.
[jira] [Assigned] (SPARK-9105) Add an additional WebUI Tab for Memory Usage
[ https://issues.apache.org/jira/browse/SPARK-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9105:
-----------------------------------

Assignee: Apache Spark


Add an additional WebUI Tab for Memory Usage
--------------------------------------------

Key: SPARK-9105
URL: https://issues.apache.org/jira/browse/SPARK-9105
Project: Spark
Issue Type: Sub-task
Components: Web UI
Reporter: Zhang, Liye
Assignee: Apache Spark

Add a WebUI tab for memory usage; the tab should expose the memory usage status of the different Spark components. It should show a summary for each executor and possibly also details for each task. This tab may duplicate some information from the Storage tab, but in a different presentation: taking the RDD cache as an example, the cached RDD size shown on the Storage tab is indexed by RDD name, while on the Memory Usage tab the RDD can be indexed by executor or task. The two tabs can also share some of the same web pages.
[jira] [Assigned] (SPARK-9106) Log the memory usage info into history server
[ https://issues.apache.org/jira/browse/SPARK-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9106:
-----------------------------------

Assignee: (was: Apache Spark)


Log the memory usage info into history server
---------------------------------------------

Key: SPARK-9106
URL: https://issues.apache.org/jira/browse/SPARK-9106
Project: Spark
Issue Type: Sub-task
Components: Spark Core, Web UI
Reporter: Zhang, Liye

Save the memory usage info to the event log so that it can be traced from the history server, allowing users to do offline analysis.
[jira] [Commented] (SPARK-10133) loadLibSVMFile fails to detect zero-based lines
[ https://issues.apache.org/jira/browse/SPARK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704521#comment-14704521 ]

Xusen Yin commented on SPARK-10133:
-----------------------------------

Thanks. I had overlooked the line above it.


loadLibSVMFile fails to detect zero-based lines
-----------------------------------------------

Key: SPARK-10133
URL: https://issues.apache.org/jira/browse/SPARK-10133
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.4.1
Reporter: Xusen Yin
Priority: Minor

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L88

The code wants to assert that the indices on each line are one-based and in ascending order, but it fails since previous = -1 in the beginning. In this condition, a libSVM format file that begins with a 0-based index could be read in normally, but the size of the resulting SparseVector is wrong, i.e. numFeatures - 1.
[jira] [Assigned] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9089:
-----------------------------------

Assignee: Apache Spark


Failing to run simple job on Spark Standalone Cluster
-----------------------------------------------------

Key: SPARK-9089
URL: https://issues.apache.org/jira/browse/SPARK-9089
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 1.4.0
Environment: Staging
Reporter: Amar Goradia
Assignee: Apache Spark
Priority: Critical

We are trying out Spark and, as part of that, we have set up a Standalone Spark Cluster. While testing things out, we simply opened a PySpark shell and ran this simple job: a=sc.parallelize([1,2,3]).count(). As a result, we are getting errors. We tried googling around this error but haven't been able to find the exact reason why we are running into this state. Can somebody please help us look further into this issue and advise us on what we are missing here?

Here is the full error stack:

>>> a=sc.parallelize([1,2,3]).count()
15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5 (count at <stdin>:1)
15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] at count at <stdin>:1), which has no missing parents
15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) failed in Unknown s
15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 0.004963 s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 972, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 963, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 771, in reduce
    vals = self.mapPartitions(func).collect()
  File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 745, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
	at
[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705278#comment-14705278 ]

Silas Davis commented on SPARK-3533:
------------------------------------

I've looked at various solutions, and have summarised what I found in my post here: http://apache-spark-developers-list.1001551.n3.nabble.com/Writing-to-multiple-outputs-in-Spark-td13298.html. The Stack Overflow question linked only addresses multiple Text outputs, and only does that for hadoop 1. My code synthesises the idea of using a wrapping OutputFormat, and of another gist that uses MultipleOutputs, but modifies saveAsNewAPIHadoopFile. My code also makes do with the current Spark API, but it was enough effort, and the aim seems common enough, that I'd argue some of it should be moved into Spark itself. As for showing some code, my implementation is contained in the gist I have posted, and I have added it to the links attached to this ticket. I was hoping to get some comments on the code before embarking on a full pull request, which would require more consideration of where to place files etc. I'm not sure if you're suggesting it would be better to make a pull request now, or whether the gist is sufficient. I will open a pull request if you prefer. Is there anything else I should be doing to get committer buy-in? [~nchammas] Have you been able to take a look at the code?


Add saveAsTextFileByKey() method to RDDs
----------------------------------------

Key: SPARK-3533
URL: https://issues.apache.org/jira/browse/SPARK-3533
Project: Spark
Issue Type: Improvement
Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this:

{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once.
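For readers hitting this before any built-in method lands, here is a sketch of the hadoop 1 (mapred API) workaround the comment mentions, in Scala: a {{MultipleTextOutputFormat}} subclass that routes each record into a directory named after its key. This mirrors the Stack Overflow approach discussed above rather than the gist's implementation; the class name and output path are made up for the example.

{code}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext

// Route each (key, value) pair to <outputPath>/<key>/part-NNNNN.
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the file contents; only the value is written.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
  // `name` is the usual part-NNNNN leaf file name.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

def saveByKey(sc: SparkContext): Unit = {
  val a = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
    .keyBy(_.substring(0, 1))
  a.saveAsHadoopFile("/path/prefix", classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])
}
{code}

As the comment notes, this only covers text output with the old mapred API; a general solution would need the new API and MultipleOutputs.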
[jira] [Commented] (SPARK-9944) hive.metastore.warehouse.dir is not respected
[ https://issues.apache.org/jira/browse/SPARK-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705315#comment-14705315 ]

Yin Huai commented on SPARK-9944:
---------------------------------

OK. I guess {{/user/ec2-user/warehouse}} is not the one you set, right? I took a look and here is my finding. The reason is that in Spark SQL, if you do not specify a database name when you create a table, your table will be created under the current database, and we use the location of the current database as the parent dir of your table dir. If you create a table in the {{default}} database, we use the location of the {{default}} database as the parent dir of your tables, and once this default db is created, {{hive.metastore.warehouse.dir}} cannot override it. Hive, however, will let you use {{hive.metastore.warehouse.dir}} as the parent dir of your tables if you are using the {{default}} database and you do not specify the location of your table. See https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L4473-L4479 (this was added by https://issues.apache.org/jira/browse/HIVE-6374).


hive.metastore.warehouse.dir is not respected
---------------------------------------------

Key: SPARK-9944
URL: https://issues.apache.org/jira/browse/SPARK-9944
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0, 1.4.1
Reporter: Manku Timma

In 1.3.1, {{hive.metastore.warehouse.dir}} was honoured and table data was stored there. In 1.4.0, this is no longer used. Instead {{DBS.DB_LOCATION_URI}} of the metastore is used always. This breaks use cases where the param is used to override the warehouse location. To reproduce the issue, start spark-shell with {{hive.metastore.warehouse.dir}} set in hive-site.xml and run {{df.saveAsTable("x")}}. You will see that the param is not honoured.
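Given the explanation above (the table path is derived from the location of the current database, which is fixed when that database is created), one possible workaround is to create a database whose LOCATION points at the desired warehouse directory and make it current before saving. This is a hedged sketch, not a confirmed fix: it assumes a HiveContext-backed {{sqlContext}} and a DataFrame {{df}} as in the report, and the database name and path are invented for the example.

{code}
// Assumption: sqlContext is a HiveContext; mydb and the path are illustrative.
sqlContext.sql(
  "CREATE DATABASE IF NOT EXISTS mydb LOCATION '/custom/warehouse/mydb.db'")
sqlContext.sql("USE mydb")
df.saveAsTable("x") // table dir should land under /custom/warehouse/mydb.db/x
{code}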
[jira] [Resolved] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type
[ https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-9982.
------------------------------------------

Resolution: Fixed
Fix Version/s: 1.5.0


SparkR DataFrame fail to return data of Decimal type
----------------------------------------------------

Key: SPARK-9982
URL: https://issues.apache.org/jira/browse/SPARK-9982
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.1
Reporter: Alex Shkurenko
Assignee: Alex Shkurenko
Fix For: 1.5.0

Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but with the Decimal datatype coming from a Postgres DB:

# Set up SparkR
Sys.setenv(SPARK_HOME = "/Users/ashkurenko/work/git_repos/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS = "--driver-class-path ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")

# Connect to a Postgres DB via JDBC
sqlContext <- sparkRSQL.init(sc)
sql(sqlContext, "CREATE TEMPORARY TABLE mytable
  USING org.apache.spark.sql.jdbc
  OPTIONS (url 'jdbc:postgresql://servername:5432/dbname', dbtable 'mydbtable')")

# Try pulling a Decimal column from a table
myDataFrame <- sql(sqlContext, "select a_decimal_column from mytable")

# The schema shows up fine:
show(myDataFrame)
# DataFrame[a_decimal_column:decimal(10,0)]
schema(myDataFrame)
# StructType
# |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE

# ... but pulling data fails:
localDF <- collect(myDataFrame)
# Error in as.data.frame.default(x[[i]], optional = TRUE) :
#   cannot coerce class "jobj" to a data.frame
[jira] [Updated] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type
[ https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-9982:
-----------------------------------------

Assignee: Alex Shkurenko


SparkR DataFrame fail to return data of Decimal type
----------------------------------------------------

Key: SPARK-9982
URL: https://issues.apache.org/jira/browse/SPARK-9982
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.1
Reporter: Alex Shkurenko
Assignee: Alex Shkurenko

Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but with the Decimal datatype coming from a Postgres DB:

# Set up SparkR
Sys.setenv(SPARK_HOME = "/Users/ashkurenko/work/git_repos/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS = "--driver-class-path ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")

# Connect to a Postgres DB via JDBC
sqlContext <- sparkRSQL.init(sc)
sql(sqlContext, "CREATE TEMPORARY TABLE mytable
  USING org.apache.spark.sql.jdbc
  OPTIONS (url 'jdbc:postgresql://servername:5432/dbname', dbtable 'mydbtable')")

# Try pulling a Decimal column from a table
myDataFrame <- sql(sqlContext, "select a_decimal_column from mytable")

# The schema shows up fine:
show(myDataFrame)
# DataFrame[a_decimal_column:decimal(10,0)]
schema(myDataFrame)
# StructType
# |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE

# ... but pulling data fails:
localDF <- collect(myDataFrame)
# Error in as.data.frame.default(x[[i]], optional = TRUE) :
#   cannot coerce class "jobj" to a data.frame
[jira] [Commented] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type
[ https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705328#comment-14705328 ]

Shivaram Venkataraman commented on SPARK-9982:
----------------------------------------------

Resolved by https://github.com/apache/spark/pull/8239


SparkR DataFrame fail to return data of Decimal type
----------------------------------------------------

Key: SPARK-9982
URL: https://issues.apache.org/jira/browse/SPARK-9982
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.1
Reporter: Alex Shkurenko

Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but with the Decimal datatype coming from a Postgres DB:

# Set up SparkR
Sys.setenv(SPARK_HOME = "/Users/ashkurenko/work/git_repos/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS = "--driver-class-path ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "local")

# Connect to a Postgres DB via JDBC
sqlContext <- sparkRSQL.init(sc)
sql(sqlContext, "CREATE TEMPORARY TABLE mytable
  USING org.apache.spark.sql.jdbc
  OPTIONS (url 'jdbc:postgresql://servername:5432/dbname', dbtable 'mydbtable')")

# Try pulling a Decimal column from a table
myDataFrame <- sql(sqlContext, "select a_decimal_column from mytable")

# The schema shows up fine:
show(myDataFrame)
# DataFrame[a_decimal_column:decimal(10,0)]
schema(myDataFrame)
# StructType
# |-name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE

# ... but pulling data fails:
localDF <- collect(myDataFrame)
# Error in as.data.frame.default(x[[i]], optional = TRUE) :
#   cannot coerce class "jobj" to a data.frame
[jira] [Commented] (SPARK-7544) pyspark.sql.types.Row should implement __getitem__
[ https://issues.apache.org/jira/browse/SPARK-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704430#comment-14704430 ]

Yanbo Liang commented on SPARK-7544:
------------------------------------

I can work on it.


pyspark.sql.types.Row should implement __getitem__
--------------------------------------------------

Key: SPARK-7544
URL: https://issues.apache.org/jira/browse/SPARK-7544
Project: Spark
Issue Type: Sub-task
Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor

Following from the related discussions in [SPARK-7505] and [SPARK-7133], the {{Row}} type should implement {{\_\_getitem\_\_}} so that people can do this

{code}
row['field']
{code}

instead of this:

{code}
row.field
{code}
[jira] [Resolved] (SPARK-10131) running spark job in docker by mesos-slave
[ https://issues.apache.org/jira/browse/SPARK-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-10131.
-------------------------------

Resolution: Invalid

Do you mind asking this at u...@spark.apache.org? As far as I can tell you're just asking for assistance in understanding the problem rather than reporting a specific problem attributable to Spark. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please.


running spark job in docker by mesos-slave
------------------------------------------

Key: SPARK-10131
URL: https://issues.apache.org/jira/browse/SPARK-10131
Project: Spark
Issue Type: Bug
Components: Mesos
Affects Versions: 1.4.1
Environment: docker 1.8.1, mesos 0.23.0, spark 1.4.1
Reporter: Stream Liu

I try to run a Spark job in Docker via mesos-slave, but I always get this ERROR in mesos-slave:

E0820 07:46:08.780293 9 slave.cpp:1643] Failed to update resources for container f2aeb5ee-2419-430c-be7d-8276947b909a of executor '20150820-064813-1684252864-5050-1-S0' of framework 20150820-064813-1684252864-5050-1-0004, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/13071/cgroup: Failed to open file '/proc/13071/cgroup': No such file or directory

The target container can run but always exits with ExitCode 137. I think this could be caused by cgroups? The job works without --containerizers=docker.
[jira] [Assigned] (SPARK-10134) Improve the performance of Binary Comparison
[ https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10134:
------------------------------------

Assignee: Apache Spark


Improve the performance of Binary Comparison
--------------------------------------------

Key: SPARK-10134
URL: https://issues.apache.org/jira/browse/SPARK-10134
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark
Fix For: 1.6.0

Currently, comparing binary data byte by byte is quite slow; use the Guava utility to improve the performance, which takes 8 bytes at a time in the comparison.
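For context, the Guava utility the description appears to refer to is the lexicographical unsigned-byte comparator, whose Unsafe-backed implementation compares 8 bytes per iteration via 64-bit reads instead of one byte at a time. A minimal Scala illustration of the idea (not the Spark patch itself):

{code}
import com.google.common.primitives.UnsignedBytes

// Guava picks the Unsafe (8-bytes-at-a-time) implementation when available,
// and falls back to a plain byte-by-byte comparator otherwise.
val cmp = UnsignedBytes.lexicographicalComparator()
val r = cmp.compare(Array[Byte](1, 2, 3), Array[Byte](1, 2, 4)) // r < 0
{code}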
[jira] [Commented] (SPARK-10134) Improve the performance of Binary Comparison
[ https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704487#comment-14704487 ]

Apache Spark commented on SPARK-10134:
--------------------------------------

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/8335


Improve the performance of Binary Comparison
--------------------------------------------

Key: SPARK-10134
URL: https://issues.apache.org/jira/browse/SPARK-10134
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Fix For: 1.6.0

Currently, comparing binary data byte by byte is quite slow; use the Guava utility to improve the performance, which takes 8 bytes at a time in the comparison.
[jira] [Assigned] (SPARK-10134) Improve the performance of Binary Comparison
[ https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10134:
------------------------------------

Assignee: (was: Apache Spark)


Improve the performance of Binary Comparison
--------------------------------------------

Key: SPARK-10134
URL: https://issues.apache.org/jira/browse/SPARK-10134
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Fix For: 1.6.0

Currently, comparing binary data byte by byte is quite slow; use the Guava utility to improve the performance, which takes 8 bytes at a time in the comparison.
[jira] [Commented] (SPARK-10130) type coercion for IF should have children resolved first
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704513#comment-14704513 ]

Cheng Hao commented on SPARK-10130:
-----------------------------------

Can you change the fix version to 1.5? Lots of people suffer from this issue, I think.


type coercion for IF should have children resolved first
---------------------------------------------------------

Key: SPARK-10130
URL: https://issues.apache.org/jira/browse/SPARK-10130
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Adrian Wang

SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704351#comment-14704351 ]

Yin Huai commented on SPARK-10100:
----------------------------------

How about we leave these functions as-is for now? It looks like the improvement provided by updating the expressions is not very significant, and this also avoids code changes during the QA period.


AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
--------------------------------------------------------------------------

Key: SPARK-10100
URL: https://issues.apache.org/jira/browse/SPARK-10100
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Herman van Hovell
Attachments: SPARK-10100.perf.test.scala

Looks like Max (and probably Min) implemented based on AggregateFunction2 is slower than the old MaxFunction.
[jira] [Commented] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704374#comment-14704374 ]

pin_zhang commented on SPARK-9686:
----------------------------------

What's the status of this bug? Will it be fixed in 1.4.x?


Spark hive jdbc client cannot get table from metadata store
-----------------------------------------------------------

Key: SPARK-9686
URL: https://issues.apache.org/jira/browse/SPARK-9686
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0, 1.4.1
Reporter: pin_zhang
Assignee: Cheng Lian

1. Start start-thriftserver.sh
2. Connect with beeline
3. Create a table
4. Show tables; the newly created table is returned
5. Run:

Class.forName("org.apache.hive.jdbc.HiveDriver");
String URL = "jdbc:hive2://localhost:1/default";
Properties info = new Properties();
Connection conn = DriverManager.getConnection(URL, info);
ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), null, null, null);

Problem: no tables are returned by this API, which worked in Spark 1.3.
[jira] [Created] (SPARK-10130) type coercion for IF should have children resolved first
Adrian Wang created SPARK-10130:
--------------------------------

Summary: type coercion for IF should have children resolved first
Key: SPARK-10130
URL: https://issues.apache.org/jira/browse/SPARK-10130
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Adrian Wang

SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
[jira] [Commented] (SPARK-9040) StructField datatype Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704393#comment-14704393 ]

Yanbo Liang commented on SPARK-9040:
------------------------------------

[~vnayak053] The code works well on Spark 1.4. Have you tried it on Spark 1.4?


StructField datatype Conversion Error
-------------------------------------

Key: SPARK-9040
URL: https://issues.apache.org/jira/browse/SPARK-9040
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core, SQL
Affects Versions: 1.3.0
Environment: Cloudera 5.3 on CDH 6
Reporter: Sandeep Pal

The following issue occurs if I specify the StructFields in this specific order in the StructType:

fields = [StructField("d", IntegerType(), True), StructField("b", IntegerType(), True), StructField("a", StringType(), True), StructField("c", IntegerType(), True)]

But the following code works fine:

fields = [StructField("d", IntegerType(), True), StructField("b", IntegerType(), True), StructField("c", IntegerType(), True), StructField("a", StringType(), True)]

<ipython-input-27-9d675dd6a2c9> in <module>()
     18
     19 schema = StructType(fields)
---> 20 schemasimid_simple = sqlContext.createDataFrame(simid_simplereqfields, schema)
     21 schemasimid_simple.registerTempTable("simid_simple")

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    302
    303         for row in rows:
--> 304             _verify_type(row, schema)
    305
    306         # convert python objects to sql data

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in _verify_type(obj, dataType)
    986                             "length of fields (%d)" % (len(obj), len(dataType.fields)))
    987         for v, f in zip(obj, dataType.fields):
--> 988             _verify_type(v, f.dataType)
    989
    990 _cached_cls = weakref.WeakValueDictionary()

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in _verify_type(obj, dataType)
    970     if type(obj) not in _acceptable_types[_type]:
    971         raise TypeError("%s can not accept object in type %s"
--> 972                         % (dataType, type(obj)))
    973
    974     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'int'>
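A hedged Scala analogue of what the report runs into: createDataFrame verifies the i-th value of each Row against the i-th StructField positionally, so the schema's field order must follow the order of the values in the data. The names and values below are invented for illustration, and {{sc}}/{{sqlContext}} are assumed to exist as in a shell session.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Data rows carry values in the order (d, b, a, c).
val rows = sc.parallelize(Seq(Row(4, 2, "x", 3)))
// The schema must list fields in the same positional order; swapping
// "a" and "c" here would pair the string "x" with an IntegerType field.
val schema = StructType(Seq(
  StructField("d", IntegerType, nullable = true),
  StructField("b", IntegerType, nullable = true),
  StructField("a", StringType, nullable = true),
  StructField("c", IntegerType, nullable = true)))
val df = sqlContext.createDataFrame(rows, schema)
{code}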
[jira] [Resolved] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala
[ https://issues.apache.org/jira/browse/SPARK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-9098.
------------------------------

Resolution: Duplicate
Target Version/s:   (was: 1.6.0)

I agree, I think this is a subset of the broader fix/issue in SPARK-9793


Inconsistent Dense Vectors hashing between PySpark and Scala
------------------------------------------------------------

Key: SPARK-9098
URL: https://issues.apache.org/jira/browse/SPARK-9098
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.3.1, 1.4.0
Reporter: Maciej Szymkiewicz
Priority: Minor

When using Scala it is possible to group an RDD using DenseVector as a key:

{code}
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(
  (Vectors.dense(1, 2, 3), 10) ::
  (Vectors.dense(1, 2, 3), 20) :: Nil)

rdd.groupByKey.count
{code}

returns 1 as expected. In PySpark, {{DenseVector}} {{\_\_hash\_\_}} seems to be inherited from {{object}} and based on memory address:

{code}
from pyspark.mllib.linalg import DenseVector

rdd = sc.parallelize(
    [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)])
rdd.groupByKey().count()
{code}

returns 2. Since the underlying {{numpy.ndarray}} can be used to mutate a DenseVector, hashing doesn't look meaningful at all:

{code}
>>> dv = DenseVector([1, 2, 3])
>>> hdv1 = hash(dv)
>>> dv.array[0] = 3.0
>>> hdv2 = hash(dv)
>>> hdv1 == hdv2
True
>>> dv == DenseVector([1, 2, 3])
False
{code}

In my opinion the best approach would be to enforce immutability and provide a meaningful hashing. An alternative is to make {{DenseVector}} unhashable, same as {{numpy.ndarray}}.

Source: http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752
[jira] [Issue Comment Deleted] (SPARK-10067) Long delay (16 seconds) when running local session on offline machine
[ https://issues.apache.org/jira/browse/SPARK-10067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Pinyol updated SPARK-10067:
----------------------------------

Comment: was deleted

(was: Fixed after upgrading to JDK 1.8.0_60. Probably due to http://bugs.java.com/view_bug.do?bug_id=8077102)


Long delay (16 seconds) when running local session on offline machine
----------------------------------------------------------------------

Key: SPARK-10067
URL: https://issues.apache.org/jira/browse/SPARK-10067
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.4.1
Environment: Mac 10.10.5, java 1.8.0_51 from IntelliJ 14.1
Reporter: Daniel Pinyol
Priority: Minor

If I run this

{code:java}
SparkContext sc = new SparkContext("local", "test");
{code}

on a machine with no network, it hangs for 15 or 16 seconds at this point, and then successfully resumes. It looks like the problem is in checking the Kerberos realm (see the call stack below). Is there any way to avoid this annoying delay? I reviewed https://spark.apache.org/docs/latest/configuration.html, but couldn't find any solution. Thanks.

{noformat}
main@1 prio=5 tid=0x1 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	at java.net.PlainDatagramSocketImpl.peekData(PlainDatagramSocketImpl.java:-1)
	- locked <0x758> (a java.net.PlainDatagramSocketImpl)
	at java.net.DatagramSocket.receive(DatagramSocket.java:787)
	- locked <0x732> (a java.net.DatagramSocket)
	- locked <0x759> (a java.net.DatagramPacket)
	at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:413)
	at com.sun.jndi.dns.DnsClient.query(DnsClient.java:207)
	at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
	at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:434)
	at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:235)
	at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:141)
	at com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
	at sun.security.krb5.KrbServiceLocator.getKerberosService(KrbServiceLocator.java:85)
	at sun.security.krb5.Config.checkRealm(Config.java:1120)
	at sun.security.krb5.Config.getRealmFromDNS(Config.java:1093)
	at sun.security.krb5.Config.getDefaultRealm(Config.java:987)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:75)
	at org.apache.hadoop.security.authentication.util.KerberosName.<clinit>(KerberosName.java:85)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:225)
	- locked <0x57d> (a java.lang.Class)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:214)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:669)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2162)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2162)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2162)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:301)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:155)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:170)
	at DataFrameSandbox.<init>(DataFrameSandbox.java:31)
	at DataFrameSandbox.main(DataFrameSandbox.java:45)
{noformat}
[jira] [Updated] (SPARK-10132) daemon crash caused by memory leak
[ https://issues.apache.org/jira/browse/SPARK-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZemingZhao updated SPARK-10132:
-------------------------------

Attachment: xqjmap.live
            xqjmap.all
            oracle_gclog

Attaching the GC log and jmap info.


daemon crash caused by memory leak
----------------------------------

Key: SPARK-10132
URL: https://issues.apache.org/jira/browse/SPARK-10132
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.1, 1.4.1
Environment:
1. Cluster: 7-node Red Hat cluster, each node with 32 cores
2. OS type: Red Hat Enterprise Linux Server release 7.1 (Maipo)
3. Java version: tried both Oracle JDK 1.6 and 1.7
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
4. JVM options in spark-env.sh (note: SPARK_DAEMON_MEMORY was set to 300m to speed up the crash):
SPARK_DAEMON_JAVA_OPTS="-Xloggc:/root/spark/oracle_gclog"
SPARK_DAEMON_MEMORY=300m
Reporter: ZemingZhao
Priority: Critical
Attachments: oracle_gclog, xqjmap.all, xqjmap.live

We constantly submit short batch workloads to Spark. The Spark master and worker will crash, caused by a memory leak. According to the GC log and jmap info, this leak should be related to Akka, but we cannot find the root cause so far.
[jira] [Commented] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704540#comment-14704540 ]

Apache Spark commented on SPARK-9089:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8337


Failing to run simple job on Spark Standalone Cluster
-----------------------------------------------------

Key: SPARK-9089
URL: https://issues.apache.org/jira/browse/SPARK-9089
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 1.4.0
Environment: Staging
Reporter: Amar Goradia
Priority: Critical

We are trying out Spark and, as part of that, we have set up a Standalone Spark Cluster. While testing things out, we simply opened a PySpark shell and ran this simple job: a=sc.parallelize([1,2,3]).count(). As a result, we are getting errors. We tried googling around this error but haven't been able to find the exact reason why we are running into this state. Can somebody please help us look further into this issue and advise us on what we are missing here? (The full error stack is quoted in the issue description above.)
[jira] [Assigned] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9089:
-----------------------------------

Assignee: (was: Apache Spark)


Failing to run simple job on Spark Standalone Cluster
-----------------------------------------------------

Key: SPARK-9089
URL: https://issues.apache.org/jira/browse/SPARK-9089
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 1.4.0
Environment: Staging
Reporter: Amar Goradia
Priority: Critical

We are trying out Spark and, as part of that, we have set up a Standalone Spark Cluster. While testing things out, we simply opened a PySpark shell and ran this simple job: a=sc.parallelize([1,2,3]).count(). As a result, we are getting errors. We tried googling around this error but haven't been able to find the exact reason why we are running into this state. Can somebody please help us look further into this issue and advise us on what we are missing here? (The full error stack is quoted in the issue description above.)
[jira] [Comment Edited] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704559#comment-14704559 ] Yanbo Liang edited comment on SPARK-9089 at 8/20/15 9:18 AM: - I think this issue is due to failure of compression codec construction and it throws {{InvocationTargetException}}. It usually happened when Snappy was configured as the compression codec. Here we can catch the {{InvocationTargetException}} and throw {{IllegalArgumentException}} that tells users to fallback to another one such as LZF. was (Author: yanboliang): I think this issue is due to failure of compression codec construction and it throws {{InvocationTargetException}}. It usually happened when Snappy was configured as the compression codec. Here we can catch the {{InvocationTargetException}} and throw {{IllegalArgumentException}} that tells users to switch to fallback compression codec such as LZF. Failing to run simple job on Spark Standalone Cluster - Key: SPARK-9089 URL: https://issues.apache.org/jira/browse/SPARK-9089 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.4.0 Environment: Staging Reporter: Amar Goradia Priority: Critical We are trying out Spark and as part of that, we have setup Standalone Spark Cluster. As part of testing things out, we simple open PySpark shell and ran this simple job: a=sc.parallelize([1,2,3]).count() As a result, we are getting errors. We tried googling around this error but haven't been able to find exact reasoning behind why we are running into this state. Can somebody please help us further look into this issue and advise us on what we are missing here? Here is full error stack: a=sc.parallelize([1,2,3]).count() 15/07/16 00:52:15 INFO SparkContext: Starting job: count at stdin:1 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at stdin:1) with 2 output partitions (allowLocal=false) 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at stdin:1) 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List() 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List() 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] at count at stdin:1), which has no missing parents 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at stdin:1) failed in Unknown s 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at stdin:1, took 0.004963 s Traceback (most recent call last): File stdin, line 1, in module File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 972, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 963, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 771, in reduce vals = self.mapPartitions(func).collect() File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 745, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) java.lang.reflect.Constructor.newInstance(Constructor.java:526) org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68) org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60) org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73) org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
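To make the proposed handling concrete, here is a minimal Scala sketch (not the actual Spark patch; {{CodecFactory}} and the no-arg constructor are illustrative assumptions) of wrapping the reflective codec construction so the failure surfaces as an actionable {{IllegalArgumentException}}:
{code:title=CodecFallback.scala}
import java.lang.reflect.InvocationTargetException

// Stand-in for org.apache.spark.io.CompressionCodec; illustrative only.
trait CompressionCodec

object CodecFactory {
  // A constructor failure (e.g. Snappy's native library failing to load)
  // arrives as an InvocationTargetException; rethrow it with a hint that
  // points users at an alternative codec.
  def createCodec(codecClass: Class[_ <: CompressionCodec]): CompressionCodec = {
    try {
      codecClass.getConstructor().newInstance()
    } catch {
      case e: InvocationTargetException =>
        throw new IllegalArgumentException(
          s"Failed to create codec ${codecClass.getName}; consider setting " +
            "spark.io.compression.codec to another codec such as lzf",
          e.getCause)
    }
  }
}
{code}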
[jira] [Commented] (SPARK-8805) Spark shell not working
[ https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704497#comment-14704497 ] Sean Owen commented on SPARK-8805: -- Yeah, I think the problem is that your version of bash is pretty old. bash 4 has been around for a long time; I'd use that. Spark shell not working --- Key: SPARK-8805 URL: https://issues.apache.org/jira/browse/SPARK-8805 Project: Spark Issue Type: Brainstorming Components: Spark Core, Windows Reporter: Perinkulam I Ganesh I am using Git Bash on Windows. I installed OpenJDK 1.8.0_45 and Spark 1.4.0. I am able to build Spark and install it, but whenever I execute spark-shell it gives me the following error: $ spark-shell /c/.../spark/bin/spark-class: line 76: conditional binary operator expected -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10122. --- Resolution: Not A Problem I don't think this is a bug. Yes, only the initial RDD is actually a Kafka RDD. You need to operate on it if you manipulate offset ranges. AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Priority: Critical Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed. {code:title=kafka_test.py|collapse=true} from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils def attach_kafka_metadata(kafka_rdd): offset_ranges = kafka_rdd.offsetRanges() return kafka_rdd if __name__ == __main__: sc = SparkContext(appName='kafka-test') ssc = StreamingContext(sc, 10) kafka_stream = KafkaUtils.createDirectStream( ssc, [TOPIC], kafkaParams={ 'metadata.broker.list': BROKERS, }, ) kafka_stream.transform(attach_kafka_metadata).count().pprint() ssc.start() ssc.awaitTermination() {code} {code:title=Stack trace|collapse=true} Traceback (most recent call last): File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, line 62, in call r = self.func(t, *rdds) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, line 332, in lambda func = lambda t, rdd: oldfunc(rdd) File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in attach_kafka_metadata offset_ranges = kafka_rdd.offsetRanges() AttributeError: 'RDD' object has no attribute 'offsetRanges' {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
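For reference, the working pattern (shown here with the Scala direct-stream API, which the Python API mirrors; the broker and topic values are placeholders): read the offset ranges inside the first {{transform}}, while the RDD is still a KafkaRDD, and only then chain further operations.
{code:title=OffsetCaptureExample.scala}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object OffsetCaptureExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offsets"), Seconds(10))
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("topic"))

    var offsetRanges = Array.empty[OffsetRange]
    stream.transform { rdd =>
      // Only the initial RDD is a KafkaRDD, so capture the ranges here...
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.count().print() // ...after which chaining count() etc. is safe

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}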
[jira] [Commented] (SPARK-10092) Multi-DB support follow up
[ https://issues.apache.org/jira/browse/SPARK-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704516#comment-14704516 ] Apache Spark commented on SPARK-10092: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8336 Multi-DB support follow up -- Key: SPARK-10092 URL: https://issues.apache.org/jira/browse/SPARK-10092 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Seems we need follow-up work for our multi-DB support. Here are the issues we need to address: 1. saveAsTable always saves the table in the folder of the current database. 2. HiveContext's refreshTable and analyze do not support dbName.tableName. 3. It would be good to use TableIdentifier in CreateTableUsing, CreateTableUsingAsSelect, CreateTempTableUsing, CreateTempTableUsingAsSelect, CreateMetastoreDataSource, and CreateMetastoreDataSourceAsSelect, instead of using the string representation (actually, in several places we have already parsed the string and obtained the TableIdentifier). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10130) type coercion for IF should have children resolved first
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-10130: Fix Version/s: 1.5.0 type coercion for IF should have children resolved first Key: SPARK-10130 URL: https://issues.apache.org/jira/browse/SPARK-10130 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Fix For: 1.5.0 SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704534#comment-14704534 ] Littlestar commented on SPARK-2883: --- Spark 1.4.1: the ORC file writer relies on HiveContext and the Hive metastore. Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: New Feature Components: Input/Output, SQL Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Critical Fix For: 1.4.0 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in Spark, fix issues if any exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
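A short illustration of that dependency (the paths are placeholders): in Spark 1.4.x the ORC data source lives in the sql/hive module, so reading and writing ORC goes through a HiveContext rather than a plain SQLContext.
{code:title=OrcWithHiveContext.scala}
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext; a plain SQLContext does not provide the
// "orc" format in 1.4.x, hence the HiveContext (and Hive metastore) dependency.
val hiveContext = new HiveContext(sc)

val df = hiveContext.read.format("orc").load("/tmp/input.orc")
df.write.format("orc").save("/tmp/output.orc")
{code}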
[jira] [Created] (SPARK-10132) daemon crash caused by memory leak
ZemingZhao created SPARK-10132: -- Summary: daemon crash caused by memory leak Key: SPARK-10132 URL: https://issues.apache.org/jira/browse/SPARK-10132 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1, 1.3.1 Environment: 1. Cluster: 7-node Red Hat cluster, each node has 32 cores 2. OS type: Red Hat Enterprise Linux Server release 7.1 (Maipo) 3. Java version: tried both Oracle JDK 1.6 and 1.7 java version 1.6.0_13 Java(TM) SE Runtime Environment (build 1.6.0_13-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) java version 1.7.0 Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) 4. JVM options in spark-env.sh. Note: SPARK_DAEMON_MEMORY was set to 300M to speed up the crash process. SPARK_DAEMON_JAVA_OPTS=-Xloggc:/root/spark/oracle_gclog SPARK_DAEMON_MEMORY=300m Reporter: ZemingZhao Priority: Critical Constantly submitting short batch workloads onto Spark causes the Spark master and worker to crash from a memory leak. According to the GC log and jmap info, the leak appears to be related to Akka, but the root cause has not been found yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10134) Improve the performance of Binary Comparison
Cheng Hao created SPARK-10134: - Summary: Improve the performance of Binary Comparison Key: SPARK-10134 URL: https://issues.apache.org/jira/browse/SPARK-10134 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Fix For: 1.6.0 Currently, comparing binary values byte by byte is quite slow; use the Guava utility, which takes 8 bytes at a time, to improve comparison performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
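A minimal sketch of the suggested approach (assuming Guava on the classpath; the wrapper object is illustrative): Guava's lexicographical unsigned-byte comparator picks an Unsafe-backed implementation at runtime when it can, reading 8 bytes per step instead of 1.
{code:title=BinaryComparison.scala}
import com.google.common.primitives.UnsignedBytes

object BinaryComparison {
  // Comparator<byte[]> that compares unsigned bytes lexicographically,
  // 8 bytes at a time where the platform allows it.
  private val comparator = UnsignedBytes.lexicographicalComparator()

  def compare(left: Array[Byte], right: Array[Byte]): Int =
    comparator.compare(left, right)
}
{code}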
[jira] [Resolved] (SPARK-10132) daemon crash caused by memory leak
[ https://issues.apache.org/jira/browse/SPARK-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10132. --- Resolution: Invalid I don't think any of this suggests a problem in Spark though, right? You just ran out of memory. I'm provisionally closing this since there is no detail about Spark here, a reproduction, or an argument that there is a memory leak. It can be reopened if this detail is provided. daemon crash caused by memory leak -- Key: SPARK-10132 URL: https://issues.apache.org/jira/browse/SPARK-10132 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1, 1.4.1 Environment: 1. Cluster: 7-node Red Hat cluster, each node has 32 cores 2. OS type: Red Hat Enterprise Linux Server release 7.1 (Maipo) 3. Java version: tried both Oracle JDK 1.6 and 1.7 java version 1.6.0_13 Java(TM) SE Runtime Environment (build 1.6.0_13-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) java version 1.7.0 Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) 4. JVM options in spark-env.sh. Note: SPARK_DAEMON_MEMORY was set to 300M to speed up the crash process. SPARK_DAEMON_JAVA_OPTS=-Xloggc:/root/spark/oracle_gclog SPARK_DAEMON_MEMORY=300m Reporter: ZemingZhao Priority: Critical Attachments: oracle_gclog, xqjmap.all, xqjmap.live Constantly submitting short batch workloads onto Spark causes the Spark master and worker to crash from a memory leak. According to the GC log and jmap info, the leak appears to be related to Akka, but the root cause has not been found yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9089: --- Component/s: (was: PySpark) Spark Core Failing to run simple job on Spark Standalone Cluster - Key: SPARK-9089 URL: https://issues.apache.org/jira/browse/SPARK-9089 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.4.0 Environment: Staging Reporter: Amar Goradia Priority: Critical We are trying out Spark and as part of that, we have setup Standalone Spark Cluster. As part of testing things out, we simple open PySpark shell and ran this simple job: a=sc.parallelize([1,2,3]).count() As a result, we are getting errors. We tried googling around this error but haven't been able to find exact reasoning behind why we are running into this state. Can somebody please help us further look into this issue and advise us on what we are missing here? Here is full error stack: a=sc.parallelize([1,2,3]).count() 15/07/16 00:52:15 INFO SparkContext: Starting job: count at stdin:1 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at stdin:1) with 2 output partitions (allowLocal=false) 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at stdin:1) 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List() 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List() 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] at count at stdin:1), which has no missing parents 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at stdin:1) failed in Unknown s 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at stdin:1, took 0.004963 s Traceback (most recent call last): File stdin, line 1, in module File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 972, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 963, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 771, in reduce vals = self.mapPartitions(func).collect() File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 745, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) java.lang.reflect.Constructor.newInstance(Constructor.java:526) org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68) org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60) org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73) org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815) org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799) org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419) org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411) org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257) at
[jira] [Resolved] (SPARK-10133) loadLibSVMFile fails to detect zero-based lines
[ https://issues.apache.org/jira/browse/SPARK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10133. --- Resolution: Not A Problem No, because the indices have already had 1 subtracted from them. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L82 I'm sure about this since I also made the same mistake when reviewing this change; on that basis I'm provisionally closing this. loadLibSVMFile fails to detect zero-based lines --- Key: SPARK-10133 URL: https://issues.apache.org/jira/browse/SPARK-10133 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.1 Reporter: Xusen Yin Priority: Minor https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L88 The code wants to ensure that the indices on each vector line are one-based and in ascending order, but it fails since previous = -1 at the beginning. Under this condition, a libSVM-format file that begins with a 0-based index can be read in normally, but the size of the corresponding SparseVector is wrong, i.e. numFeatures - 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
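To make the resolution concrete, here is a stripped-down sketch (hypothetical record, not the MLUtils code itself) of the parsing order: indices are shifted to zero-based before the ascending-order check, so previous = -1 is a valid sentinel for the first index.
{code:title=LibSVMIndexCheck.scala}
// "1 1:0.5 3:0.25" is a one-based libSVM record: label 1, features 1 and 3.
val line = "1 1:0.5 3:0.25"
val items = line.split(' ')

// Indices are converted to zero-based first (the subtraction at MLUtils.scala#L82)...
val indices = items.tail.map(_.split(':')(0).toInt - 1)

// ...so the check sees 0 and 2 here, and previous = -1 rejects exactly the
// records whose original indices were zero-based or out of order.
var previous = -1
indices.foreach { current =>
  require(current > previous, "indices should be one-based and in ascending order")
  previous = current
}
{code}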
[jira] [Commented] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704559#comment-14704559 ] Yanbo Liang commented on SPARK-9089: I think this issue is due to failure of compression codec construction and it throws {{InvocationTargetException}}. It usually happened when Snappy was configured as the compression codec. Here we can catch the {{InvocationTargetException}} and throw {{IllegalArgumentException}} that tells users to switch to fallback compression codec such as LZF. Failing to run simple job on Spark Standalone Cluster - Key: SPARK-9089 URL: https://issues.apache.org/jira/browse/SPARK-9089 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.4.0 Environment: Staging Reporter: Amar Goradia Priority: Critical We are trying out Spark and as part of that, we have setup Standalone Spark Cluster. As part of testing things out, we simple open PySpark shell and ran this simple job: a=sc.parallelize([1,2,3]).count() As a result, we are getting errors. We tried googling around this error but haven't been able to find exact reasoning behind why we are running into this state. Can somebody please help us further look into this issue and advise us on what we are missing here? Here is full error stack: a=sc.parallelize([1,2,3]).count() 15/07/16 00:52:15 INFO SparkContext: Starting job: count at stdin:1 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at stdin:1) with 2 output partitions (allowLocal=false) 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at stdin:1) 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List() 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List() 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] at count at stdin:1), which has no missing parents 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at stdin:1) failed in Unknown s 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at stdin:1, took 0.004963 s Traceback (most recent call last): File stdin, line 1, in module File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 972, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 963, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 771, in reduce vals = self.mapPartitions(func).collect() File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py, line 745, in collect port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) java.lang.reflect.Constructor.newInstance(Constructor.java:526) org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68) org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60) org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73) org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815) org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799) org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419) org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
[jira] [Resolved] (SPARK-8854) Documentation for Association Rules
[ https://issues.apache.org/jira/browse/SPARK-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8854. -- Resolution: Duplicate Documentation for Association Rules --- Key: SPARK-8854 URL: https://issues.apache.org/jira/browse/SPARK-8854 Project: Spark Issue Type: Documentation Components: MLlib Reporter: Feynman Liang Priority: Minor Documentation describing how to generate association rules from frequent itemsets needs to be provided. The relevant method is {{FPGrowthModel.generateAssociationRules}}. This will likely be added to the existing section for frequent-itemsets using FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10143: - Component/s: SQL Parquet changed the behavior of calculating splits -- Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical When Parquet's task side metadata is enabled (it is enabled by default, and it needs to be enabled to deal with tables with many files), Parquet delegates the work of calculating initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If the filesystem's block size is smaller than the row group size and users do not set a min split size, the initial split list will contain lots of dummy splits, which contribute to empty tasks (because such a split's starting and ending points do not cover the starting point of any row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
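One possible mitigation, sketched under the assumption of a 128 MB row group (both keys are standard Hadoop/Parquet configuration names; the sizes are illustrative, not a recommendation from this ticket): raise the minimum split size so initial splits cannot straddle row-group boundaries and turn into empty tasks.
{code:title=MinSplitSizeWorkaround.scala}
// sc is an existing SparkContext. Align FileInputFormat's minimum split size
// with the Parquet row group size so every split covers at least one
// row-group starting point.
val rowGroupSize = 128L * 1024 * 1024
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", rowGroupSize)
sc.hadoopConfiguration.setLong("parquet.block.size", rowGroupSize)
{code}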
[jira] [Commented] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706205#comment-14706205 ] Yin Huai commented on SPARK-10146: -- One possible way to do it is that every data source defines a list of confs that can be applied to its reader/writer and we let users set those confs in SQLConf or through data source options. Then, we propagate those confs to the reader/writer. Have an easy way to set data source reader/writer specific confs Key: SPARK-10146 URL: https://issues.apache.org/jira/browse/SPARK-10146 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, it is hard to set data source reader/writer-specific confs correctly (e.g. Parquet's row group size). Users need to set those confs in the Hadoop conf before starting the application, or through {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be great to have an easy way to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
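For contrast, the workaround named in the description next to the shape the proposal might take (the option pass-through is hypothetical; today data source options are not automatically propagated to the Hadoop conf):
{code:title=DataSourceConfSketch.scala}
// Today: set a reader/writer-specific conf globally, before the job runs.
org.apache.spark.deploy.SparkHadoopUtil.get.conf
  .setInt("parquet.block.size", 128 * 1024 * 1024)

// Proposed direction: pass the same conf per relation as a data source
// option (hypothetical behavior, shown only to illustrate the idea;
// sqlContext is an existing SQLContext).
val df = sqlContext.read
  .format("parquet")
  .option("parquet.block.size", (128 * 1024 * 1024).toString)
  .load("/tmp/table")
{code}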
[jira] [Updated] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10146: - Issue Type: Improvement (was: Bug) Have an easy way to set data source reader/writer specific confs Key: SPARK-10146 URL: https://issues.apache.org/jira/browse/SPARK-10146 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, it is hard to set data source reader/writer-specific confs correctly (e.g. Parquet's row group size). Users need to set those confs in the Hadoop conf before starting the application, or through {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be great to have an easy way to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
Yin Huai created SPARK-10146: Summary: Have an easy way to set data source reader/writer specific confs Key: SPARK-10146 URL: https://issues.apache.org/jira/browse/SPARK-10146 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical Right now, it is hard to set data source reader/writer-specific confs correctly (e.g. Parquet's row group size). Users need to set those confs in the Hadoop conf before starting the application, or through {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be great to have an easy way to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-10147: - Description: Phenomenon: the app still shows in the HistoryServer web UI when the event file has been deleted on HDFS. Cause: the *log-replay-executor* thread and the *clean log* thread both write to the *application* object, so there is a synchronization problem. was: It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem App shouldn't show in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Phenomenon: the app still shows in the HistoryServer web UI when the event file has been deleted on HDFS. Cause: the *log-replay-executor* thread and the *clean log* thread both write to the *application* object, so there is a synchronization problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
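A minimal sketch of the fix direction implied by that cause (hypothetical structure, not the actual history server code): serialize the two writers on a common lock so the listing never exposes a half-applied update.
{code:title=ApplicationListing.scala}
class ApplicationListing {
  private val lock = new Object
  // appId -> event log path; written by both the replay and the cleaner threads.
  private var applications = Map.empty[String, String]

  // Called from the log-replay-executor thread.
  def update(appId: String, logPath: String): Unit = lock.synchronized {
    applications += appId -> logPath
  }

  // Called from the clean-log thread after event files are deleted on HDFS.
  def removeDeleted(deletedPaths: Set[String]): Unit = lock.synchronized {
    applications = applications.filterNot { case (_, path) => deletedPaths(path) }
  }

  def listAppIds: Seq[String] = lock.synchronized { applications.keys.toSeq }
}
{code}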
[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-10147: - Summary: App shouldn't show in HistoryServer web when the event file has been deleted on hdfs (was: App still shows in HistoryServer web when the event file has been deleted on hdfs) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9983) Local physical operators for query execution
[ https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9983: --- Description: In distributed query execution, there are two kinds of operators: (1) operators that exchange data between different executors or threads: examples include broadcast, shuffle. (2) operators that process data in a single thread: examples include project, filter, group by, etc. This ticket proposes clearly differentiating them and creating local operators in Spark. This leads to a lot of benefits: easier to test, easier to optimize data exchange, better design (single responsibility), and potentially even having a hyper-optimized single-node version of DataFrame. was: In distributed query execution, there are two kinds of operators: (1) operators that exchange data between different executors or threads: examples include broadcast, shuffle. (2) operators that process data in a single thread: examples include project, filter, group by, etc. This ticket proposes clearly differentiating them and create local operators in Spark. This leads to a lot of benefits: easier to test, easier to optimize data exchange, and better design (single responsibility). Local physical operators for query execution Key: SPARK-9983 URL: https://issues.apache.org/jira/browse/SPARK-9983 Project: Spark Issue Type: Story Components: SQL Reporter: Reynold Xin Assignee: Shixiong Zhu In distributed query execution, there are two kinds of operators: (1) operators that exchange data between different executors or threads: examples include broadcast, shuffle. (2) operators that process data in a single thread: examples include project, filter, group by, etc. This ticket proposes clearly differentiating them and creating local operators in Spark. This leads to a lot of benefits: easier to test, easier to optimize data exchange, better design (single responsibility), and potentially even having a hyper-optimized single-node version of DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
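A toy illustration of the proposed split (hypothetical traits, far simpler than anything Spark would ship): a local operator is a single-threaded iterator transformation that can be unit-tested without a cluster, while exchange operators live behind a separate interface.
{code:title=LocalNodeSketch.scala}
// Local operators: single-threaded, iterator in, iterator out. Exchange
// operators (broadcast, shuffle) would be modeled separately.
trait LocalNode[I, O] {
  def execute(input: Iterator[I]): Iterator[O]
}

case class FilterNode[T](predicate: T => Boolean) extends LocalNode[T, T] {
  override def execute(input: Iterator[T]): Iterator[T] = input.filter(predicate)
}

case class ProjectNode[I, O](project: I => O) extends LocalNode[I, O] {
  override def execute(input: Iterator[I]): Iterator[O] = input.map(project)
}

// Testable in isolation, one of the benefits listed above:
// FilterNode[Int](_ > 0).execute(Iterator(-1, 2, 3)).toList == List(2, 3)
{code}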
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706186#comment-14706186 ] Saisai Shao commented on SPARK-10122: - Hi [~aramesh], thanks a lot for pointing this out. This is actually a bug, sorry for not covering it in the unit test. The problem is Python will compact a series of {{TransformedDStream}} into one: {code} if (isinstance(prev, TransformedDStream) and not prev.is_cached and not prev.is_checkpointed): prev_func = prev.func self.func = lambda t, rdd: func(t, prev_func(t, rdd)) self.prev = prev.prev {code} As {{KafkaTransformedDStream}} is a subclass of {{TransformedDStream}}, so it will be compacted to replace with its parent DStream, as the code shows {{self.prev = prev.prev}}, which is a DStream, get offset ranges on DStream will throw an exception as you mentioned before. I will submit a PR to fix this, so you could try with the patch to see if it is fixed. AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed. {code:title=kafka_test.py|collapse=true} from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils def attach_kafka_metadata(kafka_rdd): offset_ranges = kafka_rdd.offsetRanges() return kafka_rdd if __name__ == __main__: sc = SparkContext(appName='kafka-test') ssc = StreamingContext(sc, 10) kafka_stream = KafkaUtils.createDirectStream( ssc, [TOPIC], kafkaParams={ 'metadata.broker.list': BROKERS, }, ) kafka_stream.transform(attach_kafka_metadata).count().pprint() ssc.start() ssc.awaitTermination() {code} {code:title=Stack trace|collapse=true} Traceback (most recent call last): File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, line 62, in call r = self.func(t, *rdds) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, line 332, in lambda func = lambda t, rdd: oldfunc(rdd) File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in attach_kafka_metadata offset_ranges = kafka_rdd.offsetRanges() AttributeError: 'RDD' object has no attribute 'offsetRanges' {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706202#comment-14706202 ] Apache Spark commented on SPARK-10122: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/8347 AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed. {code:title=kafka_test.py|collapse=true} from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils def attach_kafka_metadata(kafka_rdd): offset_ranges = kafka_rdd.offsetRanges() return kafka_rdd if __name__ == __main__: sc = SparkContext(appName='kafka-test') ssc = StreamingContext(sc, 10) kafka_stream = KafkaUtils.createDirectStream( ssc, [TOPIC], kafkaParams={ 'metadata.broker.list': BROKERS, }, ) kafka_stream.transform(attach_kafka_metadata).count().pprint() ssc.start() ssc.awaitTermination() {code} {code:title=Stack trace|collapse=true} Traceback (most recent call last): File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, line 62, in call r = self.func(t, *rdds) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 616, in lambda self.func = lambda t, rdd: func(t, prev_func(t, rdd)) File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, line 332, in lambda func = lambda t, rdd: oldfunc(rdd) File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in attach_kafka_metadata offset_ranges = kafka_rdd.offsetRanges() AttributeError: 'RDD' object has no attribute 'offsetRanges' {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10146) Have an easy way to set data source reader/writer specific confs
[ https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706205#comment-14706205 ] Yin Huai edited comment on SPARK-10146 at 8/21/15 3:42 AM: --- One possible way is that every data source defines a list of confs that can be applied to its reader/writer and we let users set those confs in SQLConf or through data source options. Then, we propagate those confs to the reader/writer. was (Author: yhuai): One possible way to do it is that every data source defines a list of confs that can be applied to its reader/writer and we let users set those confs in SQLConf or through data source options. Then, we propagate those confs to the reader/writer. Have an easy way to set data source reader/writer specific confs Key: SPARK-10146 URL: https://issues.apache.org/jira/browse/SPARK-10146 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, it is hard to set data source reader/writer-specific confs correctly (e.g. Parquet's row group size). Users need to set those confs in the Hadoop conf before starting the application, or through {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be great to have an easy way to set those confs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10147) App still shows in HistoryServer web when the event file has been deleted on hdfs
meiyoula created SPARK-10147: Summary: App still shows in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula It is because *log-replay-executor* thread and *clean log* thread both will write value to object *application*, so it has synchronization problem -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8467) Add LDAModel.describeTopics() in Python
[ https://issues.apache.org/jira/browse/SPARK-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706257#comment-14706257 ] Hrishikesh commented on SPARK-8467: --- [~yuu.ishik...@gmail.com], are you still working on this? Add LDAModel.describeTopics() in Python --- Key: SPARK-8467 URL: https://issues.apache.org/jira/browse/SPARK-8467 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Yu Ishikawa Add LDAModel.describeTopics() in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9669: --- Assignee: (was: Apache Spark) Support PySpark with Mesos Cluster mode --- Key: SPARK-9669 URL: https://issues.apache.org/jira/browse/SPARK-9669 Project: Spark Issue Type: Improvement Components: Mesos, PySpark Reporter: Timothy Chen PySpark with cluster mode with Mesos is not yet supported. We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9669: --- Assignee: Apache Spark Support PySpark with Mesos Cluster mode --- Key: SPARK-9669 URL: https://issues.apache.org/jira/browse/SPARK-9669 Project: Spark Issue Type: Improvement Components: Mesos, PySpark Reporter: Timothy Chen Assignee: Apache Spark PySpark with cluster mode with Mesos is not yet supported. We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9669) Support PySpark with Mesos Cluster mode
[ https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706262#comment-14706262 ] Apache Spark commented on SPARK-9669: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/8349 Support PySpark with Mesos Cluster mode --- Key: SPARK-9669 URL: https://issues.apache.org/jira/browse/SPARK-9669 Project: Spark Issue Type: Improvement Components: Mesos, PySpark Reporter: Timothy Chen PySpark with cluster mode with Mesos is not yet supported. We need to enable it and make sure it's able to launch Pyspark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9848) Add @Since annotation to new public APIs in 1.5
[ https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706178#comment-14706178 ] Xiangrui Meng commented on SPARK-9848: -- No, that would be too much for this release. We plan to do that after 1.5. Add @Since annotation to new public APIs in 1.5 --- Key: SPARK-9848 URL: https://issues.apache.org/jira/browse/SPARK-9848 Project: Spark Issue Type: Sub-task Components: Documentation, ML, MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Priority: Critical Labels: starter We should get a list of new APIs from SPARK-9660. cc: [~fliang] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
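For reference, this is roughly what the annotation looks like on a new public method inside Spark's own source tree (the class and method are hypothetical, and the {{Since}} annotation is package-private to Spark, so the example only compiles under the org.apache.spark package):
{code:title=SinceExample.scala}
package org.apache.spark.mllib.example // hypothetical; Since is private[spark]

import org.apache.spark.annotation.Since

class ExampleModel {
  /** Hypothetical public API whose first release was 1.5.0. */
  @Since("1.5.0")
  def newPublicMethod(): Unit = ()
}
{code}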
[jira] [Commented] (SPARK-8400) ml.ALS doesn't handle -1 block size
[ https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706182#comment-14706182 ] Xiangrui Meng commented on SPARK-8400: -- Sorry for my late reply! We check numBlocks in LocalIndexEncoder. However, I'm not sure whether this happens before any data shuffling. It might be better to check numUserBlocks and numItemBlocks directly. ml.ALS doesn't handle -1 block size --- Key: SPARK-8400 URL: https://issues.apache.org/jira/browse/SPARK-8400 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.1 Reporter: Xiangrui Meng Assignee: Bryan Cutler Under spark.mllib, if the number of blocks is set to -1, we set the block size automatically based on the input partition size. However, this behavior is not preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not work, but no error message will show. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
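A hedged sketch of the direct check suggested above (a hypothetical helper, not the ml.ALS code): validate the block counts up front, before any data is shuffled, while keeping the spark.mllib convention that -1 means "choose automatically".
{code:title=ResolveNumBlocks.scala}
// Resolve and validate a block count; -1 preserves the spark.mllib behavior
// of deriving a default from the input's partitioning.
def resolveNumBlocks(requested: Int, defaultParallelism: Int): Int = {
  require(requested > 0 || requested == -1,
    s"number of blocks must be positive or -1 (auto), but got $requested")
  if (requested == -1) defaultParallelism else requested
}
{code}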
[jira] [Updated] (SPARK-10137) Avoid to restart receivers if scheduleReceivers returns balanced results
[ https://issues.apache.org/jira/browse/SPARK-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10137: -- Assignee: Shixiong Zhu Avoid to restart receivers if scheduleReceivers returns balanced results Key: SPARK-10137 URL: https://issues.apache.org/jira/browse/SPARK-10137 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Critical In some cases, even if scheduleReceivers returns balanced results, ReceiverTracker still may reject some receivers and force them to restart. See my PR for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10137) Avoid to restart receivers if scheduleReceivers returns balanced results
[ https://issues.apache.org/jira/browse/SPARK-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10137: -- Priority: Critical (was: Major) Avoid to restart receivers if scheduleReceivers returns balanced results Key: SPARK-10137 URL: https://issues.apache.org/jira/browse/SPARK-10137 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Priority: Critical In some cases, even if scheduleReceivers returns balanced results, ReceiverTracker still may reject some receivers and force them to restart. See my PR for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706139#comment-14706139 ] Baogang Wang edited comment on SPARK-10145 at 8/21/15 3:27 AM: --- spark.serializer org.apache.spark.serializer.KryoSerializer spark.akka.frameSize 1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 spark.akka.timeout 900 spark.storage.memoryFraction 0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 was (Author: heayin): # Default system properties included when running spark-submit. # This is useful for setting default environmental settings. # Example: # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers=one two three #spark.core.connection.ack.wait.timeout 3600 #spark.core.connection.auth.wait.timeout 3600 spark.akka.frameSize 1024 spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 spark.akka.timeout 900 spark.storage.memoryFraction 0.4 spark.rdd.compress true spark.shuffle.blockTransferService nio spark.yarn.executor.memoryOverhead 1024 Executor exit without useful messages when spark runs in spark-streaming Key: SPARK-10145 URL: https://issues.apache.org/jira/browse/SPARK-10145 Project: Spark Issue Type: Bug Components: Streaming, YARN Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 cores and 32g memory Reporter: Baogang Wang Priority: Critical Original Estimate: 168h Remaining Estimate: 168h Each node is allocated 30g memory by YARN. My application receives messages from Kafka via direct stream. Each application consists of 4 DStream windows. The Spark application is submitted by this command: spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name safeSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties After about 1 hour, some executors exit. There are no more YARN logs after an executor exits, and there is no stack trace when it exits.
When I see the yarn node manager log, it shows as follows : 2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_01 by user root 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=172.19.160.102 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_01 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING 2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_01 to application application_1439803298368_0005 2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished. 2015-08-17 17:25:41,663 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from INITING to RUNNING 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_01 transitioned from NEW to LOCALIZING 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices:
[jira] [Assigned] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10147: Assignee: (was: Apache Spark) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Phenomenon: the app still shows in the HistoryServer web UI when the event file has been deleted on HDFS. Cause: the *log-replay-executor* thread and the *clean log* thread both write to the *application* object, so there is a synchronization problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706217#comment-14706217 ] Apache Spark commented on SPARK-10147: -- User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/8348 App shouldn't show in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Phenomenon: the app still shows in the HistoryServer web UI when the event file has been deleted on HDFS. Cause: the *log-replay-executor* thread and the *clean log* thread both write to the *application* object, so there is a synchronization problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs
[ https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10147: Assignee: Apache Spark App shouldn't show in HistoryServer web when the event file has been deleted on hdfs Key: SPARK-10147 URL: https://issues.apache.org/jira/browse/SPARK-10147 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Assignee: Apache Spark Phenomenon: the app still shows in the HistoryServer web UI when the event file has been deleted on HDFS. Cause: the *log-replay-executor* thread and the *clean log* thread both write to the *application* object, so there is a synchronization problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706244#comment-14706244 ] Reynold Xin commented on SPARK-9999: This needs to be designed first. I'm not sure if static code analysis is a great idea, since it often fails. I'm open to ideas though. RDD-like API on top of Catalyst/DataFrame - Key: SPARK-9999 URL: https://issues.apache.org/jira/browse/SPARK-9999 Project: Spark Issue Type: Story Components: SQL Reporter: Reynold Xin The RDD API is very flexible, and as a result it is harder to optimize its execution in some cases. The DataFrame API, on the other hand, is much easier to optimize, but lacks some of the nice perks of the RDD API (e.g. it is harder to use UDFs, and there is a lack of strong types in Scala/Java). As a Spark user, I want an API that sits somewhere in the middle of the spectrum so I can write most of my applications with that API, and yet it can be optimized well by Spark to achieve performance and stability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10142) Python checkpoint recovery does not work with non-local file path
Tathagata Das created SPARK-10142: - Summary: Python checkpoint recovery does not work with non-local file path Key: SPARK-10142 URL: https://issues.apache.org/jira/browse/SPARK-10142 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10144: -- Summary: Actually show peak execution memory on UI by default (was: Actually show peak execution memory by default) Actually show peak execution memory on UI by default Key: SPARK-10144 URL: https://issues.apache.org/jira/browse/SPARK-10144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10144) Actually show peak execution memory by default
Andrew Or created SPARK-10144: - Summary: Actually show peak execution memory by default Key: SPARK-10144 URL: https://issues.apache.org/jira/browse/SPARK-10144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
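For reference, the opt-in the description refers to looked roughly like this before the flag became the default; the app name is a hypothetical placeholder, and the conf key is the one quoted in the issue.
{code:title=Pre-default opt-in (sketch)}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("peak-memory-demo")
  // Previously had to be set explicitly for the metric to show;
  // no longer needed once Tungsten is on by default.
  .set("spark.sql.unsafe.enabled", "true")
{code}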
[jira] [Reopened] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Ramesh reopened SPARK-10122: - AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Priority: Critical Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed.
{code:title=kafka_test.py|collapse=true}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def attach_kafka_metadata(kafka_rdd):
    offset_ranges = kafka_rdd.offsetRanges()
    return kafka_rdd

if __name__ == '__main__':
    sc = SparkContext(appName='kafka-test')
    ssc = StreamingContext(sc, 10)
    kafka_stream = KafkaUtils.createDirectStream(
        ssc,
        [TOPIC],
        kafkaParams={
            'metadata.broker.list': BROKERS,
        },
    )
    kafka_stream.transform(attach_kafka_metadata).count().pprint()
    ssc.start()
    ssc.awaitTermination()
{code}
{code:title=Stack trace|collapse=true}
Traceback (most recent call last):
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 62, in call
    r = self.func(t, *rdds)
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 332, in <lambda>
    func = lambda t, rdd: oldfunc(rdd)
  File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in attach_kafka_metadata
    offset_ranges = kafka_rdd.offsetRanges()
AttributeError: 'RDD' object has no attribute 'offsetRanges'
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704674#comment-14704674 ] Amit Ramesh edited comment on SPARK-10122 at 8/20/15 10:51 AM: --- [~srowen] as you can see in the example, offsetRanges() is being applied to the initial RDD as part of the transform operation. And the code works fine if the line 'kafka_stream.transform(attach_kafka_metadata).count().pprint()' is changed to 'kafka_stream.transform(attach_kafka_metadata).pprint()'. was (Author: aramesh): [~srowen] as you can see in the example, offsetRanges() is being applied to the initial RDD as part of the transform operation. And the code works file if the line 'kafka_stream.transform(attach_kafka_metadata).count().pprint()' is changed to 'kafka_stream.transform(attach_kafka_metadata).pprint()'. AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Priority: Critical Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed.
{code:title=kafka_test.py|collapse=true}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def attach_kafka_metadata(kafka_rdd):
    offset_ranges = kafka_rdd.offsetRanges()
    return kafka_rdd

if __name__ == '__main__':
    sc = SparkContext(appName='kafka-test')
    ssc = StreamingContext(sc, 10)
    kafka_stream = KafkaUtils.createDirectStream(
        ssc,
        [TOPIC],
        kafkaParams={
            'metadata.broker.list': BROKERS,
        },
    )
    kafka_stream.transform(attach_kafka_metadata).count().pprint()
    ssc.start()
    ssc.awaitTermination()
{code}
{code:title=Stack trace|collapse=true}
Traceback (most recent call last):
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 62, in call
    r = self.func(t, *rdds)
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 332, in <lambda>
    func = lambda t, rdd: oldfunc(rdd)
  File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in attach_kafka_metadata
    offset_ranges = kafka_rdd.offsetRanges()
AttributeError: 'RDD' object has no attribute 'offsetRanges'
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10122: -- Priority: Major (was: Critical) Ah I see now, you are not operating on the transformed stream resulting from count(); it comes after and not before in the example. This should be OK in the Scala API in my experience, but it looks like Python operates a little differently. When a transformation is applied to a transformed stream, it collapses the two transformations. So the count + kafka offset function are turned into one, which is applied to the raw DStream behind the Kafka DStream and it fails. I think. [~jerryshao] [~davies] [~tdas] worth a look. It may be a case for not implementing the dstream transformation this way, if this guess is right. AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed.
{code:title=kafka_test.py|collapse=true}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def attach_kafka_metadata(kafka_rdd):
    offset_ranges = kafka_rdd.offsetRanges()
    return kafka_rdd

if __name__ == '__main__':
    sc = SparkContext(appName='kafka-test')
    ssc = StreamingContext(sc, 10)
    kafka_stream = KafkaUtils.createDirectStream(
        ssc,
        [TOPIC],
        kafkaParams={
            'metadata.broker.list': BROKERS,
        },
    )
    kafka_stream.transform(attach_kafka_metadata).count().pprint()
    ssc.start()
    ssc.awaitTermination()
{code}
{code:title=Stack trace|collapse=true}
Traceback (most recent call last):
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 62, in call
    r = self.func(t, *rdds)
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 332, in <lambda>
    func = lambda t, rdd: oldfunc(rdd)
  File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in attach_kafka_metadata
    offset_ranges = kafka_rdd.offsetRanges()
AttributeError: 'RDD' object has no attribute 'offsetRanges'
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'
[ https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704674#comment-14704674 ] Amit Ramesh commented on SPARK-10122: - [~srowen] as you can see in the example, offsetRanges() is being applied to the initial RDD as part of the transform operation. And the code works fine if the line 'kafka_stream.transform(attach_kafka_metadata).count().pprint()' is changed to 'kafka_stream.transform(attach_kafka_metadata).pprint()'. AttributeError: 'RDD' object has no attribute 'offsetRanges' Key: SPARK-10122 URL: https://issues.apache.org/jira/browse/SPARK-10122 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Amit Ramesh Priority: Critical Labels: kafka SPARK-8389 added the offsetRanges interface to Kafka direct streams. This however appears to break when chaining operations after a transform operation. Following is example code that would result in an error (stack trace below). Note that if the 'count()' operation is taken out of the example code then this error does not occur anymore, and the Kafka data is printed.
{code:title=kafka_test.py|collapse=true}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def attach_kafka_metadata(kafka_rdd):
    offset_ranges = kafka_rdd.offsetRanges()
    return kafka_rdd

if __name__ == '__main__':
    sc = SparkContext(appName='kafka-test')
    ssc = StreamingContext(sc, 10)
    kafka_stream = KafkaUtils.createDirectStream(
        ssc,
        [TOPIC],
        kafkaParams={
            'metadata.broker.list': BROKERS,
        },
    )
    kafka_stream.transform(attach_kafka_metadata).count().pprint()
    ssc.start()
    ssc.awaitTermination()
{code}
{code:title=Stack trace|collapse=true}
Traceback (most recent call last):
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 62, in call
    r = self.func(t, *rdds)
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
  File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 332, in <lambda>
    func = lambda t, rdd: oldfunc(rdd)
  File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in attach_kafka_metadata
    offset_ranges = kafka_rdd.offsetRanges()
AttributeError: 'RDD' object has no attribute 'offsetRanges'
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
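For contrast with the failing Python chain, the comments above report that the equivalent pattern works in the Scala API, roughly as sketched below; kafkaStream stands in for a stream obtained from KafkaUtils.createDirectStream.
{code:title=Scala equivalent (sketch)}
import org.apache.spark.streaming.kafka.HasOffsetRanges

val withOffsets = kafkaStream.transform { rdd =>
  // Capture the offsets while the KafkaRDD is still in hand.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... record offsetRanges somewhere ...
  rdd
}
withOffsets.count().print()
{code}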
[jira] [Commented] (SPARK-10015) ML model broadcasts should be stored in private vars: spark.ml tree ensembles
[ https://issues.apache.org/jira/browse/SPARK-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704740#comment-14704740 ] Sameer Abhyankar commented on SPARK-10015: -- [~josephkb] I have created a common trait in the PR for SPARK-10017 that we can reuse for all of the other related JIRAs (10015-10020). If the PR for SPARK-10017 looks ok and is merged, I can update the PRs for the other JIRAs. Thx! ML model broadcasts should be stored in private vars: spark.ml tree ensembles - Key: SPARK-10015 URL: https://issues.apache.org/jira/browse/SPARK-10015 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter See parent for details. Applies to: * GBTClassifier * RandomForestClassifier * GBTRegressor * RandomForestRegressor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10092) Multi-DB support follow up
[ https://issues.apache.org/jira/browse/SPARK-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10092. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8336 [https://github.com/apache/spark/pull/8336] Multi-DB support follow up -- Key: SPARK-10092 URL: https://issues.apache.org/jira/browse/SPARK-10092 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.5.0 Seems we need some follow-up work for our multi-DB support. Here are the issues we need to address:
1. saveAsTable always saves the table in the folder of the current database.
2. HiveContext's refreshTable and analyze do not support dbName.tableName.
3. It would be good to use TableIdentifier in CreateTableUsing, CreateTableUsingAsSelect, CreateTempTableUsing, CreateTempTableUsingAsSelect, CreateMetastoreDataSource, and CreateMetastoreDataSourceAsSelect, instead of the string representation (actually, in several places we have already parsed the string and obtained the TableIdentifier).
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
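Item 3 above is about threading a parsed identifier through the plans instead of re-splitting a "db.table" string at each use site. A simplified stand-in (not the actual Catalyst class; fields assumed) to show the shape:
{code:title=TableIdentifier stand-in (sketch)}
// Simplified stand-in for Catalyst's TableIdentifier.
case class TableIdentifier(table: String, database: Option[String] = None) {
  def quotedString: String =
    database.map(db => s"`$db`.`$table`").getOrElse(s"`$table`")
}

val ident = TableIdentifier("events", Some("sales"))
println(ident.quotedString) // prints `sales`.`events`
{code}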
[jira] [Commented] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp
[ https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704712#comment-14704712 ] Apache Spark commented on SPARK-8436: - User 'x1-' has created a pull request for this issue: https://github.com/apache/spark/pull/8339 Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp --- Key: SPARK-8436 URL: https://issues.apache.org/jira/browse/SPARK-8436 Project: Spark Issue Type: Improvement Components: SQL Reporter: Le Minh Tu Priority: Minor I'm aware that when converting from Integer/LongType to Timestamp, the column's values should be in milliseconds. However, I was surprised when trying to do this `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()` and got back a totally different datetime ('event_time' is initially a TimestampType). There must be some constraints in implementation that I'm not aware of but it would be nice if a double conversion like this returns the initial value as one might expect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
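The round trip the reporter describes, written out in Scala for concreteness; df is assumed to have a TimestampType column named event_time, and per the report the result may not equal the original value.
{code:title=Timestamp round trip (sketch)}
import org.apache.spark.sql.functions.col

// Cast to long and back; the reporter expects this to be the identity.
val roundTripped = df.select(col("event_time").cast("long").cast("timestamp"))
{code}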
[jira] [Assigned] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp
[ https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8436: --- Assignee: (was: Apache Spark) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp --- Key: SPARK-8436 URL: https://issues.apache.org/jira/browse/SPARK-8436 Project: Spark Issue Type: Improvement Components: SQL Reporter: Le Minh Tu Priority: Minor I'm aware that when converting from Integer/LongType to Timestamp, the column's values should be in milliseconds. However, I was surprised when trying to do this `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()` and got back a totally different datetime ('event_time' is initially a TimestampType). There must be some constraints in implementation that I'm not aware of but it would be nice if a double conversion like this returns the initial value as one might expect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
[ https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704742#comment-14704742 ] Herman van Hovell commented on SPARK-10100: --- Let's leave it for 1.6. AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction -- Key: SPARK-10100 URL: https://issues.apache.org/jira/browse/SPARK-10100 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Herman van Hovell Attachments: SPARK-10100.perf.test.scala Looks like Max (and probably Min) implemented on top of AggregateFunction2 is slower than the old MaxFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6196) Add MAPR 4.0.2 support to the build
[ https://issues.apache.org/jira/browse/SPARK-6196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704704#comment-14704704 ] Apache Spark commented on SPARK-6196: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/8338 Add MAPR 4.0.2 support to the build --- Key: SPARK-6196 URL: https://issues.apache.org/jira/browse/SPARK-6196 Project: Spark Issue Type: Improvement Components: Build Reporter: Trystan Leftwich Priority: Minor Labels: build Mapr 4.0.2 upgraded to use hadoop 2.5.1 and the current mapr build doesn't support building for 4.0.2 http://doc.mapr.com/display/RelNotes/Version+4.0.2+Release+Notes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS
[ https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704710#comment-14704710 ] Virgil Palanciuc commented on SPARK-10109: -- I think I know what caused this - the code I was using above was in a parallel foreach, and the outputPath was always the same; I thought there shouldn't be a problem since the pids were always different (and the write is partitioned by pid), but I guess some metadata/temporary files were not? After removing the parallel iteration on 'pid', it works fine. NPE when saving Parquet To HDFS --- Key: SPARK-10109 URL: https://issues.apache.org/jira/browse/SPARK-10109 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Environment: spark-ec2, standalone cluster on amazon Reporter: Virgil Palanciuc Very simple code, trying to save a DataFrame; I get this in the driver:
{quote}
15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null)

and (not for that task):

15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID 5607, 172.yy.yy.yy): java.lang.NullPointerException
        at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
        at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
        at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
        at org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88)
        at org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
        at org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
        at scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
        at scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
        at org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536)
        at org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{quote}
I get this in the executor log:
{quote}
15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at
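The workaround the reporter describes, sketched in Scala; pids, df, and outputPath are hypothetical stand-ins for the reporter's setup. The point is the sequential iteration: writing to the same outputPath from a parallel collection races on the shared _temporary metadata.
{code:title=Sequential writes per pid (sketch)}
// previously something like: pids.par.foreach { pid => ... }
pids.foreach { pid =>
  df.filter(df("pid") === pid)
    .write
    .partitionBy("pid")
    .parquet(outputPath)
}
{code}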
[jira] [Assigned] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp
[ https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8436: --- Assignee: Apache Spark Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp --- Key: SPARK-8436 URL: https://issues.apache.org/jira/browse/SPARK-8436 Project: Spark Issue Type: Improvement Components: SQL Reporter: Le Minh Tu Assignee: Apache Spark Priority: Minor I'm aware that when converting from Integer/LongType to Timestamp, the column's values should be in milliseconds. However, I was surprised when trying to do this `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()` and got back a totally different datetime ('event_time' is initially a TimestampType). There must be some constraints in implementation that I'm not aware of but it would be nice if a double conversion like this returns the initial value as one might expect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10135) Percent of pruned partitions is shown wrong
Romi Kuntsman created SPARK-10135: - Summary: Percent of pruned partitions is shown wrong Key: SPARK-10135 URL: https://issues.apache.org/jira/browse/SPARK-10135 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Romi Kuntsman Priority: Trivial When reading partitioned Parquet in SparkSQL, an info message about the number of pruned partitions is displayed. Actual: Selected 15 partitions out of 181, pruned -1106.7% partitions. Expected: Selected 15 partitions out of 181, pruned 91.71270718232044% partitions. Fix (I'm a newbie here, so please help make the patch, thanks!): in DataSourceStrategy.scala, in method apply(), instead of: val percentPruned = (1 - total.toDouble / selected.toDouble) * 100 it should be: val percentPruned = (1 - selected.toDouble / total.toDouble) * 100 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
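A quick check of the two formulas against the numbers in the report confirms the proposed fix:
{code:title=Verifying the fix}
val selected = 15.0
val total = 181.0
println((1 - total / selected) * 100) // -1106.67 (current, wrong)
println((1 - selected / total) * 100) //  91.71   (expected)
{code}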
[jira] [Commented] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706041#comment-14706041 ] Apache Spark commented on SPARK-8580: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8341 Add Parquet files generated by different systems to test interoperability and compatibility --- Key: SPARK-8580 URL: https://issues.apache.org/jira/browse/SPARK-8580 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 to improve interoperability with other systems (reading non-standard Parquet files they generate, and generating standard Parquet files), it would be good to have a set of standard test Parquet files generated by various systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8580: --- Assignee: Cheng Lian (was: Apache Spark) Add Parquet files generated by different systems to test interoperability and compatibility --- Key: SPARK-8580 URL: https://issues.apache.org/jira/browse/SPARK-8580 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 to improve interoperability with other systems (reading non-standard Parquet files they generate, and generating standard Parquet files), it would be good to have a set of standard test Parquet files generated by various systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility
[ https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8580: --- Assignee: Apache Spark (was: Cheng Lian) Add Parquet files generated by different systems to test interoperability and compatibility --- Key: SPARK-8580 URL: https://issues.apache.org/jira/browse/SPARK-8580 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 to improve interoperability with other systems (reading non-standard Parquet files they generate, and generating standard Parquet files), it would be good to have a set of standard test Parquet files generated by various systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old versions of Spark SQL) to ensure compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10143: Assignee: (was: Apache Spark) Parquet changed the behavior of calculating splits -- Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical When Parquet's task side metadata is enabled (by default it is enabled and it needs to be enabled to deal with tables with many files), Parquet delegates the work of calculating initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If filesystem's block size is smaller than the row group size and users do not set min split size, splits in the initial split list will have lots of dummy splits and they contribute to empty tasks (because the starting point and ending point of a split does not cover the starting point of a row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706045#comment-14706045 ] Apache Spark commented on SPARK-10143: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8346 Parquet changed the behavior of calculating splits -- Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical When Parquet's task side metadata is enabled (by default it is enabled and it needs to be enabled to deal with tables with many files), Parquet delegates the work of calculating initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If filesystem's block size is smaller than the row group size and users do not set min split size, splits in the initial split list will have lots of dummy splits and they contribute to empty tasks (because the starting point and ending point of a split does not cover the starting point of a row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10143: Assignee: Apache Spark Parquet changed the behavior of calculating splits -- Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Apache Spark Priority: Critical When Parquet's task side metadata is enabled (by default it is enabled and it needs to be enabled to deal with tables with many files), Parquet delegates the work of calculating initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If filesystem's block size is smaller than the row group size and users do not set min split size, splits in the initial split list will have lots of dummy splits and they contribute to empty tasks (because the starting point and ending point of a split does not cover the starting point of a row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10143: - Comment: was deleted (was: For something quick, we can use the row group size set in hadoop conf to set the min split size.) Parquet changed the behavior of calculating splits -- Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical When Parquet's task side metadata is enabled (by default it is enabled and it needs to be enabled to deal with tables with many files), Parquet delegates the work of calculating initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If filesystem's block size is smaller than the row group size and users do not set min split size, splits in the initial split list will have lots of dummy splits and they contribute to empty tasks (because the starting point and ending point of a split does not cover the starting point of a row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits
[ https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705988#comment-14705988 ] Yin Huai commented on SPARK-10143: -- [~rdblue] Can you confirm the behavior change of Parquet? Looks like we are just asking FileInputFormat to give us the initial splits. I am thinking to use the current setting of parquet row group size as the fs min split size for the job. What do you think? Thanks :) Parquet changed the behavior of calculating splits -- Key: SPARK-10143 URL: https://issues.apache.org/jira/browse/SPARK-10143 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Critical When Parquet's task side metadata is enabled (by default it is enabled and it needs to be enabled to deal with tables with many files), Parquet delegates the work of calculating initial splits to FileInputFormat (see https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311). If filesystem's block size is smaller than the row group size and users do not set min split size, splits in the initial split list will have lots of dummy splits and they contribute to empty tasks (because the starting point and ending point of a split does not cover the starting point of a row group). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
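A sketch of the quick fix floated in the comments: make the FileInputFormat minimum split size match the Parquet row group (block) size so the initial splits do not straddle row groups. Here sc is an existing SparkContext; the conf keys are the standard Hadoop/Parquet ones and are assumed to match the cluster's actual settings.
{code:title=Min split size sketch}
val hadoopConf = sc.hadoopConfiguration
// Parquet's default row group size is 128 MB unless overridden.
val rowGroupSize = hadoopConf.getLong("parquet.block.size", 128L * 1024 * 1024)
hadoopConf.setLong("mapreduce.input.fileinputformat.split.minsize", rowGroupSize)
{code}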
[jira] [Assigned] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10144: Assignee: Apache Spark (was: Andrew Or) Actually show peak execution memory on UI by default Key: SPARK-10144 URL: https://issues.apache.org/jira/browse/SPARK-10144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Apache Spark The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706033#comment-14706033 ] Apache Spark commented on SPARK-10144: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8345 Actually show peak execution memory on UI by default Key: SPARK-10144 URL: https://issues.apache.org/jira/browse/SPARK-10144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10144) Actually show peak execution memory on UI by default
[ https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10144: Assignee: Andrew Or (was: Apache Spark) Actually show peak execution memory on UI by default Key: SPARK-10144 URL: https://issues.apache.org/jira/browse/SPARK-10144 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Andrew Or Assignee: Andrew Or The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer the case... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706139#comment-14706139 ] Baogang Wang commented on SPARK-10145: -- the spark-defaults.conf is as follows:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.core.connection.ack.wait.timeout 3600
#spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize 1024
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.akka.timeout 900
spark.storage.memoryFraction 0.4
spark.rdd.compress true
spark.shuffle.blockTransferService nio
spark.yarn.executor.memoryOverhead 1024
Executor exit without useful messages when spark runs in spark-streaming Key: SPARK-10145 URL: https://issues.apache.org/jira/browse/SPARK-10145 Project: Spark Issue Type: Bug Components: Streaming, YARN Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 cores and 32g memory Reporter: Baogang Wang Priority: Critical Original Estimate: 168h Remaining Estimate: 168h Each node is allocated 30g memory by YARN. My application receives messages from Kafka via a direct stream. Each application consists of 4 dstream windows. The Spark application is submitted with this command: spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name safeSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties After about 1 hour, some executors exit. There are no more YARN logs after an executor exits, and there is no stack trace when it exits. The YARN NodeManager log shows the following:
2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_01 by user root
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=172.19.160.102 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_01
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING
2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_01 to application application_1439803298368_0005
2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2015-08-17 17:25:41,663 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from INITING to RUNNING
2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_01 transitioned from NEW to LOCALIZING
2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
2015-08-17 17:25:41,664 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_01
2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar transitioned from INIT to
[jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706139#comment-14706139 ] Baogang Wang edited comment on SPARK-10145 at 8/21/15 2:34 AM: ---
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.core.connection.ack.wait.timeout 3600
#spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize 1024
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.akka.timeout 900
spark.storage.memoryFraction 0.4
spark.rdd.compress true
spark.shuffle.blockTransferService nio
spark.yarn.executor.memoryOverhead 1024
was (Author: heayin): the spark-defaults.conf is as follows:
Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.core.connection.ack.wait.timeout 3600
#spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize 1024
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.akka.timeout 900
spark.storage.memoryFraction 0.4
spark.rdd.compress true
spark.shuffle.blockTransferService nio
spark.yarn.executor.memoryOverhead 1024
Executor exit without useful messages when spark runs in spark-streaming Key: SPARK-10145 URL: https://issues.apache.org/jira/browse/SPARK-10145 Project: Spark Issue Type: Bug Components: Streaming, YARN Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 cores and 32g memory Reporter: Baogang Wang Priority: Critical Original Estimate: 168h Remaining Estimate: 168h Each node is allocated 30g memory by YARN. My application receives messages from Kafka via a direct stream. Each application consists of 4 dstream windows. The Spark application is submitted with this command: spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name safeSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties After about 1 hour, some executors exit. There are no more YARN logs after an executor exits, and there is no stack trace when it exits.
The YARN NodeManager log shows the following:
2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_01 by user root
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=172.19.160.102 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_01
2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING
2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_01 to application application_1439803298368_0005
2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The
[jira] [Resolved] (SPARK-10140) Add target fields to @Since annotation
[ https://issues.apache.org/jira/browse/SPARK-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10140. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8344 [https://github.com/apache/spark/pull/8344] Add target fields to @Since annotation -- Key: SPARK-10140 URL: https://issues.apache.org/jira/browse/SPARK-10140 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.5.0 Add target fields to @Since so constructor params and fields also get annotated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
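For readers unfamiliar with Scala meta-annotations, the mechanism behind this change looks roughly like the sketch below; it is simplified, and the real @Since lives inside Spark and may carry more targets.
{code:title=Annotation targets (sketch)}
import scala.annotation.meta.{field, getter, param}

// Declaring targets on the annotation class itself makes a single @Since on a
// constructor parameter also land on the generated field and getter.
@param @field @getter
class Since(version: String) extends scala.annotation.StaticAnnotation

class Model(@Since("1.5.0") val coefficients: Array[Double])
{code}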