[jira] [Created] (SPARK-6286) Handle TASK_ERROR in TaskState
Iulian Dragos created SPARK-6286: Summary: Handle TASK_ERROR in TaskState Key: SPARK-6286 URL: https://issues.apache.org/jira/browse/SPARK-6286 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Iulian Dragos Priority: Minor Scala warning: {code} match may not be exhaustive. It would fail on the following input: TASK_ERROR {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
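For illustration, a minimal Scala sketch of an exhaustive mapping over the Mesos task states, with TASK_ERROR folded into the failure branch. The enum values come from the Mesos Protos API; the method name and the Spark-side states shown here are illustrative assumptions, not the actual code in TaskState.scala.

{code}
import org.apache.mesos.Protos.{TaskState => MesosState}

// Illustrative Spark-side states; the real ones live in org.apache.spark.TaskState.
sealed trait AppTaskState
case object Launching extends AppTaskState
case object Running extends AppTaskState
case object Finished extends AppTaskState
case object Failed extends AppTaskState
case object Killed extends AppTaskState
case object Lost extends AppTaskState

def fromMesos(state: MesosState): AppTaskState = state match {
  case MesosState.TASK_STAGING | MesosState.TASK_STARTING => Launching
  case MesosState.TASK_RUNNING => Running
  case MesosState.TASK_FINISHED => Finished
  // Covering TASK_ERROR (added in Mesos 0.21) removes the exhaustiveness warning.
  case MesosState.TASK_FAILED | MesosState.TASK_ERROR => Failed
  case MesosState.TASK_KILLED => Killed
  case MesosState.TASK_LOST => Lost
}
{code}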
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357055#comment-14357055 ] ANUPAM MEDIRATTA commented on SPARK-5692: - Manoj Kumar, Not yet but plan to work on it over the weekend. Is that okay? Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: ANUPAM MEDIRATTA Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ANUPAM MEDIRATTA updated SPARK-5692: Comment: was deleted (was: Manoj Kumar, Not yet but plan to work on it over the weekend. Is that okay? ) Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: ANUPAM MEDIRATTA Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6283) Add a CassandraInputDStream to stream from a C* table
Helena Edelson created SPARK-6283: - Summary: Add a CassandraInputDStream to stream from a C* table Key: SPARK-6283 URL: https://issues.apache.org/jira/browse/SPARK-6283 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Helena Edelson Add support for streaming from Cassandra to Spark Streaming - external. Related ticket: https://datastax-oss.atlassian.net/browse/SPARKC-40 [~helena_e] is doing the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iulian Dragos updated SPARK-6286: - Labels: mesos (was: ) Handle TASK_ERROR in TaskState -- Key: SPARK-6286 URL: https://issues.apache.org/jira/browse/SPARK-6286 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Iulian Dragos Priority: Minor Labels: mesos Scala warning: {code} match may not be exhaustive. It would fail on the following input: TASK_ERROR {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357162#comment-14357162 ] mgdadv commented on SPARK-6189: --- The current behavior really is quite confusing, in particular with R datasets, where the period is often used in column names. I think it is not obvious what the correct thing to do is. The patch above replaces the period with an underscore. This fixes the problem, but could be problematic if a different solution is wanted in the future, since scripts relying on this behavior would then have to be changed. Alternatively, one could just spit out a warning. The problem is that Spark is quite verbose and the warning might be missed. This would be the least intrusive solution I can think of. Another possibility would be to raise an exception instead of just printing a warning. Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then used toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
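For reference, the rename-on-load idea discussed above, sketched in Scala against the DataFrame API. The actual pull request targets the PySpark/Pandas conversion path, and the helper name here is hypothetical.

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: replace periods in column names with underscores so the
// DSL does not interpret them as nested-field access (e.g. "GNP.deflator" -> "GNP_deflator").
def sanitizeColumnNames(df: DataFrame): DataFrame =
  df.schema.fieldNames.foldLeft(df) { (d, name) =>
    if (name.contains(".")) d.withColumnRenamed(name, name.replace(".", "_")) else d
  }
{code}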
[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357072#comment-14357072 ] Marcelo Vanzin commented on SPARK-6277: --- Sometimes I do miss being able to reference other configs and/or system properties / env variables from the config. It would allow some interesting use cases, at least from a distribution's point of view. This should be pretty simple to achieve using the commons-config library, although I'd prefer avoiding the dependency since it would be more friendly to SPARK-4824. Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf - Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.0, 1.2.1 Reporter: Jianshi Huang I need to set spark.local.dir to use user local home instead of /tmp, but currently spark-defaults.conf can only allow constant values. What I want to do is to write: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357072#comment-14357072 ] Marcelo Vanzin edited comment on SPARK-6277 at 3/11/15 3:49 PM: Sometimes I do miss being able to reference other configs and/or system properties / env variables from the config. It would allow some interesting use cases, at least from a distribution's point of view. This should be pretty simple to achieve using the commons-config library, although I'd prefer avoiding the dependency since it would be more friendly to SPARK-4924. was (Author: vanzin): Sometimes I do miss being able to reference other configs and/or system properties / env variables from the config. It would allow some interesting use cases, at least from a distribution's point of view. This should be pretty simple to achieve using the commons-config library, although I'd prefer avoiding the dependency since it would be more friendly to SPARK-4824. Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf - Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.0, 1.2.1 Reporter: Jianshi Huang I need to set spark.local.dir to use user local home instead of /tmp, but currently spark-defaults.conf can only allow constant values. What I want to do is to write: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
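For illustration, the kind of expansion being requested could look like the following Scala sketch, which resolves ${NAME} references against system properties and then environment variables. This is not existing Spark behavior, just a sketch of the proposal.

{code}
import scala.util.matching.Regex

val reference: Regex = """\$\{([^}]+)\}""".r

// Expand ${user.name}- or ${USER}-style references in a config value;
// unknown references are left untouched.
def expand(value: String): String =
  reference.replaceAllIn(value, m => {
    val key = m.group(1)
    val resolved = sys.props.get(key).orElse(sys.env.get(key)).getOrElse(m.matched)
    Regex.quoteReplacement(resolved)
  })

// expand("/home/${USER}/spark/tmp") would yield "/home/alice/spark/tmp" when USER=alice.
{code}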
[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357086#comment-14357086 ] Apache Spark commented on SPARK-6189: - User 'mgdadv' has created a pull request for this issue: https://github.com/apache/spark/pull/4982 Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then used toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6284) Support framework authentication and role in Mesos framework
Timothy Chen created SPARK-6284: --- Summary: Support framework authentication and role in Mesos framework Key: SPARK-6284 URL: https://issues.apache.org/jira/browse/SPARK-6284 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support framework authentication and role in both Coarse grain and fine grain mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6285) Duplicated code leads to errors
Iulian Dragos created SPARK-6285: Summary: Duplicated code leads to errors Key: SPARK-6285 URL: https://issues.apache.org/jira/browse/SPARK-6285 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Iulian Dragos The following class is duplicated inside [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39] and [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44], with the exact same code and fully qualified name: {code} org.apache.spark.sql.parquet.TestGroupWriteSupport {code} The second one was introduced in [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830], but even though it mentions that `ParquetTestData` should be removed later, I couldn't find a corresponding Jira ticket. This duplicate class causes the Eclipse builder to fail (since src/main and src/test are compiled together in Eclipse, unlike Sbt). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework
[ https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357181#comment-14357181 ] Apache Spark commented on SPARK-6284: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/4960 Support framework authentication and role in Mesos framework Key: SPARK-6284 URL: https://issues.apache.org/jira/browse/SPARK-6284 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support framework authentication and role in both Coarse grain and fine grain mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357545#comment-14357545 ] Jason Hubbard commented on SPARK-2243: -- It also mentions changing the broadcast factory because it doesn't work properly with multiple Spark contexts: https://github.com/spark-jobserver/spark-jobserver/blob/5f3cadcc95465fe6d97fdffbf78e38ef5342ffa1/job-server/src/main/resources/application.conf#L29 I believe that is probably related to the JIRA already mentioned in this post, but is marked as won't fix: SPARK-3148 I've run multiple Spark contexts in a single JVM, and the problem I see is that the Akka actor on the worker tries connecting to the wrong driver. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Created] (SPARK-6289) PySpark doesn't maintain SQL Types
Michael Nazario created SPARK-6289: -- Summary: PySpark doesn't maintain SQL Types Key: SPARK-6289 URL: https://issues.apache.org/jira/browse/SPARK-6289 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.1 Reporter: Michael Nazario For the TimestampType, Spark SQL requires a datetime.date in Python. However, if you collect a row based on that type, you'll end up with a returned value which is of type datetime.datetime. I have tried to reproduce this using the pyspark shell, but have been unable to. This is definitely a problem coming from pyrolite though: https://github.com/irmen/Pyrolite/ Pyrolite is being used for datetime and date serialization, but it appears to map dates to datetime objects rather than date objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6289) PySpark doesn't maintain SQL date Types
[ https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Nazario updated SPARK-6289: --- Summary: PySpark doesn't maintain SQL date Types (was: PySpark doesn't maintain SQL Types) PySpark doesn't maintain SQL date Types --- Key: SPARK-6289 URL: https://issues.apache.org/jira/browse/SPARK-6289 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.2.1 Reporter: Michael Nazario For the TimestampType, Spark SQL requires a datetime.date in Python. However, if you collect a row based on that type, you'll end up with a returned value which is of type datetime.datetime. I have tried to reproduce this using the pyspark shell, but have been unable to. This is definitely a problem coming from pyrolite though: https://github.com/irmen/Pyrolite/ Pyrolite is being used for datetime and date serialization, but it appears to map dates to datetime objects rather than date objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes
[ https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357627#comment-14357627 ] Josh Rosen commented on SPARK-6270: --- In the long run, my preference is to remove HistoryServer-like responsibilities from the Master: the standalone Master is typically configured with a small amount of memory and risks OOMing when loading UIs, even if the UI loading is done asynchronously (right now it blocks the main event processing thread). We might consider trying to add lazy loading as an intermediate stepping-stone to properly fixing this issue, but I'd like to argue against that approach: lazy loading inside of the Master is going to require mechanisms similar to what we have in the HistoryServer's loaderServlet, so we're either going to have to duplicate a bunch of code or change the HistoryServer code to be more modular so that we can reuse its components inside of the Master. Another consideration is firewall / port issues: currently, the master web UI and the Spark web UIs that it loads are served on the same port. If we set up a new Jetty server for the UIs, whether in the same Master JVM or in a separate HistoryServer process, then the Spark UIs will be served at some different port, potentially breaking those links in environments where only the master web UI port is exposed. I think it's going to be really painful to avoid this, though, and I don't think we should resort to solutions where we proxy the Spark UI through the master UI, since the responses could be huge and lead to OOMs in the proxy. I think we should introduce a new configuration which completely disables the master's Spark UI serving feature, backport this to all maintenance branches, and mention this feature in the release notes. For Spark 1.4, I think we should completely remove the web UI serving from the Master and provide the ability to configure the master with a HistoryServer address which will be used to generate links to UIs. This runs into its own set of problems, though: the current HistoryServer FSHistoryProvider assumes that all applications' event logs are located in the same directory, whereas the Master can load event logs from any directory which is specified in the application description. This means that we'll need a way to instruct the HistoryServer to load logs from an arbitrary path. Therefore, maybe we should extend the HistoryServer's HTTP interface to allow requests to specify the event log location (falling back to the history server's default event log directory if no alternate log location was specified). This could have security implications, though; we'd have to be careful to ensure that this doesn't allow arbitrary file reads. Standalone Master hangs when streaming job completes Key: SPARK-6270 URL: https://issues.apache.org/jira/browse/SPARK-6270 Project: Spark Issue Type: Bug Components: Deploy, Streaming Affects Versions: 1.2.0, 1.3.0, 1.2.1 Reporter: Tathagata Das Priority: Critical If the event logging is enabled, the Spark Standalone Master tries to recreate the web UI of a completed Spark application from its event logs. However, if this event log is huge (e.g. for a Spark Streaming application), then the master hangs in its attempt to read and recreate the web UI. This hang causes the whole standalone cluster to be unusable. Workaround is to disable the event logging. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357497#comment-14357497 ] Jeremy Freeman edited comment on SPARK-2429 at 3/11/15 8:10 PM: Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding to spark-packages, especially if we expect the review to take awhile. Those seem especially useful as a way to provide easy access to testing experimental pieces of functionality. But I'd probably prioritize just reviewing the patch. Also agree with the others that we should start a new PR with the new algorithm, 1000x faster is a lot! It is worth incorporating some of comments from the old PR if you haven't already, if relevant in the new version. I'd be happy to go through the new PR as I'm quite familiar with the problem / algorithm, but it would help if you could say a little more about what you did so differently here, to help guide me as I look at the code. was (Author: freeman-lab): Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding to spark-packages, especially if we expect the review to take awhile. Those seem especially useful as a way to provide easy access to testing new pieces of functionality. But I'd probably prioritize just reviewing the patch. Also agree with the others that we should start a new PR with the new algorithm, 1000x faster is a lot! It is worth incorporating some of comments from the old PR if you haven't already, if relevant in the new version. I'd be happy to go through the new PR as I'm quite familiar with the problem / algorithm, but it would help if you could say a little more about what you did so differently here, to help guide me as I look at the code. Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357497#comment-14357497 ] Jeremy Freeman commented on SPARK-2429: --- Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding to spark-packages, especially if we expect the review to take a while. Those seem especially useful as a way to provide easy access to testing new pieces of functionality. But I'd probably prioritize just reviewing the patch. Also agree with the others that we should start a new PR with the new algorithm; 1000x faster is a lot! It is worth incorporating some of the comments from the old PR if you haven't already, where relevant in the new version. I'd be happy to go through the new PR as I'm quite familiar with the problem / algorithm, but it would help if you could say a little more about what you did so differently here, to help guide me as I look at the code. Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5987) Model import/export for GaussianMixtureModel
[ https://issues.apache.org/jira/browse/SPARK-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357502#comment-14357502 ] Apache Spark commented on SPARK-5987: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/4986 Model import/export for GaussianMixtureModel Key: SPARK-5987 URL: https://issues.apache.org/jira/browse/SPARK-5987 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Support save/load for GaussianMixtureModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5050) Add unit test for sqdist
[ https://issues.apache.org/jira/browse/SPARK-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357616#comment-14357616 ] Apache Spark commented on SPARK-5050: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4985 Add unit test for sqdist Key: SPARK-5050 URL: https://issues.apache.org/jira/browse/SPARK-5050 Project: Spark Issue Type: Test Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Related to #3643. Follow the previous suggestion to add unit test for sqdist in VectorsSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4797) Replace breezeSquaredDistance
[ https://issues.apache.org/jira/browse/SPARK-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357617#comment-14357617 ] Apache Spark commented on SPARK-4797: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4985 Replace breezeSquaredDistance - Key: SPARK-4797 URL: https://issues.apache.org/jira/browse/SPARK-4797 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 This PR replaces slow breezeSquaredDistance. A simple calculation involving 4 squared distances between the vectors of 2 dims shows: * breezeSquaredDistance: ~12 secs * This PR: ~10.5 secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
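For context, a tight while-loop over the raw arrays avoids the generic dispatch and intermediate allocation behind breezeSquaredDistance. A minimal dense-only Scala sketch follows; the actual MLlib implementation also handles sparse vectors and numerical-precision concerns.

{code}
// Dense-only sketch of squared Euclidean distance.
def sqdist(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var sum = 0.0
  var i = 0
  while (i < a.length) {
    val d = a(i) - b(i)
    sum += d * d
    i += 1
  }
  sum
}
{code}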
[jira] [Comment Edited] (SPARK-6206) spark-ec2 script reporting SSL error?
[ https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357260#comment-14357260 ] Joe O edited comment on SPARK-6206 at 3/11/15 5:38 PM: --- Ok close this one out. The problem was local configuration. I had installed the Google Cloud Services tools, which created a ~/.boto file. The boto library that came packaged with Spark picked up on this and started using the settings in the file, which conflicted with what I was telling Spark to use. Renaming the ~/.boto file temporarily cause the spark-ec2 script to start working again. Documenting everything here in case someone else runs into this problem. was (Author: joe6521): Ok close this one out. The problem was local configuration. I had installed the Google Cloud Services tools, which created a ~/.boto file. The boto library that packaged with Spark picked up on this and started using the settings in the file, which conflicted with what I was telling Spark to use. Renaming the ~/.boto file temporarily cause the spark-ec2 script to start working again. Documenting everything here in case someone else runs into this problem. spark-ec2 script reporting SSL error? - Key: SPARK-6206 URL: https://issues.apache.org/jira/browse/SPARK-6206 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Joe O I have been using the spark-ec2 script for several months with no problems. Recently, when executing a script to launch a cluster I got the following error: {code} [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib {code} Nothing launches, the script exits. I am not sure if something on machine changed, this is a problem with EC2's certs, or a problem with Python. It occurs 100% of the time, and has been occurring over at least the last two days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6274) Add streaming examples showing integration with DataFrames and SQL
[ https://issues.apache.org/jira/browse/SPARK-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-6274. -- Resolution: Fixed Fix Version/s: 1.3.1 1.4.0 Add streaming examples showing integration with DataFrames and SQL -- Key: SPARK-6274 URL: https://issues.apache.org/jira/browse/SPARK-6274 Project: Spark Issue Type: Improvement Components: Examples, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.4.0, 1.3.1 Self explanatory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6227: - Priority: Major (was: Minor) PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357354#comment-14357354 ] Joseph K. Bradley commented on SPARK-6227: -- I was about to---but then I realized this is a much bigger task than it appears since it will require writing Python wrappers for the various distributed matrix types. I just made this a subtask of another JIRA. Could you take a look at the parent JIRA and the distributed matrices code and figure out a good piece of the work to start with? Hopefully we can break the work into pieces in a natural way. Thanks! PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357367#comment-14357367 ] RJ Nowling commented on SPARK-2429: --- Hi [~yuu.ishik...@gmail.com] I think the new implementation is great. Did you change the algorithm? I've spoken with [~srowen]. The hierarchical clustering would be valuable to the community -- I actually had a couple people reach out to me about it. However, Spark is currently undergoing the transition to the new ML API and as such, there is concern about accepting code into the older MLlib library. With the announcement of Spark packages, there is also a move to encourage external libraries instead of large commits into Spark itself. Would you be interested in publishing your hierarchical clustering implementation as an external library like [~derrickburns] did for the [KMeans Mini Batch implementation|https://github.com/derrickburns/generalized-kmeans-clustering]? It could be listed in the [Spark packages index|http://spark-packages.org/] along with two other clustering packages so users can find it. Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357382#comment-14357382 ] Sean Owen commented on SPARK-6282: -- http://stackoverflow.com/questions/11133506/importerror-while-importing-winreg-module-of-python It sounds like something you are calling invokes a Windows-only Python library called winreg, but you're executing on Linux. This doesn't sound Spark-related, as certainly Spark does not invoke this. Strange Python import error when using random() in a lambda function Key: SPARK-6282 URL: https://issues.apache.org/jira/browse/SPARK-6282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Kubuntu 14.04, Python 2.7.6 Reporter: Pavel Laskov Priority: Minor Consider the exemplary Python code below: from random import random from pyspark.context import SparkContext from xval_mllib import read_csv_file_as_list if __name__ == __main__: sc = SparkContext(appName=Random() bug test) data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) #data = sc.parallelize([1, 2, 3, 4, 5], 2) d = data.map(lambda x: (random(), x)) print d.first() Data is read from a large CSV file. Running this code results in a Python import error: ImportError: No module named _winreg If I use 'import random' and 'random.random()' in the lambda function no error occurs. Also no error occurs, for both kinds of import statements, for a small artificial data set like the one shown in a commented line. The full error trace, the source code of csv reading code (function 'read_csv_file_as_list' is my own) as well as a sample dataset (the original dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357591#comment-14357591 ] Peter Rudenko commented on SPARK-2243: -- Unfortunately it doesn't work in spark-jobserver (at least for version 1.2.+). Take a look at [this thread|https://groups.google.com/d/msg/spark-jobserver/f466U2vydMY/s3b0xPNn4U8J]. Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at
[jira] [Commented] (SPARK-6206) spark-ec2 script reporting SSL error?
[ https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357260#comment-14357260 ] Joe O commented on SPARK-6206: -- Ok close this one out. The problem was local configuration. I had installed the Google Cloud Services tools, which created a ~/.boto file. The boto library that packaged with Spark picked up on this and started using the settings in the file, which conflicted with what I was telling Spark to use. Renaming the ~/.boto file temporarily cause the spark-ec2 script to start working again. Documenting everything here in case someone else runs into this problem. spark-ec2 script reporting SSL error? - Key: SPARK-6206 URL: https://issues.apache.org/jira/browse/SPARK-6206 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Joe O I have been using the spark-ec2 script for several months with no problems. Recently, when executing a script to launch a cluster I got the following error: {code} [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib {code} Nothing launches, the script exits. I am not sure if something on machine changed, this is a problem with EC2's certs, or a problem with Python. It occurs 100% of the time, and has been occurring over at least the last two days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357272#comment-14357272 ] Apache Spark commented on SPARK-5186: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4985 Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size - Key: SPARK-5186 URL: https://issues.apache.org/jira/browse/SPARK-5186 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Assignee: yuhao yang Fix For: 1.3.0 Original Estimate: 0.25h Remaining Estimate: 0.25h The implementations of Vector.equals and Vector.hashCode are correct but slow for SparseVectors that are truly sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
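For illustration, one way hashCode can stay cheap on truly sparse vectors is to touch only the first few nonzero entries. A hedged Scala sketch follows; the cap of 128 and the mixing scheme are arbitrary choices for illustration, not the actual MLlib fix.

{code}
// Hash only the first few nonzero (index, value) pairs so the cost does not
// grow with the vector's nominal size.
def sparseHash(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  var result = 31 + size
  var seen = 0
  var k = 0
  while (k < values.length && seen < 128) {
    if (values(k) != 0.0) {
      result = 31 * result + indices(k)
      val bits = java.lang.Double.doubleToLongBits(values(k))
      result = 31 * result + (bits ^ (bits >>> 32)).toInt
      seen += 1
    }
    k += 1
  }
  result
}
{code}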
[jira] [Created] (SPARK-6288) Pyrolite calls hashCode to cache previously serialized objects
Xiangrui Meng created SPARK-6288: Summary: Pyrolite calls hashCode to cache previously serialized objects Key: SPARK-6288 URL: https://issues.apache.org/jira/browse/SPARK-6288 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.1, 1.1.1, 1.0.2, 1.3.0 Reporter: Xiangrui Meng Assignee: Josh Rosen https://github.com/irmen/Pyrolite/blob/v2.0/java/src/net/razorvine/pickle/Pickler.java#L140 This operation could be quite expensive, compared to serializing the object directly, because hashCode usually needs to access all data stored in the object. Maybe we should disable this feature by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
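A minimal, self-contained sketch of the concern above (not Pyrolite's actual code): caching previously serialized objects keyed by hashCode forces a traversal of each object's data, whereas keying on reference identity does not. The PickleCache name and API below are made up for illustration.
{code}
import java.util.IdentityHashMap

// Hypothetical cache sketch: an identity lookup costs O(1) per object,
// while keying on hashCode can force hashing every element of a large collection.
class PickleCache {
  private val memo = new IdentityHashMap[AnyRef, Array[Byte]]()

  def getOrElsePickle(obj: AnyRef)(pickle: AnyRef => Array[Byte]): Array[Byte] = {
    val cached = memo.get(obj) // identity-based lookup, no hashCode call on obj's contents
    if (cached != null) cached
    else {
      val bytes = pickle(obj)
      memo.put(obj, bytes)
      bytes
    }
  }
}
{code}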
[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357380#comment-14357380 ] Joseph K. Bradley edited comment on SPARK-2429 at 3/11/15 6:47 PM: --- But as far as the old vs. new algorithm, it sounds like it would make sense to replace the old one with the new one for this PR (though I have not yet had a chance to compare and understand them in detail). Are there tradeoffs? was (Author: josephkb): But as far as the old vs. new algorithm, it sounds like it would make sense to replace the old one with the new one for this PR (though I have not yet had a chance to compare and understand them in detail). Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357385#comment-14357385 ] Sean Owen commented on SPARK-6286: -- Probably OK for 1.4.x. Do you know what to do with this case? [~jongyoul] do you have an opinion? I think you made the update to 0.21.0. Handle TASK_ERROR in TaskState -- Key: SPARK-6286 URL: https://issues.apache.org/jira/browse/SPARK-6286 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Iulian Dragos Priority: Minor Labels: mesos Scala warning: {code} match may not be exhaustive. It would fail on the following input: TASK_ERROR {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
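For illustration only (this is not the actual Spark scheduler code): a hedged sketch of what an exhaustive match over the Mesos enum could look like, treating TASK_ERROR as just another terminal failure state.
{code}
import org.apache.mesos.Protos.{TaskState => MesosTaskState}

// Hypothetical helper: covering TASK_ERROR alongside the other terminal states silences
// the exhaustiveness warning; whether it should map to FAILED or LOST in Spark's own
// TaskState is the open question in this ticket.
def isTerminalFailure(state: MesosTaskState): Boolean = state match {
  case MesosTaskState.TASK_FAILED |
       MesosTaskState.TASK_KILLED |
       MesosTaskState.TASK_LOST |
       MesosTaskState.TASK_ERROR => true
  case _ => false
}
{code}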
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357407#comment-14357407 ] Joseph K. Bradley commented on SPARK-2429: -- I'll try to prioritize it, though the next week or so will be difficult because of the Spark Summit. It will be valuable to have your input still since you're familiar with the PR. Thanks for both of your efforts on this! Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357463#comment-14357463 ] Craig Lukasik commented on SPARK-2243: -- The Spark Job Server uses multiple SparkContexts. I think the trick might be based on using a separate class loader. See line 250 here: https://github.com/ooyala/spark-jobserver/blob/master/job-server/src/spark.jobserver/JobManagerActor.scala . Sadly, I could not replicate this trick in Java (as subsequent uses of Spark API in my code resulted in ClassCastException's). I wonder if Scala is more forgiving? Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 0.7.0, 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
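A self-contained sketch of why the ClassCastException mentioned in the comment above is expected when the class-loader trick is replayed in Java or Scala; the helper below is illustrative, not Spark or Job Server code.
{code}
import java.net.{URL, URLClassLoader}

// Hypothetical helper: a class loaded through an isolated URLClassLoader is a different
// runtime class from one with the same name loaded by the application class loader, so
// casting the instance to an application-side type throws ClassCastException. It can only
// be used reflectively or through an interface loaded by a shared parent loader.
def loadIsolated(jars: Array[URL], className: String): AnyRef = {
  val isolated = new URLClassLoader(jars, null) // parent = null: no delegation to the app loader
  isolated.loadClass(className).newInstance().asInstanceOf[AnyRef]
}
{code}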
[jira] [Commented] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler
[ https://issues.apache.org/jira/browse/SPARK-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357262#comment-14357262 ] Apache Spark commented on SPARK-6287: - User 'dragos' has created a pull request for this issue: https://github.com/apache/spark/pull/4984 Add support for dynamic allocation in the Mesos coarse-grained scheduler Key: SPARK-6287 URL: https://issues.apache.org/jira/browse/SPARK-6287 Project: Spark Issue Type: Bug Components: Mesos Reporter: Iulian Dragos Add support inside the coarse-grained Mesos scheduler for dynamic allocation. It amounts to implementing two methods that allow scaling up and down the number of executors: {code} def doKillExecutors(executorIds: Seq[String]) def doRequestTotalExecutors(requestedTotal: Int) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
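A hedged, self-contained sketch of the bookkeeping those two hooks imply (class and field names below are invented for illustration; the real change lives in the Mesos coarse-grained backend):
{code}
import scala.collection.mutable

// Hypothetical sketch: record the requested total and the executors marked for removal.
// The real backend would additionally stop launching Mesos tasks past executorLimit and
// ask the Mesos driver to kill the tasks backing the executors in pendingRemoves.
class AllocationBookkeeping {
  @volatile private var executorLimit = Int.MaxValue
  private val pendingRemoves = mutable.Set.empty[String]

  def doRequestTotalExecutors(requestedTotal: Int): Boolean = {
    executorLimit = requestedTotal
    true
  }

  def doKillExecutors(executorIds: Seq[String]): Boolean = synchronized {
    pendingRemoves ++= executorIds
    true
  }
}
{code}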
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357335#comment-14357335 ] Joseph K. Bradley commented on SPARK-5992: -- I'm not sure what the most important or useful ones would be. I've heard the most about random projections. Are you familiar with the others? Perhaps we can make a list of algorithms and their properties to try to get a reasonable coverage of use cases. Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6285) Duplicated code leads to errors
[ https://issues.apache.org/jira/browse/SPARK-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357297#comment-14357297 ] Iulian Dragos commented on SPARK-6285: -- According to the git commit message that introduced the duplicate: {quote} To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled: - `ParquetQuerySuite` - `ParquetTestData` {quote} I mentioned the Eclipse build problem in passing, but I can expand: the class *is* a duplicated name, so the Scala compiler is correct in refusing it. It only compiles in Sbt/Maven because the src/main and src/test are compiled in separate compiler runs, and scalac seems to not notice the duplicate name when it comes from bytecode. Eclipse builds src/main and src/test together, and when both classes originate from sources scalac issues an error message. Duplicated code leads to errors --- Key: SPARK-6285 URL: https://issues.apache.org/jira/browse/SPARK-6285 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Iulian Dragos The following class is duplicated inside [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39] and [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44], with exact same code and fully qualified name: {code} org.apache.spark.sql.parquet.TestGroupWriteSupport {code} The second one was introduced in [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830], but even though it mentions that `ParquetTestData` should be removed later, I couldn't find a corresponding Jira ticket. This duplicate class causes the Eclipse builder to fail (since src/main and src/test are compiled together in Eclipse, unlike Sbt). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6285) Duplicated code leads to errors
[ https://issues.apache.org/jira/browse/SPARK-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357388#comment-14357388 ] Sean Owen commented on SPARK-6285: -- [~lian cheng] do you think we can now remove these two classes in order to resolve the issue? Duplicated code leads to errors --- Key: SPARK-6285 URL: https://issues.apache.org/jira/browse/SPARK-6285 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Iulian Dragos The following class is duplicated inside [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39] and [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44], with exact same code and fully qualified name: {code} org.apache.spark.sql.parquet.TestGroupWriteSupport {code} The second one was introduced in [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830], but even though it mentions that `ParquetTestData` should be removed later, I couldn't find a corresponding Jira ticket. This duplicate class causes the Eclipse builder to fail (since src/main and src/test are compiled together in Eclipse, unlike Sbt). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357389#comment-14357389 ] RJ Nowling commented on SPARK-2429: --- [~josephkb] I think it would be great to get the new implementation into Spark, but we need a champion for it. [~yuu.ishik...@gmail.com] did some great work, and I've been trying to shepherd the work, but we need a committer who wants to bring it in. If you want to do that, then I can step back and let you and [~yuu.ishik...@gmail.com] bring this across the finish line. Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357410#comment-14357410 ] RJ Nowling commented on SPARK-2429: --- I'm familiar with the community interest, but I'm not terribly familiar with the implementations (old or new). [~freeman-lab] may be the appropriate person to ask for help -- the original implementation was based on his gist. Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-5186: -- Re-opened this issue for branch-1.2. Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size - Key: SPARK-5186 URL: https://issues.apache.org/jira/browse/SPARK-5186 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Assignee: yuhao yang Fix For: 1.3.0 Original Estimate: 0.25h Remaining Estimate: 0.25h The implementation of Vector.equals and Vector.hashCode are correct but slow for SparseVectors that are truly sparse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357364#comment-14357364 ] Joseph K. Bradley commented on SPARK-6282: -- Do you know where _winreg appears in the code you're running? Is it being brought in by the read_csv_file_as_list method or its containing package? Strange Python import error when using random() in a lambda function Key: SPARK-6282 URL: https://issues.apache.org/jira/browse/SPARK-6282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Kubuntu 14.04, Python 2.7.6 Reporter: Pavel Laskov Priority: Minor Consider the exemplary Python code below: from random import random from pyspark.context import SparkContext from xval_mllib import read_csv_file_as_list if __name__ == "__main__": sc = SparkContext(appName="Random() bug test") data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) #data = sc.parallelize([1, 2, 3, 4, 5], 2) d = data.map(lambda x: (random(), x)) print d.first() Data is read from a large CSV file. Running this code results in a Python import error: ImportError: No module named _winreg If I use 'import random' and 'random.random()' in the lambda function no error occurs. Also no error occurs, for both kinds of import statements, for a small artificial data set like the one shown in a commented line. The full error trace, the source code of csv reading code (function 'read_csv_file_as_list' is my own) as well as a sample dataset (the original dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357240#comment-14357240 ] Sean Owen commented on SPARK-6286: -- I remember looking at this -- it's a Mesos enum, right? I wondered if it were a new-ish value, and handling it would somehow make the code incompatible with older versions of Mesos, so I didn't want to touch it. But that is not based on any real knowledge. If you know what to do in this state and can confirm that it's not a value specific to only some supported Mesos versions, then I'd go for it and add a handler. Handle TASK_ERROR in TaskState -- Key: SPARK-6286 URL: https://issues.apache.org/jira/browse/SPARK-6286 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Iulian Dragos Priority: Minor Labels: mesos Scala warning: {code} match may not be exhaustive. It would fail on the following input: TASK_ERROR {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357248#comment-14357248 ] Joseph K. Bradley commented on SPARK-6189: -- Maybe it's actually OK to allow periods in field names since SQL does. We could be like SQL, where periods are OK and users just need to make sure to quote the field name so that SQL doesn't think the period indicates a subfield. I haven't tried this yet with DataFrames to check its behavior. Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then use toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
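A hedged sketch of the quoting idea (untested against 1.3, as the comment notes; the example values are made up): backticks keep the period from being parsed as a field access in a SQL query. It assumes an existing SQLContext named sqlContext.
{code}
val df = sqlContext.createDataFrame(Seq((1014.5, 234.289), (1088.9, 259.426)))
  .toDF("GNP.deflator", "GNP")
df.registerTempTable("longley")
// Quoting with backticks, as in HiveQL, so the period is treated as part of the column name.
sqlContext.sql("SELECT `GNP.deflator` FROM longley").show()
{code}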
[jira] [Closed] (SPARK-6206) spark-ec2 script reporting SSL error?
[ https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe O closed SPARK-6206. Resolution: Invalid spark-ec2 script reporting SSL error? - Key: SPARK-6206 URL: https://issues.apache.org/jira/browse/SPARK-6206 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Joe O I have been using the spark-ec2 script for several months with no problems. Recently, when executing a script to launch a cluster, I got the following error: {code} [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib {code} Nothing launches, the script exits. I am not sure if something on my machine changed, if this is a problem with EC2's certs, or if it is a problem with Python. It occurs 100% of the time, and has been occurring over at least the last two days. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler
Iulian Dragos created SPARK-6287: Summary: Add support for dynamic allocation in the Mesos coarse-grained scheduler Key: SPARK-6287 URL: https://issues.apache.org/jira/browse/SPARK-6287 Project: Spark Issue Type: Bug Components: Mesos Reporter: Iulian Dragos Add support inside the coarse-grained Mesos scheduler for dynamic allocation. It amounts to implementing two methods that allow scaling up and down the number of executors: {code} def doKillExecutors(executorIds: Seq[String]) def doRequestTotalExecutors(requestedTotal: Int) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357380#comment-14357380 ] Joseph K. Bradley commented on SPARK-2429: -- But as far as the old vs. new algorithm, it sounds like it would make sense to replace the old one with the new one for this PR (though I have not yet had a chance to compare and understand them in detail). Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357377#comment-14357377 ] Joseph K. Bradley commented on SPARK-2429: -- I don't think we should discourage contributions to the current MLlib package. It's true we're experimenting with spark.ml and figuring out a better long-term API, but the main work in JIRAs/PRs like this one is in designing and implementing the algorithm itself, which we can easily copy over to the new API when the time comes. I'd vote for trying to get this into spark.mllib and then wrapping it for spark.ml when ready. (It'd be fine to do a Spark package too, but I hope this can get into MLlib itself.) Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6290) spark.ml.param.Params.checkInputColumn bug upon error
Joseph K. Bradley created SPARK-6290: Summary: spark.ml.param.Params.checkInputColumn bug upon error Key: SPARK-6290 URL: https://issues.apache.org/jira/browse/SPARK-6290 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor In checkInputColumn, if data types do not match, it tries to print an error message with this in it: {code} Column param description: ${getParam(colName)} {code} However, getParam cannot be called on the string colName; it needs the parameter name, which this method is not given. This causes a weird error which users may find hard to understand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
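A self-contained sketch of the gist of a possible fix (names and signatures are illustrative, not the actual spark.ml code): keep the type-mismatch message intact even when the parameter description cannot be looked up from the column name.
{code}
import scala.util.Try

// Hypothetical helper: paramDescription stands in for whatever lookup the real code uses.
// Wrapping it in Try means a failed lookup degrades to a placeholder instead of throwing
// and hiding the actual "wrong column type" error from the user.
def checkInputColumn(colName: String, actualType: String, expectedType: String,
                     paramDescription: String => String): Unit = {
  val desc = Try(paramDescription(colName)).getOrElse(s"<no description found for $colName>")
  require(actualType == expectedType,
    s"Input column $colName must be of type $expectedType but was $actualType. " +
      s"Column param description: $desc")
}
{code}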
[jira] [Commented] (SPARK-6297) EventLog permissions are always set to 770 which causes problems
[ https://issues.apache.org/jira/browse/SPARK-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357905#comment-14357905 ] Apache Spark commented on SPARK-6297: - User 'lustefaniak' has created a pull request for this issue: https://github.com/apache/spark/pull/4989 EventLog permissions are always set to 770 which causes problems Key: SPARK-6297 URL: https://issues.apache.org/jira/browse/SPARK-6297 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: All, tested in lxc running different users without a common group Reporter: Lukas Stefaniak Priority: Trivial Labels: newbie In EventLoggingListener, event log file permissions are always explicitly set to 770. There is no way to override this. The problem appears as an exception thrown when the driver process and Spark master don't share the same user or group. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357838#comment-14357838 ] Marcelo Vanzin commented on SPARK-6229: --- /cc [~rxin] [~adav] Hey Reynold, Aaron, I started looking at this and it seems like there's no easy way to insert SASL encryption into the current channel pipeline due to the way the code is organized. On the client side, {{TransportClientFactory.createClient}} seems like the best candidate, but it would have to be changed so that {{SaslClientBootstrap}} can add a handler to the channel pipeline to do the encryption. On the server side, it seems more complicated. I couldn't find a place where the current SASL code has access to the channel, so it can't really set up a handler at the same layer as the client code. At first I thought I could just encrypt things in {{SaslRpcHandler}}, but that looks fishy since that handler is before all the framing and serialization handlers, and we kinda want to encrypt after those. So I'd like to suggest a bigger change that I think would make the code easier to use and at the same time allow this change: internalize all SASL handling in the network library. Basically, let the network library decide whether to add SASL to the picture by looking at the TransportConf; code using the library wouldn't need to care about it anymore (aside from providing a TransportConf and a SecretKeyHolder). So, instead of code like this (from StandaloneWorkerShuffleService): {code} private val transportContext: TransportContext = { val handler = if (useSasl) new SaslRpcHandler(blockHandler, securityManager) else blockHandler new TransportContext(transportConf, handler) } {code} You'd have just: {code} private val transportContext: TransportContext = new TransportContext(transportConf, securityManager, blockHandler) {code} And similarly in other places. What do you guys think? BTW here's some code that does this and is totally transparent to the code creating the RPC server / client: https://github.com/apache/hive/blob/trunk/spark-client/src/main/java/org/apache/hive/spark/client/rpc/SaslHandler.java And the two implementations: https://github.com/apache/hive/blob/trunk/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L197 https://github.com/apache/hive/blob/trunk/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L390 That's just to give an idea of how it could be done internally in the network library, without the consumers of the library having to care about it. But please let me know if I missed something obvious here. Support encryption in network/common module --- Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
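To make the pipeline idea concrete, a minimal Netty-level sketch (not Spark's network module; the encrypting handler is a stand-in for whatever wraps frames with saslClient.wrap after negotiation succeeds):
{code}
import io.netty.channel.{Channel, ChannelHandler}

// Hypothetical helper: adding the encrypting handler first (at the pipeline head) means it
// sees outbound buffers last, i.e. after the existing framing/serialization handlers have
// run, and sees inbound bytes first, before they are deframed and deserialized.
def installSaslEncryption(channel: Channel, encryptionHandler: ChannelHandler): Unit = {
  channel.pipeline().addFirst("saslEncryption", encryptionHandler)
}
{code}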
[jira] [Commented] (SPARK-6294) PySpark task may hang while call take() on in Java/Scala
[ https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357872#comment-14357872 ] Apache Spark commented on SPARK-6294: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4987 PySpark task may hang while call take() on in Java/Scala Key: SPARK-6294 URL: https://issues.apache.org/jira/browse/SPARK-6294 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical {code} rdd = sc.parallelize(range(120)).map(lambda x: str(x)) rdd._jrdd.first() {code} There is the stacktrace while hanging: {code} Executor task launch worker-5 daemon prio=10 tid=0x7f8fd01a9800 nid=0x566 in Object.wait() [0x7f90481d7000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x000630929340 (a org.apache.spark.api.python.PythonRDD$WriterThread) at java.lang.Thread.join(Thread.java:1281) - locked 0x000630929340 (a org.apache.spark.api.python.PythonRDD$WriterThread) at java.lang.Thread.join(Thread.java:1355) at org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78) at org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76) at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:58) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6296) Add equals operator to Column (v1.3)
[ https://issues.apache.org/jira/browse/SPARK-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357894#comment-14357894 ] Apache Spark commented on SPARK-6296: - User 'vlyubin' has created a pull request for this issue: https://github.com/apache/spark/pull/4988 Add equals operator to Column (v1.3) Key: SPARK-6296 URL: https://issues.apache.org/jira/browse/SPARK-6296 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357610#comment-14357610 ] mgdadv edited comment on SPARK-6189 at 3/11/15 10:20 PM: - While the dot is legal in R and SQL, I don't think there is a nice way of making it legal in python. So at least in the Spark python code, I think something should be done about it. I just realized that the automatic renaming can cause problems if that entry already exists. For example, what if GNP_deflator was already in the data set and then GNP.deflator gets changed. I think the best thing to do is to just warn the user by printing out a warning message. I have changed the patch accordingly. Here is some example code for pyspark: import StringIO import pandas as pd df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203")) spdf = sqlCtx.createDataFrame(df) spdf.take(2) spdf[spdf.a==102].take(2) So far this works, but this fails: spdf[spdf.a.b==101].take(2) In pandas df.a.b doesn't work either, but the fields can be accessed via the string a.b, i.e.: df["a.b"] was (Author: mgdadv): While the dot is legal in R and SQL, I don't think there is a nice way of making it legal in python. So at least in the Spark python code, I think something should be done about it. I just realized that the automatic renaming can cause problems if that entry already exists. For example, what if GNP_deflator was already in the data set and then GNP.deflator gets changed. I think the best thing to do is to just warn the user by printing out a warning message. I have changed the patch accordingly. Here is some example code for pyspark: import pandas as pd df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203")) spdf = sqlCtx.createDataFrame(df) spdf.take(2) spdf[spdf.a==102].take(2) So far this works, but this fails: spdf[spdf.a.b==101].take(2) In pandas df.a.b doesn't work either, but the fields can be accessed via the string a.b, i.e.: df["a.b"] Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then use toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357713#comment-14357713 ] mgdadv commented on SPARK-6189: --- To expand on the example, this works: spdf[spdf["a"]==102].take(2) while this fails: spdf[spdf["a.b"]==101].take(2) Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then use toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6293) SQLContext.implicits should provide automatic conversion for RDD[Row]
[ https://issues.apache.org/jira/browse/SPARK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6293: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-6116 SQLContext.implicits should provide automatic conversion for RDD[Row] - Key: SPARK-6293 URL: https://issues.apache.org/jira/browse/SPARK-6293 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley When a DataFrame is converted to an RDD[Row], it should be easier to convert it back to a DataFrame via toDF. E.g.: {code} val df: DataFrame = myRDD.toDF(col1, col2) // This works for types like RDD[scala.Tuple2[...]] val splits = df.rdd.randomSplit(...) val split0: RDD[Row] = splits(0) val df0 = split0.toDF(col1, col2) // This fails {code} The failure happens because SQLContext.implicits does not provide an automatic conversion for Rows. (It does handle Products, but Row does not implement Product.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
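Until such an implicit exists, a hedged workaround sketch using the existing public API (createDataFrame over an RDD[Row] plus the schema carried over from the original DataFrame); sqlContext and df are assumed to exist as in the snippet above.
{code}
// Keep the schema from the original DataFrame and reattach it after the RDD-level operation.
val schema = df.schema
val splits = df.rdd.randomSplit(Array(0.8, 0.2), seed = 42L)
val df0 = sqlContext.createDataFrame(splits(0), schema)
val df1 = sqlContext.createDataFrame(splits(1), schema)
{code}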
[jira] [Created] (SPARK-6295) spark.ml.Evaluator should have evaluate method not taking ParamMap
Joseph K. Bradley created SPARK-6295: Summary: spark.ml.Evaluator should have evaluate method not taking ParamMap Key: SPARK-6295 URL: https://issues.apache.org/jira/browse/SPARK-6295 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor spark.ml.Evaluator requires that the user pass a ParamMap, but it is not always necessary. It should have a default implementation with no ParamMap (similar to fit() and transform() in Estimator and Transformer). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6296) Add equals operator to Column (v1.3)
Volodymyr Lyubinets created SPARK-6296: -- Summary: Add equals operator to Column (v1.3) Key: SPARK-6296 URL: https://issues.apache.org/jira/browse/SPARK-6296 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6294) PySpark task may hang while call take() on in Java/Scala
[ https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6294: - Target Version/s: 1.2.2, 1.4.0, 1.3.1 (was: 1.2.2, 1.3.1) PySpark task may hang while call take() on in Java/Scala Key: SPARK-6294 URL: https://issues.apache.org/jira/browse/SPARK-6294 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical {code} rdd = sc.parallelize(range(120)).map(lambda x: str(x)) rdd._jrdd.first() {code} There is the stacktrace while hanging: {code} Executor task launch worker-5 daemon prio=10 tid=0x7f8fd01a9800 nid=0x566 in Object.wait() [0x7f90481d7000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x000630929340 (a org.apache.spark.api.python.PythonRDD$WriterThread) at java.lang.Thread.join(Thread.java:1281) - locked 0x000630929340 (a org.apache.spark.api.python.PythonRDD$WriterThread) at java.lang.Thread.join(Thread.java:1355) at org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78) at org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76) at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:58) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356399#comment-14356399 ] Yu Ishikawa commented on SPARK-2429: [~rnowling] I apologize for the delay in replying. I'm still working on this. I will reply to the feedback from Freeman and Owen ASAP. By the way, I have a question about the future course of action. I implemented another algorithm which is more scalable and 1000 times faster than the current one. Should we continue the PR, or replace it with the new implementation? https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark It is difficult to run the current one with a large number of clusters as the argument, such as 1., because the dividing processes are executed one-by-one. The dividing processes of the new one run in parallel. That's why it is more scalable and much faster than the current one. thanks Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Labels: clustering Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356428#comment-14356428 ] Meethu Mathew commented on SPARK-6227: -- I am interested in working on this ticket. Could anyone assign it to me? PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot Priority: Minor The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
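For reference, the Scala API that the ticket asks to expose to Python already exists on RowMatrix; a small usage sketch (assuming an existing SparkContext named sc, with made-up example rows):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)

val pc = mat.computePrincipalComponents(2)   // PCA: top 2 principal components
val svd = mat.computeSVD(2, computeU = true) // SVD: top 2 singular values and vectors
{code}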
[jira] [Commented] (SPARK-6278) Mention the change of step size in the migration guide
[ https://issues.apache.org/jira/browse/SPARK-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356447#comment-14356447 ] Apache Spark commented on SPARK-6278: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4978 Mention the change of step size in the migration guide -- Key: SPARK-6278 URL: https://issues.apache.org/jira/browse/SPARK-6278 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We changed the objective from 1/n .. to 1/(2n) ... in 1.3, so using the same step size will lead to different results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356478#comment-14356478 ] Jianshi Huang commented on SPARK-6277: -- I see. Not referencing Hadoop config is fine, but how about env variables? I quite often want to change one setting for particular tasks, and editing spark-defaults.conf every time is inconvenient. Env variables are a good fit here because of their dynamic scope. Typesafe's Config has a similar feature for env variables, and it can even allow them to override previous settings. https://github.com/typesafehub/config#optional-system-or-env-variable-overrides Jianshi Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf - Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.0, 1.2.1 Reporter: Jianshi Huang I need to set spark.local.dir to use the user's local home instead of /tmp, but currently spark-defaults.conf only allows constant values. What I want to do is to write: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6281) Support incremental updates for Graph
Takeshi Yamamuro created SPARK-6281: --- Summary: Support incremental updates for Graph Key: SPARK-6281 URL: https://issues.apache.org/jira/browse/SPARK-6281 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Add an API to efficiently append new vertices and edges to an existing Graph, e.g., Graph#append(newVerts: RDD[(VertexId, VD)], newEdges: RDD[Edge[ED]], defaultVertexAttr: VD). This is useful for time-evolving graphs; new vertices and edges are built from streaming data through Spark Streaming, and then incrementally appended to an existing graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
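A hedged sketch of what such an append amounts to today with the existing GraphX API, by rebuilding the graph from unioned RDDs; the proposal is precisely about avoiding this full rebuild, and naiveAppend is an illustrative name, not a GraphX method.
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Non-incremental baseline: union the new RDDs with the old ones and rebuild the whole graph.
def naiveAppend[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    newVerts: RDD[(VertexId, VD)],
    newEdges: RDD[Edge[ED]],
    defaultVertexAttr: VD): Graph[VD, ED] = {
  Graph(g.vertices.union(newVerts), g.edges.union(newEdges), defaultVertexAttr)
}
{code}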
[jira] [Created] (SPARK-6276) beeline client class cast exception on partitioned table Spark SQL
Sunil created SPARK-6276: Summary: beeline client class cast exception on partitioned table Spark SQL Key: SPARK-6276 URL: https://issues.apache.org/jira/browse/SPARK-6276 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Sunil When I access a partitioned Hive table using the Spark Thrift server and Beeline, the following error is thrown on the partitioned column: java.lang.RuntimeException: java.lang.ClassCastException at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) Results are shown if I remove the partitioned column from the select. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6278) Mention the change of step size in the migration guide
Xiangrui Meng created SPARK-6278: Summary: Mention the change of step size in the migration guide Key: SPARK-6278 URL: https://issues.apache.org/jira/browse/SPARK-6278 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We changed the objective from 1/n .. to 1/(2n) ... in 1.3, so using the same step size will lead to different results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356452#comment-14356452 ] Takeshi Yamamuro commented on SPARK-6022: - Yeah, ISTM it'd be better to add set difference as Graph#minus. GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below, as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against.
{code}
test("diff functionality with small concrete values") {
  withSpark { sc =>
    val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
    // setA := Set((0L, 0), (1L, 1))
    val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt + 2)))
    // setB := Set((1L, 3), (2L, 4))
    val diff = setA.diff(setB)
    assert(diff.collect.toSet == Set((2L, 4)))
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6244) Implement VectorSpace to easy create a complicated feature vector
[ https://issues.apache.org/jira/browse/SPARK-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356464#comment-14356464 ] Sean Owen commented on SPARK-6244: -- You can reuse this JIRA but you would have to modify the title and description. You should close your PR and make a new one. At best we are talking about adding a few utility methods to {{Vectors}}, but what is the use case for these? Yes, I can imagine some, but would it refactor repeated usages in the code? I don't think MLlib intends to contain a full suite of vector utility methods, as that is what Breeze is used for. I'd rather not add utility code just for its own sake, but if there's a clear common use for these things it could make sense. Implement VectorSpace to easy create a complicated feature vector - Key: SPARK-6244 URL: https://issues.apache.org/jira/browse/SPARK-6244 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Kirill A. Korinskiy Priority: Minor VectorSpace is a wrapper that implements three operations:
- concat -- concatenate all vectors into a single vector
- sum -- sum of the vectors
- scaled -- multiply each vector by a scalar
Example of usage:
```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.VectorSpace

// Create a new VectorSpace with one dense vector.
val vs = VectorSpace.create(Vectors.dense(1.0, 0.0, 3.0))

// Add to the vector space a scaled copy of itself.
val vs2 = vs.add(vs.scaled(-1d))

// Concat the vectors from the vector space, result: Vectors.dense(1.0, 0.0, 3.0, -1.0, 0.0, -3.0)
val concat = vs2.concat

// Take the sum over the vector space, result: Vectors.dense(0.0, 0.0, 0.0)
val sum = vs2.sum
```
This wrapper is very useful when creating a complicated feature vector from structured objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
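For comparison, a hedged sketch of how the same three operations can be expressed today with Breeze (which MLlib uses internally), assuming plain dense vectors; this is not a proposal for the MLlib public API.
{code}
import breeze.linalg.DenseVector

val a = DenseVector(1.0, 0.0, 3.0)
val b = a * -1.0                       // scaled: multiply by a scalar

val concat = DenseVector.vertcat(a, b) // concat: DenseVector(1.0, 0.0, 3.0, -1.0, 0.0, -3.0)
val sum = a + b                        // sum: DenseVector(0.0, 0.0, 0.0)
{code}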
[jira] [Created] (SPARK-6280) Remove Akka systemName from Spark
Shixiong Zhu created SPARK-6280: --- Summary: Remove Akka systemName from Spark Key: SPARK-6280 URL: https://issues.apache.org/jira/browse/SPARK-6280 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Shixiong Zhu `systemName` is an Akka concept. An RPC implementation does not need to support it. We can hard-code the system name in Spark and hide it in the internal Akka RPC implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
Jianshi Huang created SPARK-6277: Summary: Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.2.1, 1.3.0 Reporter: Jianshi Huang I need to set spark.local.dir to the user's local home instead of /tmp, but spark-defaults.conf currently only allows constant values. What I want to write is: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6185) Delete repeated TOKEN. TOK_CREATEFUNCTION has existed at Line 84;
[ https://issues.apache.org/jira/browse/SPARK-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6185. -- Resolution: Duplicate Rather than make a new JIRA, you could have expanded the scope of this one by changing its title and description. At least please close the old one if you open a new one for a different take on the same change. Delete repeated TOKEN. TOK_CREATEFUNCTION has existed at Line 84; - Key: SPARK-6185 URL: https://issues.apache.org/jira/browse/SPARK-6185 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: DoingDone9 Priority: Trivial TOK_CREATEFUNCTION already exists at Line 84: Line 84: TOK_CREATEFUNCTION, Line 85: TOK_DROPFUNCTION, Line 106: TOK_CREATEFUNCTION, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6279) Miss expressions flag s at logging string
zzc created SPARK-6279: -- Summary: Miss expressions flag s at logging string Key: SPARK-6279 URL: https://issues.apache.org/jira/browse/SPARK-6279 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: zzc Priority: Minor In KafkaRDD.scala, the `s` string-interpolation flag is missing from a logging string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6279) Miss expressions flag s at logging string
[ https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356472#comment-14356472 ] Sean Owen commented on SPARK-6279: -- (Don't bother making a JIRA for trivial issues, where the description of the problem is largely identical to the fix.) Miss expressions flag s at logging string Key: SPARK-6279 URL: https://issues.apache.org/jira/browse/SPARK-6279 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: zzc Priority: Minor In KafkaRDD.scala, the `s` string-interpolation flag is missing from a logging string. As a result the log prints the literal text `Beginning offset ${part.fromOffset} is the same as ending offset ` rather than the interpolated value, e.g. `Beginning offset 111 is the same as ending offset `. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
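For illustration, a hedged, self-contained sketch of the class of fix being described (schematic only, not the exact KafkaRDD.scala source): without the `s` prefix Scala does not interpolate `${...}` placeholders inside a string literal.
{code}
// Schematic demonstration of the missing `s` interpolator, runnable in a Scala REPL.
case class Part(fromOffset: Long)
val part = Part(111L)

// Without the interpolator the placeholder is emitted literally:
println("Beginning offset ${part.fromOffset} is the same as ending offset ")
// prints: Beginning offset ${part.fromOffset} is the same as ending offset

// With the s interpolator the value is substituted:
println(s"Beginning offset ${part.fromOffset} is the same as ending offset ")
// prints: Beginning offset 111 is the same as ending offset
{code}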
[jira] [Commented] (SPARK-4820) Spark build encounters File name too long on some encrypted filesystems
[ https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356688#comment-14356688 ] Theodore Vasiloudis commented on SPARK-4820: Can confirm for 1.2.1 on encfs as well; had to build in /tmp instead. Spark build encounters File name too long on some encrypted filesystems - Key: SPARK-4820 URL: https://issues.apache.org/jira/browse/SPARK-4820 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell This was reported by Luchesar Cekov on github along with a proposed fix. The fix has some potential downstream issues (it will modify the classnames) so until we understand better how many users are affected we aren't going to merge it. However, I'd like to include the issue and workaround here. If you encounter this issue please comment on the JIRA so we can assess the frequency. The issue produces this error:
{code}
[error] == Expanded type of tree ==
[error]
[error] ConstantType(value = Constant(Throwable))
[error]
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long
[error] two errors found
{code}
The workaround in Maven is to add under the compile options:
{code}
+ <arg>-Xmax-classfile-name</arg>
+ <arg>128</arg>
{code}
In SBT add:
{code}
+ scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356699#comment-14356699 ] Tien-Dung LE commented on SPARK-5499: - Many thanks for your reply. I did the same as your suggestion and it worked. iterative computing with 1000 iterations causes stage failure - Key: SPARK-5499 URL: https://issues.apache.org/jira/browse/SPARK-5499 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Tien-Dung LE I got the error org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError when executing an action with 1000 transformations. Here is a code snippet to reproduce the error:
{code}
import org.apache.spark.rdd.RDD

var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
var newPair: RDD[(Long, Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356741#comment-14356741 ] Beniamino commented on SPARK-2344: -- Hi, yes, the computation of the next centers is done on the fly, avoiding storing the membership matrix. The algorithm already works; the only thing that might be added is a runs parameter, such as the one in the K-Means implementation. I've already implemented the Fukuyama-Sugeno validity index computation too. Beniamino Add Fuzzy C-Means algorithm to MLlib Key: SPARK-2344 URL: https://issues.apache.org/jira/browse/SPARK-2344 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Alex Priority: Minor Labels: clustering Original Estimate: 1m Remaining Estimate: 1m I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. FCM is very similar to K-Means, which is already implemented, and they differ only in the degree of relationship each point has with each cluster: in FCM the relationship is in a range of [0..1], whereas in K-Means it is 0/1. As part of the implementation I would like to: - create a base class for K-Means and FCM - implement the relationship for each algorithm differently (in its class) I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
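For reference, the standard textbook FCM updates behind the "degree of relationship" described above, with fuzzifier m > 1, centers c_j, and points x_i; this is the usual formulation, not necessarily the exact one in the proposed patch:
{code}
u_{ij} = \left( \sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}} \right)^{-1},
\qquad
c_j = \frac{\sum_i u_{ij}^{m} \, x_i}{\sum_i u_{ij}^{m}}
{code}
As m approaches 1 the memberships approach the hard 0/1 assignments of K-Means.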
[jira] [Closed] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE closed SPARK-5499. --- The checkpoint mechanism can solve the issue. iterative computing with 1000 iterations causes stage failure - Key: SPARK-5499 URL: https://issues.apache.org/jira/browse/SPARK-5499 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Tien-Dung LE I got the error org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError when executing an action with 1000 transformations. Here is a code snippet to reproduce the error:
{code}
import org.apache.spark.rdd.RDD

var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
var newPair: RDD[(Long, Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
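A hedged sketch of the checkpointing workaround mentioned in the close comment, assuming a writable checkpoint directory is available; truncating the lineage every so often keeps task serialization from overflowing the stack. The interval of 100 is an arbitrary choice for illustration.
{code}
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("/tmp/spark-checkpoints") // assumption: a writable checkpoint location

var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 1000) {
  pair = pair.map(_.swap)
  if (i % 100 == 0) {
    pair.checkpoint()   // cut the lineage here
    pair.count()        // force materialization so the checkpoint actually happens
  }
}
println("Count = " + pair.count())
{code}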
[jira] [Commented] (SPARK-6067) Spark sql hive dynamic partitions job will fail if task fails
[ https://issues.apache.org/jira/browse/SPARK-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356745#comment-14356745 ] Apache Spark commented on SPARK-6067: - User 'baishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/4980 Spark sql hive dynamic partitions job will fail if task fails - Key: SPARK-6067 URL: https://issues.apache.org/jira/browse/SPARK-6067 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jason Hubbard Priority: Minor Attachments: job.log When inserting into a Hive table from Spark SQL while using dynamic partitioning, if a task fails its retries will continue to fail and eventually fail the job. /mytable/.hive-staging_hive_2015-02-27_11-53-19_573_222-3/-ext-1/partition=2015-02-04/part-1 for client ip already exists A retried task may need to clean up the output of the previously failed attempt before writing to the same location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356507#comment-14356507 ] Yu Ishikawa commented on SPARK-5992: What kind of LSH algorithms should we support at first?
- Bit sampling for Hamming distance
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
- Min Hash
Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
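As a concrete point of reference for the discussion, a hedged, self-contained sketch of one of the listed candidates, random projection (sign-based) LSH for cosine similarity; names and parameters are illustrative only and unrelated to any existing MLlib API.
{code}
import scala.util.Random

object RandomProjectionLSH {
  // Draw numBits random hyperplanes in R^dim.
  def randomHyperplanes(dim: Int, numBits: Int, seed: Long = 42L): Array[Array[Double]] = {
    val rnd = new Random(seed)
    Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))
  }

  // The LSH signature is the sign pattern of the projections onto the hyperplanes.
  def signature(v: Array[Double], planes: Array[Array[Double]]): Array[Int] =
    planes.map { p =>
      val dot = p.zip(v).map { case (a, b) => a * b }.sum
      if (dot >= 0) 1 else 0
    }
}

// Vectors pointing in similar directions tend to get similar signatures:
// val planes = RandomProjectionLSH.randomHyperplanes(dim = 3, numBits = 16)
// RandomProjectionLSH.signature(Array(1.0, 0.0, 3.0), planes)
{code}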
[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356450#comment-14356450 ] Sean Owen commented on SPARK-6277: -- Although it sounds nice, in practice this raises a number of issues, like having to then parse Hadoop config to parse the core Spark config, dealing with escapes, etc. Suddenly Hadoop config becomes required. For this particular problem, I don't think per-user home directories should be used, though if you must, you've already identified how to do that. Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf - Key: SPARK-6277 URL: https://issues.apache.org/jira/browse/SPARK-6277 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.3.0, 1.2.1 Reporter: Jianshi Huang I need to set spark.local.dir to use user local home instead of /tmp, but currently spark-defaults.conf can only allow constant values. What I want to do is to write: bq. spark.local.dir /home/${user.name}/spark/tmp or bq. spark.local.dir /home/${USER}/spark/tmp Otherwise I would have to hack bin/spark-class and pass the option through -Dspark.local.dir Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5814) Remove JBLAS from runtime dependencies
[ https://issues.apache.org/jira/browse/SPARK-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356451#comment-14356451 ] Apache Spark commented on SPARK-5814: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4699 Remove JBLAS from runtime dependencies -- Key: SPARK-5814 URL: https://issues.apache.org/jira/browse/SPARK-5814 Project: Spark Issue Type: Dependency upgrade Components: GraphX, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng We are using mixed breeze/netlib-java and jblas code in MLlib. They take different approaches to utilizing native libraries, and we should keep only one of them. netlib-java has a clear separation between the Java implementation and the native JNI libraries, while JBLAS packs statically linked binaries that cause license issues (SPARK-5669). So we want to remove JBLAS from the Spark runtime. One issue with this approach is that we have JBLAS' DoubleMatrix exposed (by mistake) in SVDPlusPlus of GraphX. We should deprecate it and replace `DoubleMatrix` with `Array[Double]`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4924. Resolution: Fixed Fix Version/s: 1.4.0 Glad to finally have this in. Thanks for all the hard work [~vanzin]! Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.4.0 Attachments: spark-launcher.txt One of the questions we run into rather commonly is "how do I start a Spark application from my Java/Scala program?". There currently isn't a good answer to that:
- Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode)
- Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts
- Calling the shell script directly is doable, but sort of ugly from an API point of view.
I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
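For readers wondering what the new library looks like in practice, a hedged sketch using the launcher API added by this change (class and method names as in the 1.4 `org.apache.spark.launcher` package); the Spark home, jar, and main class below are hypothetical.
{code}
import org.apache.spark.launcher.SparkLauncher

object LaunchExample {
  def main(args: Array[String]): Unit = {
    // Build and start a spark-submit invocation programmatically.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")              // hypothetical Spark installation
      .setAppResource("/path/to/my-app.jar")   // hypothetical application jar
      .setMainClass("com.example.MyApp")       // hypothetical main class
      .setMaster("local[*]")
      .setConf("spark.executor.memory", "2g")
      .launch()                                // returns a java.lang.Process

    val exitCode = process.waitFor()
    println(s"Application finished with exit code $exitCode")
  }
}
{code}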
[jira] [Resolved] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3438. -- Resolution: Duplicate Support for accessing secured HDFS in Standalone Mode - Key: SPARK-3438 URL: https://issues.apache.org/jira/browse/SPARK-3438 Project: Spark Issue Type: New Feature Components: Deploy, Spark Core Affects Versions: 1.0.2 Reporter: Zhanfeng Huo Access to secured HDFS is currently supported in YARN using YARN's built-in security mechanism. In YARN mode, a user application is authenticated when it is submitted; it then acquires delegation tokens and ships them (via YARN) securely to workers. In Standalone mode, it would be nice to support a mechanism for accessing HDFS where we rely on a single shared secret to authenticate communication in the standalone cluster.
1. A company is running a standalone cluster.
2. They are fine if all Spark jobs in the cluster share a global secret, i.e. all Spark jobs can trust one another.
3. They are able to provide a Hadoop login on the driver node via a keytab or kinit. They want tokens from this login to be distributed to the executors to allow access to secure HDFS.
4. They also don't want to trust the network on the cluster, i.e. they don't want to allow someone to fetch HDFS tokens easily over a known protocol, without authentication.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6279) Miss expressions flag s at logging string
[ https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-6279: --- Description: In KafkaRDD.scala, the `s` string-interpolation flag is missing from a logging string. As a result the log prints the literal text `Beginning offset ${part.fromOffset} is the same as ending offset ` rather than the interpolated value, e.g. `Beginning offset 111 is the same as ending offset `. was:In KafkaRDD.scala, Miss expressions flag s at logging string Miss expressions flag s at logging string Key: SPARK-6279 URL: https://issues.apache.org/jira/browse/SPARK-6279 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: zzc Priority: Minor In KafkaRDD.scala, the `s` string-interpolation flag is missing from a logging string. As a result the log prints the literal text `Beginning offset ${part.fromOffset} is the same as ending offset ` rather than the interpolated value, e.g. `Beginning offset 111 is the same as ending offset `. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348309#comment-14348309 ] Masayoshi TSUZUKI edited comment on SPARK-5389 at 3/11/15 8:18 AM: --- The crashed program findstr.exe in the screenshot seems not to be the one in the C:\Windows\System32 directory. I'm not sure, but I think C:\Windows\System32\findstr.exe in Windows 7 shows (QGREP) utility but not (grep) utility. (Although I don't know the exact English name since I'm not using an English version of Windows.) [~yanakad], [~s@r@v@n@n], and [SPARK-6084] seem to be reporting similar problems. Their workarounds show that the cause might be a polluted %PATH%. The collision of find.exe is a well-known phenomenon in Windows, but as on Linux, the order of %PATH% controls which program is called. If you face a similar problem, you can check by executing the command {{-whereas find-}} {{where find}} whether the proper program find.exe is used. Would you mind attaching the result of these commands? {quote} where find where findstr echo %PATH% {quote} was (Author: tsudukim): The crashed program findstr.exe in the screenshot seems not to be the one in the C:\Windows\System32 directory. I'm not sure, but I think C:\Windows\System32\findstr.exe in Windows 7 shows (QGREP) utility but not (grep) utility. (Although I don't know the exact English name since I'm not using an English version of Windows.) [~yanakad], [~s@r@v@n@n], and [SPARK-6084] seem to be reporting similar problems. Their workarounds show that the cause might be a polluted %PATH%. The collision of find.exe is a well-known phenomenon in Windows, but as on Linux, the order of %PATH% controls which program is called. If you face a similar problem, you can check by executing the command {{whereas find}} whether the proper program find.exe is used. Would you mind attaching the result of these commands? {quote} where find where findstr echo %PATH% {quote} spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in a DOS prompt on Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2. Marking as trivial since calling spark-shell2.cmd also works fine. Attaching a screenshot since the error isn't very useful:
{code}
spark-1.2.0-bin-cdh4\bin\spark-shell.cmd
else was unexpected at this time.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6279) Miss expressions flag s at logging string
[ https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356473#comment-14356473 ] Apache Spark commented on SPARK-6279: - User 'zzcclp' has created a pull request for this issue: https://github.com/apache/spark/pull/4979 Miss expressions flag s at logging string Key: SPARK-6279 URL: https://issues.apache.org/jira/browse/SPARK-6279 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: zzc Priority: Minor In KafkaRDD.scala, the `s` string-interpolation flag is missing from a logging string. As a result the log prints the literal text `Beginning offset ${part.fromOffset} is the same as ending offset ` rather than the interpolated value, e.g. `Beginning offset 111 is the same as ending offset `. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6225) Resolve most build warnings, 1.3.0 edition
[ https://issues.apache.org/jira/browse/SPARK-6225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6225. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4950 [https://github.com/apache/spark/pull/4950] Resolve most build warnings, 1.3.0 edition -- Key: SPARK-6225 URL: https://issues.apache.org/jira/browse/SPARK-6225 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL, Streaming Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 Post-1.3.0, I think it would be a good exercise to resolve a number of build warnings that have accumulated recently. See for example efforts begun at https://github.com/apache/spark/pull/4948 https://github.com/apache/spark/pull/4900 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6228) Provide SASL support in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6228. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4953 [https://github.com/apache/spark/pull/4953] Provide SASL support in network/common module - Key: SPARK-6228 URL: https://issues.apache.org/jira/browse/SPARK-6228 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Fix For: 1.4.0 Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6228) Provide SASL support in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6228: - Priority: Minor (was: Major) Assignee: Marcelo Vanzin Provide SASL support in network/common module - Key: SPARK-6228 URL: https://issues.apache.org/jira/browse/SPARK-6228 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.4.0 Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
[ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4423. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4696 [https://github.com/apache/spark/pull/4696] Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior --- Key: SPARK-4423 URL: https://issues.apache.org/jira/browse/SPARK-4423 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen Assignee: Ilya Ganelin Fix For: 1.4.0 {{foreach}} seems to be a common source of confusion for new users: in {{local}} mode, {{foreach}} can be used to update local variables on the driver, but programs that do this will not work properly when executed on clusters, since the {{foreach}} will update per-executor variables (note that this _will_ work correctly for accumulators, but not for other types of mutable objects). Similarly, I've seen users become confused when {{.foreach(println)}} doesn't print to the driver's standard output. At a minimum, we should improve the documentation to warn users against unsafe uses of {{foreach}} that won't work properly when transitioning from local mode to a real cluster. We might also consider changes to local mode so that its behavior more closely matches the cluster modes; this will require some discussion, though, since any change of behavior here would technically be a user-visible backwards-incompatible change (I don't think that we made any explicit guarantees about the current local-mode behavior, but someone might be relying on the current implicit behavior). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
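To illustrate the confusion the documentation should warn about, a hedged, self-contained sketch (variable names are made up): the closure-captured counter only appears to work in local mode, whereas an accumulator behaves the same in local and cluster modes.
{code}
val rdd = sc.parallelize(1 to 100)

// Anti-pattern: mutating a driver-side variable inside foreach.
// In local mode this may print 5050, but on a cluster each executor
// updates its own copy of the closure and the driver-side value stays 0.
var counter = 0
rdd.foreach(x => counter += x)
println(counter)

// Correct in both modes: use an accumulator.
val acc = sc.accumulator(0)
rdd.foreach(x => acc += x)
println(acc.value)  // 5050
{code}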