[jira] [Resolved] (SPARK-1293) Support for reading/writing complex types in Parquet
[ https://issues.apache.org/jira/browse/SPARK-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1293. Resolution: Fixed Fix Version/s: 1.0.1 > Support for reading/writing complex types in Parquet > > > Key: SPARK-1293 > URL: https://issues.apache.org/jira/browse/SPARK-1293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Andre Schumacher > Fix For: 1.0.1, 1.1.0 > > > Complex types include: Arrays, Maps, and Nested rows (structs). -- This message was sent by Atlassian JIRA (v6.2#6252)
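A minimal sketch of what this feature enables, assuming a Spark 1.x SQLContext with a running SparkContext `sc`; the case classes and output path are hypothetical, for illustration only, and exact query support for nested fields varies by version:

```scala
// Hedged sketch: round-tripping arrays, maps, and nested rows (structs)
// through Parquet with the Spark SQL 1.x API.
import org.apache.spark.sql.SQLContext

case class Inner(name: String, value: Int)          // nested row (struct)
case class Record(id: Int,
                  tags: Seq[String],                // array
                  props: Map[String, Int],          // map
                  inner: Inner)

val sqlContext = new SQLContext(sc)
import sqlContext._   // brings in createSchemaRDD and sql()

val records = sc.parallelize(
  Seq(Record(1, Seq("a", "b"), Map("x" -> 1), Inner("n", 2))))
records.saveAsParquetFile("/tmp/complex.parquet")   // write the complex types

val loaded = sqlContext.parquetFile("/tmp/complex.parquet")  // read them back
loaded.registerAsTable("records")
sql("SELECT * FROM records").collect()
```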
[jira] [Resolved] (SPARK-768) Fail a task when the remote block it is fetching is not serializable
[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-768. --- Resolution: Cannot Reproduce Assignee: Raymond Liu (was: Reynold Xin) > Fail a task when the remote block it is fetching is not serializable > > > Key: SPARK-768 > URL: https://issues.apache.org/jira/browse/SPARK-768 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Raymond Liu > > When a task is fetching a remote block (e.g. locality wait exceeded), and if > the block is not serializable, the task would hang. > The block manager should fail the task instead of hanging the task ... once > the task fails, eventually it will get scheduled to the local node to be > executed successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable
[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038536#comment-14038536 ] Reynold Xin commented on SPARK-768: --- Thanks for confirming. I'm going to close this issue then. > Fail a task when the remote block it is fetching is not serializable > > > Key: SPARK-768 > URL: https://issues.apache.org/jira/browse/SPARK-768 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > > When a task is fetching a remote block (e.g. locality wait exceeded), and if > the block is not serializable, the task would hang. > The block manager should fail the task instead of hanging the task ... once > the task fails, eventually it will get scheduled to the local node to be > executed successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable
[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038534#comment-14038534 ] Raymond Liu commented on SPARK-768: --- Hi Reynold, if this is the first case, then I think, yes, it won't hang, at least from what I observed in my test and from the code in this path. Only the recompute might be a problem: if I do the same thing on the cached RDD for many iterations, eventually all partitions will have a local block stored on each node. We can either accept this behavior, or modify the block ack message to identify this specific case rather than returning None as block not found. > Fail a task when the remote block it is fetching is not serializable > > > Key: SPARK-768 > URL: https://issues.apache.org/jira/browse/SPARK-768 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > > When a task is fetching a remote block (e.g. locality wait exceeded), and if > the block is not serializable, the task would hang. > The block manager should fail the task instead of hanging the task ... once > the task fails, eventually it will get scheduled to the local node to be > executed successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2177) describe table result contains only one column
[ https://issues.apache.org/jira/browse/SPARK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2177. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 > describe table result contains only one column > -- > > Key: SPARK-2177 > URL: https://issues.apache.org/jira/browse/SPARK-2177 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Yin Huai > Fix For: 1.0.1, 1.1.0 > > > {code} > scala> hql("describe src").collect().foreach(println) > [key string None] > [value string None] > {code} > The result should contain 3 columns instead of one. This screws up JDBC or > even the downstream consumer of the Scala/Java/Python APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1477) Add the lifecycle interface
[ https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1477: --- Assignee: Guoqiang Li > Add the lifecycle interface > --- > > Key: SPARK-1477 > URL: https://issues.apache.org/jira/browse/SPARK-1477 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > In the current Spark code, many interfaces and classes define their own > stop and start > methods, e.g. [SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala], [HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala], [ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala]. > We should use a lifecycle interface to improve this code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1477) Add the lifecycle interface
[ https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1477: --- Target Version/s: 1.1.0 Affects Version/s: 1.0.1 > Add the lifecycle interface > --- > > Key: SPARK-1477 > URL: https://issues.apache.org/jira/browse/SPARK-1477 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Guoqiang Li > > In the current Spark code, many interfaces and classes define their own > stop and start > methods, e.g. [SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala], [HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala], [ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala]. > We should use a lifecycle interface to improve this code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038525#comment-14038525 ] chao.wu commented on SPARK-2201: Good idea. > Improve FlumeInputDStream's stability > - > > Key: SPARK-2201 > URL: https://issues.apache.org/jira/browse/SPARK-2201 > Project: Spark > Issue Type: Improvement >Reporter: sunshangchun > > Currently only one Flume receiver can work with FlumeInputDStream, and I am > willing to do some work to improve it. My ideas are as follows: > an IP and port pair denotes a physical host, and a logical host consists of one or > more physical hosts. > In our case, Spark Flume receivers bind themselves to a logical host when > started, and a Flume agent gets the physical hosts and pushes events to them. > Two classes are introduced: LogicalHostRouter supplies a map between logical > hosts and physical hosts, and LogicalHostRouterListener makes relation changes > watchable. > Some work needs to be done here: > 1. LogicalHostRouter and LogicalHostRouterListener can be implemented with > ZooKeeper: when a physical host starts, it creates a temporary node in ZK, and listeners just > watch those temporary nodes. > 2. When Spark FlumeReceivers start, they acquire a physical host > (localhost's IP and an idle port) and register themselves to ZooKeeper. > 3. A new Flume sink: in its appendEvents method, it gets the physical hosts > and pushes data to them in a round-robin manner. > Is this a feasible plan? Thanks. -- This message was sent by Atlassian JIRA (v6.2#6252)
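Step 2 of the proposal (a receiver registering its ip:port under a ZooKeeper node) could be sketched roughly as follows; the znode layout and helper name are hypothetical, not part of any existing Spark or Flume API:

```scala
// Hedged sketch of the proposed receiver registration using ephemeral znodes.
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

// An ephemeral-sequential node disappears automatically if the receiver dies,
// so a LogicalHostRouterListener watching the parent path sees the change.
def registerPhysicalHost(zk: ZooKeeper, logicalHost: String,
                         ip: String, port: Int): String =
  zk.create(s"/spark-flume/$logicalHost/receiver-",
            s"$ip:$port".getBytes("UTF-8"),
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL)
```

The returned path carries the sequence suffix ZooKeeper assigns, which a flume agent could enumerate to get the current set of physical hosts for round-robin pushing.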
[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable
[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038523#comment-14038523 ] Reynold Xin commented on SPARK-768: --- I think it was the first case. It used to be the case that when a block was kept in memory in deserialized form, and a task got scheduled to a remote node and tried to fetch the block, if the block was not serializable, the whole thing would hang. Maybe we have already fixed it. If you can verify this is no longer a problem, we can close the ticket. Thanks! > Fail a task when the remote block it is fetching is not serializable > > > Key: SPARK-768 > URL: https://issues.apache.org/jira/browse/SPARK-768 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > > When a task is fetching a remote block (e.g. locality wait exceeded), and if > the block is not serializable, the task would hang. > The block manager should fail the task instead of hanging the task ... once > the task fails, eventually it will get scheduled to the local node to be > executed successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2201) Improve FlumeInputDStream's stability
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sunshangchun updated SPARK-2201: Summary: Improve FlumeInputDStream's stability (was: Improve FlumeInputDStream) > Improve FlumeInputDStream's stability > - > > Key: SPARK-2201 > URL: https://issues.apache.org/jira/browse/SPARK-2201 > Project: Spark > Issue Type: Improvement >Reporter: sunshangchun > > Currently only one flume receiver can work with FlumeInputDStream and I am > willing to do some works to improve it, my ideas are described as follows: > a ip and port denotes a physical host, and a logical host consists of one or > more physical hosts > In our case, spark flume receivers bind themselves to a logical host when > started, and a flume agent get physical hosts and push events to them. > Two classes are introduced, LogicalHostRouter supplies a map between logical > host and physical host, and LogicalHostRouterListener let relation changes > watchable. > Some works need to be done here: > 1. LogicalHostRouter and LogicalHostRouterListener can be implemented by > zookeeper. when physical host started, create tmp node in zk, listeners just > watch those tmp nodes. > 2. when spark FlumeReceivers started, they acquire a physical host > (localhost's ip and an idle port) and register itself to zookeeper. > 3. A new flume sink. In the method of appendEvents, they get physical hosts > and push data to them in a round-robin manner. > Does it a feasible plan? Thanks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2212) HashJoin
[ https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038514#comment-14038514 ] Cheng Hao commented on SPARK-2212: -- https://github.com/apache/spark/pull/1147 > HashJoin > > > Key: SPARK-2212 > URL: https://issues.apache.org/jira/browse/SPARK-2212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038513#comment-14038513 ] Cheng Hao commented on SPARK-2215: -- The multi-way join implementation in Shark is quite complicated, but we have real cases showing it can improve join performance dramatically. I can start working on a prototype for it soon. > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-2215: - Description: Support the multi-way join (multiple table joins) in a single reduce stage if they have the same join key. (was: Support the multi-way join (multiple table joins) in a single reduce stage if they has the same join key.) > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join key. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-2215: - Description: Support the multi-way join (multiple table joins) in a single reduce stage if they have the same join keys. (was: Support the multi-way join (multiple table joins) in a single reduce stage if they have the same join key.) > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2216) Cost-based join reordering
[ https://issues.apache.org/jira/browse/SPARK-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038510#comment-14038510 ] Cheng Hao commented on SPARK-2216: -- Yes, this can be a big change. I think we need to add some sub-tasks for it and implement it gradually. > Cost-based join reordering > -- > > Key: SPARK-2216 > URL: https://issues.apache.org/jira/browse/SPARK-2216 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Cost-based join reordering -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2218) rename Equals to EqualTo in Spark SQL expressions
[ https://issues.apache.org/jira/browse/SPARK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2218: --- Summary: rename Equals to EqualTo in Spark SQL expressions (was: rename Equals to EqualsTo in Spark SQL expressions) > rename Equals to EqualTo in Spark SQL expressions > - > > Key: SPARK-2218 > URL: https://issues.apache.org/jira/browse/SPARK-2218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > The class name Equals is very error prone because there exists scala.Equals. > I just wasted a bunch of time debugging the optimizer because of this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2218) rename Equals to EqualsTo in Spark SQL expressions
[ https://issues.apache.org/jira/browse/SPARK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038500#comment-14038500 ] Reynold Xin commented on SPARK-2218: Michael has a PR here https://github.com/apache/spark/pull/734 It is not fully ready yet. > rename Equals to EqualsTo in Spark SQL expressions > -- > > Key: SPARK-2218 > URL: https://issues.apache.org/jira/browse/SPARK-2218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > The class name Equals is very error prone because there exists scala.Equals. > I just wasted a bunch of time debugging the optimizer because of this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2214) Broadcast Join (aka map join)
[ https://issues.apache.org/jira/browse/SPARK-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2214: --- Summary: Broadcast Join (aka map join) (was: MapSide Join) > Broadcast Join (aka map join) > - > > Key: SPARK-2214 > URL: https://issues.apache.org/jira/browse/SPARK-2214 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2215: --- Priority: Minor (was: Major) > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join key. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038497#comment-14038497 ] Reynold Xin commented on SPARK-2215: I personally find the multi-way join operator extremely complicated and am not sure it is the best idea. In Shark we implemented it, but I think there are only 2 people in this world who understand that code ... > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join key. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2216) Cost-based join reordering
[ https://issues.apache.org/jira/browse/SPARK-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038494#comment-14038494 ] Reynold Xin commented on SPARK-2216: The prerequisite of this change is to design the APIs for cardinality and size estimation for operators. > Cost-based join reordering > -- > > Key: SPARK-2216 > URL: https://issues.apache.org/jira/browse/SPARK-2216 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Cost-based join reordering -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2218) rename Equals to EqualsTo in Spark SQL expressions
Reynold Xin created SPARK-2218: -- Summary: rename Equals to EqualsTo in Spark SQL expressions Key: SPARK-2218 URL: https://issues.apache.org/jira/browse/SPARK-2218 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin The class name Equals is very error prone because there exists scala.Equals. I just wasted a bunch of time debugging the optimizer because of this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2217) When casting BigDecimal to Timestamp, BigDecimal.longValue() may be negative
Cheng Lian created SPARK-2217: - Summary: When casting BigDecimal to Timestamp, BigDecimal.longValue() may be negative Key: SPARK-2217 URL: https://issues.apache.org/jira/browse/SPARK-2217 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Please refer to this PR comment: https://github.com/apache/spark/pull/1143/files#discussion_r14007203 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2216) Cost-based join reordering
Cheng Hao created SPARK-2216: Summary: Cost-based join reordering Key: SPARK-2216 URL: https://issues.apache.org/jira/browse/SPARK-2216 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Cost-based join reordering -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2215) Multi-way join
Cheng Hao created SPARK-2215: Summary: Multi-way join Key: SPARK-2215 URL: https://issues.apache.org/jira/browse/SPARK-2215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Support the multi-way join (multiple table joins) in a single reduce stage if they have the same join key. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2214) MapSide Join
Cheng Hao created SPARK-2214: Summary: MapSide Join Key: SPARK-2214 URL: https://issues.apache.org/jira/browse/SPARK-2214 Project: Spark Issue Type: Sub-task Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2213) Sort Merge Join
Cheng Hao created SPARK-2213: Summary: Sort Merge Join Key: SPARK-2213 URL: https://issues.apache.org/jira/browse/SPARK-2213 Project: Spark Issue Type: Sub-task Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2212) HashJoin
Cheng Hao created SPARK-2212: Summary: HashJoin Key: SPARK-2212 URL: https://issues.apache.org/jira/browse/SPARK-2212 Project: Spark Issue Type: Sub-task Reporter: Cheng Hao Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2211) Join Optimization
Cheng Hao created SPARK-2211: Summary: Join Optimization Key: SPARK-2211 URL: https://issues.apache.org/jira/browse/SPARK-2211 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor This includes a couple of sub-tasks for Join Optimization in Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2210) cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)
Reynold Xin created SPARK-2210: -- Summary: cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0) Key: SPARK-2210 URL: https://issues.apache.org/jira/browse/SPARK-2210 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin {code} explain select cast(cast(key=0 as boolean) as boolean) aaa from src {code} should be {code} [Physical execution plan:] [Project [(key#10:0 = 0) AS aaa#7]] [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] {code} However, it is currently {code} [Physical execution plan:] [Project [NOT((key#10=0) = 0) AS aaa#7]] [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
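A fix would presumably eliminate the redundant cast during planning; below is a hedged sketch of such a rule in Catalyst style (names and packages follow Spark 1.0 conventions, but this is not the actual patch for SPARK-2210):

```scala
// Sketch: a cast of an expression to its own type is the identity,
// so `cast(boolean_expr as boolean)` can be replaced by the expression
// itself instead of being compiled into NOT((boolean_expr) = 0).
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object SimplifyRedundantCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Cast(child, dataType) if child.dataType == dataType => child
  }
}
```

With such a rule in the optimizer batch, the projection in the expected plan above, `(key#10:0 = 0) AS aaa#7`, would survive unchanged.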
[jira] [Commented] (SPARK-1949) Servlet 2.5 vs 3.0 conflict in SBT build
[ https://issues.apache.org/jira/browse/SPARK-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038441#comment-14038441 ] Andrew Ash commented on SPARK-1949: --- Sean's PR: https://github.com/apache/spark/pull/906 > Servlet 2.5 vs 3.0 conflict in SBT build > > > Key: SPARK-1949 > URL: https://issues.apache.org/jira/browse/SPARK-1949 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.0 >Reporter: Sean Owen >Priority: Minor > > [~kayousterhout] mentioned that: > {quote} > I had some trouble compiling an application (Shark) against Spark 1.0, > where Shark had a runtime exception (at the bottom of this message) because > it couldn't find the javax.servlet classes. SBT seemed to have trouble > downloading the servlet APIs that are dependencies of Jetty (used by the > Spark web UI), so I had to manually add them to the application's build > file: > libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224" > Not exactly sure why this happens but thought it might be useful in case > others run into the same problem. > {quote} > This is a symptom of Servlet API conflict which we battled in the Maven > build. The resolution is to nix Servlet 2.5 and odd old Jetty / Netty 3.x > dependencies. It looks like the Hive part of the assembly in the SBT build > doesn't exclude all these entirely. > I'll open a suggested PR to band-aid the SBT build. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038414#comment-14038414 ] Patrick Wendell commented on SPARK-2208: A hotfix was merged here, but we should really fix the test: https://github.com/apache/spark/pull/1141 > local metrics tests can fail on fast machines > - > > Key: SPARK-2208 > URL: https://issues.apache.org/jira/browse/SPARK-2208 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell > Labels: starter > > I'm temporarily disabling this check. I think the issue is that on fast > machines the fetch wait time can actually be zero, even across all tasks. > We should see if we can write this in a different way to make sure there is a > delay. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop
[ https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038328#comment-14038328 ] Mark Grover commented on SPARK-1209: ok, I will take over. Thanks Sandy. > SparkHadoopUtil should not use package org.apache.hadoop > > > Key: SPARK-1209 > URL: https://issues.apache.org/jira/browse/SPARK-1209 > Project: Spark > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Sandy Pérez González >Assignee: Mark Grover > > It's private, so the change won't break compatibility -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable
[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038307#comment-14038307 ] Raymond Liu commented on SPARK-768: --- And for case 2, the problem is that the current code does not seem to distinguish between a NotSerializableException thrown while fetching a remote block during computation and one thrown while serializing the task result. It treats both as the task result not being serializable and aborts the whole taskset, so the job will fail in the end, I think. Is this what you mean by hanging? > Fail a task when the remote block it is fetching is not serializable > > > Key: SPARK-768 > URL: https://issues.apache.org/jira/browse/SPARK-768 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > > When a task is fetching a remote block (e.g. locality wait exceeded), and if > the block is not serializable, the task would hang. > The block manager should fail the task instead of hanging the task ... once > the task fails, eventually it will get scheduled to the local node to be > executed successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2209) Cast shouldn't do null check twice
[ https://issues.apache.org/jira/browse/SPARK-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038295#comment-14038295 ] Reynold Xin commented on SPARK-2209: https://github.com/apache/spark/pull/1143 > Cast shouldn't do null check twice > -- > > Key: SPARK-2209 > URL: https://issues.apache.org/jira/browse/SPARK-2209 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.0.1, 1.1.0 > > > Cast does two null checks, one in eval and another one in the function > returned by nullOrCast. It's best to get rid of the one in nullOrCast (since > eval will be the more common code path). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2209) Cast shouldn't do null check twice
Reynold Xin created SPARK-2209: -- Summary: Cast shouldn't do null check twice Key: SPARK-2209 URL: https://issues.apache.org/jira/browse/SPARK-2209 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.0.1, 1.1.0 Cast does two null checks, one in eval and another one in the function returned by nullOrCast. It's best to get rid of the one in nullOrCast (since eval will be the more common code path). -- This message was sent by Atlassian JIRA (v6.2#6252)
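The single-null-check structure described above can be sketched as follows; the names are simplified and hypothetical, not the real Cast implementation:

```scala
// Hedged sketch of the proposed shape: eval performs the only null check,
// and the cast function (built once, as by nullOrCast) may assume its
// input is non-null instead of re-checking on every call.
abstract class CastSketch(child: Any => Any, castFunc: Any => Any) {
  def eval(input: Any): Any = {
    val evaluated = child(input)
    if (evaluated == null) null   // the single null check, on the hot path
    else castFunc(evaluated)      // castFunc no longer re-checks for null
  }
}
```

This keeps the branch in eval, the common code path, and removes the redundant branch from the closure returned by nullOrCast.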
[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable
[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038288#comment-14038288 ] Raymond Liu commented on SPARK-768: --- Hi Reynold, I am trying to figure out this issue. Here is my understanding: when the situation you mentioned happens, it means the block is stored at a memory storage level without serialization; otherwise, the exception would already have been thrown in previous steps. Under this condition, I can see two cases that might run into this problem: 1. The RDD is cached in memory and, as you mentioned, the task gets run on another node. In this case, it seems to me that the BlockManager's remote fetch operation will catch the exception in ConnectionManager and return None to CacheManager, and the task then goes down the compute code path. Although this leads to overcomputation and a second copy of the block being stored, it does not hang the task, and the job eventually gets done. I have written some test cases to verify this. For this case, we might find some way to optimize it? 2. You are using BlockRDD in the DStream case with a memory storage level. Then, upon computing the BlockRDD on another node, the exception is thrown, but in this case I think the task executor will catch the exception and fail the task? So either case, it seems to me, will eventually finish the job. I am wondering what kind of case I am missing that would lead to the task hanging. Can you kindly give me an example? > Fail a task when the remote block it is fetching is not serializable > > > Key: SPARK-768 > URL: https://issues.apache.org/jira/browse/SPARK-768 > Project: Spark > Issue Type: Bug >Reporter: Reynold Xin >Assignee: Reynold Xin > > When a task is fetching a remote block (e.g. locality wait exceeded), and if > the block is not serializable, the task would hang. > The block manager should fail the task instead of hanging the task ... 
once > the task fails, eventually it will get scheduled to the local node to be > executed successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038223#comment-14038223 ] Neville Li edited comment on SPARK-2200 at 6/20/14 1:23 AM: With 0.7 the error went away when reference tracking is set to true. With 0.8.1 it works either way. Turns out in 0.7 the recursive references were caused by this: {code} private final val innerUpdate: ((Int,E) => Unit) = if ((offset == 0) && (stride == 1)) { (i:Int,v:E) => {data(i) = v} } else {(i:Int,v:E) => {data(offset+i*stride)=v} } {code} The function val has a closure $outer that references itself. It was removed in 0.8.1. was (Author: sinisa_lyh): With 0.7 the error went away when reference tracking is set to true. With 0.8.1 it works either way. Turns out in 0.7 the recursive references was caused by this: private final val innerUpdate: ((Int,E) => Unit) = if ((offset == 0) && (stride == 1)) { (i:Int,v:E) => {data(i) = v} } else {(i:Int,v:E) => {data(offset+i*stride)=v} } The function val has an closure $outer that references itself. It was removed in 0.8.1. > breeze DenseVector not serializable with KryoSerializer > --- > > Key: SPARK-2200 > URL: https://issues.apache.org/jira/browse/SPARK-2200 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector > with KryoSerializer throws the following stack trace. Looks like some > recursive field in the object. Upgrading to 0.8.1 solved this. 
> {code} > java.lang.StackOverflowError > at java.lang.reflect.Field.getDeclaringClass(Field.java:154) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > ... > {code} > Code to reproduce: > {code} > import breeze.linalg.DenseVector > import org.apache.spark.SparkConf > import org.apache.spark.serializer.KryoSerializer > object SerializerTest { > def main(args: Array[String]) { > val conf = new SparkConf() > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.kryo.registrator", classOf[MyRegistrator].getName) > .set("spark.kryo.referenceTracking", "false") > .set("spark.kryoserializer.buffer.mb", "8") > val serializer = new KryoSerializer(conf).newInstance() > serializer.serialize(DenseVector.rand(10)) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
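The fix in breeze 0.8.1 removed the self-reference, and Kryo's reference tracking tolerates such cycles by visiting each object identity at most once. A small stand-alone Java sketch of that idea (illustrative only, not Kryo's actual implementation):

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class ReferenceTracking {
    // A node whose 'ref' can point back to itself, mimicking the
    // closure's $outer self-reference described above.
    static class Node {
        Node ref;
    }

    // Counts objects reachable from node. With identity tracking each
    // object is visited once; without the 'seen' set, a cycle would
    // recurse forever (the StackOverflowError in the report).
    static int visit(Node node, Set<Node> seen) {
        if (node == null || !seen.add(node)) return 0;
        return 1 + visit(node.ref, seen);
    }

    public static void main(String[] args) {
        Node n = new Node();
        n.ref = n; // self-reference, as in the breeze 0.7 closure
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        System.out.println(visit(n, seen)); // terminates: cycle detected, no overflow
    }
}
```

This is why setting spark.kryo.referenceTracking to true masked the bug under breeze 0.7: the tracker short-circuits the cycle that the FieldSerializer otherwise recurses into.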
[jira] [Commented] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038223#comment-14038223 ] Neville Li commented on SPARK-2200: --- With 0.7 the error went away when reference tracking is set to true. With 0.8.1 it works either way. Turns out in 0.7 the recursive references were caused by this: private final val innerUpdate: ((Int,E) => Unit) = if ((offset == 0) && (stride == 1)) { (i:Int,v:E) => {data(i) = v} } else {(i:Int,v:E) => {data(offset+i*stride)=v} } The function val has a closure $outer that references itself. It was removed in 0.8.1. > breeze DenseVector not serializable with KryoSerializer > --- > > Key: SPARK-2200 > URL: https://issues.apache.org/jira/browse/SPARK-2200 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector > with KryoSerializer throws the following stack trace. Looks like some > recursive field in the object. Upgrading to 0.8.1 solved this. > {code} > java.lang.StackOverflowError > at java.lang.reflect.Field.getDeclaringClass(Field.java:154) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > ... 
> {code} > Code to reproduce: > {code} > import breeze.linalg.DenseVector > import org.apache.spark.SparkConf > import org.apache.spark.serializer.KryoSerializer > object SerializerTest { > def main(args: Array[String]) { > val conf = new SparkConf() > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.kryo.registrator", classOf[MyRegistrator].getName) > .set("spark.kryo.referenceTracking", "false") > .set("spark.kryoserializer.buffer.mb", "8") > val serializer = new KryoSerializer(conf).newInstance() > serializer.serialize(DenseVector.rand(10)) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2208) local metrics tests can fail on fast machines
[ https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2208: --- Labels: starter (was: ) > local metrics tests can fail on fast machines > - > > Key: SPARK-2208 > URL: https://issues.apache.org/jira/browse/SPARK-2208 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell > Labels: starter > > I'm temporarily disabling this check. I think the issue is that on fast > machines the fetch wait time can actually be zero, even across all tasks. > We should see if we can write this in a different way to make sure there is a > delay. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2208) local metrics tests can fail on fast machines
Patrick Wendell created SPARK-2208: -- Summary: local metrics tests can fail on fast machines Key: SPARK-2208 URL: https://issues.apache.org/jira/browse/SPARK-2208 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell I'm temporarily disabling this check. I think the issue is that on fast machines the fetch wait time can actually be zero, even across all tasks. We should see if we can write this in a different way to make sure there is a delay. -- This message was sent by Atlassian JIRA (v6.2#6252)
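The flakiness described above can be reproduced outside Spark: with a millisecond-resolution clock, a fast operation routinely measures exactly zero, so a `> 0` assertion on fetch wait time is inherently racy. A minimal Java sketch (names are illustrative):

```java
public class ZeroWaitTime {
    // Measures an operation with a millisecond clock, as the local
    // metrics test effectively does for fetch wait time.
    static long elapsedMs(Runnable op) {
        long start = System.currentTimeMillis();
        op.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long ms = elapsedMs(() -> { /* near-instant fetch on a fast machine */ });
        // Asserting ms > 0 here is flaky: on a fast machine ms is very
        // often exactly 0. Only ms >= 0 is safe without injecting a delay.
        System.out.println(ms >= 0);
    }
}
```

This matches the suggested fix direction: either relax the assertion or deliberately inject a delay so the measured wait is guaranteed to be non-zero.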
[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution
[ https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038200#comment-14038200 ] Patrick Wendell commented on SPARK-2192: It might be good to have all the example data in src/main/resources. > Examples Data Not in Binary Distribution > > > Key: SPARK-2192 > URL: https://issues.apache.org/jira/browse/SPARK-2192 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.0 >Reporter: Pat McDonough > > The data used by examples is not packaged up with the binary distribution. > The data subdirectory of spark should make its way into the distribution > somewhere so the examples can use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038156#comment-14038156 ] Suren Hiraman commented on SPARK-2202: -- Will do tomorrow. Interesting problem. > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. > My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at 
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2151) spark-submit issue (int format expected for memory parameter)
[ https://issues.apache.org/jira/browse/SPARK-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2151: --- Description: Get this exception when invoking spark-submit in standalone cluster mode: {code} Exception in thread "main" java.lang.NumberFormatException: For input string: "38g" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55) at org.apache.spark.deploy.ClientArguments.(ClientArguments.scala:47) at org.apache.spark.deploy.Client$.main(Client.scala:148) at org.apache.spark.deploy.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} was: Get this exception when invoking spark-submit in standalone cluster mode: Exception in thread "main" java.lang.NumberFormatException: For input string: "38g" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55) at 
org.apache.spark.deploy.ClientArguments.(ClientArguments.scala:47) at org.apache.spark.deploy.Client$.main(Client.scala:148) at org.apache.spark.deploy.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > spark-submit issue (int format expected for memory parameter) > - > > Key: SPARK-2151 > URL: https://issues.apache.org/jira/browse/SPARK-2151 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Nishkam Ravi > Fix For: 1.0.1, 1.1.0 > > > Get this exception when invoking spark-submit in standalone cluster mode: > {code} > Exception in thread "main" java.lang.NumberFormatException: For input string: > "38g" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:492) > at java.lang.Integer.parseInt(Integer.java:527) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) > at > org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55) > at > org.apache.spark.deploy.ClientArguments.(ClientArguments.scala:47) > at org.apache.spark.deploy.Client$.main(Client.scala:148) > at org.apache.spark.deploy.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
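The stack trace shows ClientArguments calling toInt directly on a memory string such as "38g". A sketch of suffix-aware parsing (illustrative only; this is not the actual patch that fixed SPARK-2151, and the helper name is hypothetical):

```java
public class MemoryArg {
    // Parses memory strings like "38g", "512m", or a bare MB count ("512")
    // into megabytes. Plain Integer.parseInt("38g") throws the
    // NumberFormatException shown in the stack trace above.
    static int toMb(String s) {
        String v = s.trim().toLowerCase();
        if (v.endsWith("g")) {
            return Integer.parseInt(v.substring(0, v.length() - 1)) * 1024;
        } else if (v.endsWith("m")) {
            return Integer.parseInt(v.substring(0, v.length() - 1));
        }
        return Integer.parseInt(v); // bare number, assumed to be MB
    }

    public static void main(String[] args) {
        System.out.println(toMb("38g"));  // 38912
        System.out.println(toMb("512m")); // 512
    }
}
```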
[jira] [Resolved] (SPARK-2151) spark-submit issue (int format expected for memory parameter)
[ https://issues.apache.org/jira/browse/SPARK-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2151. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Nishkam Ravi > spark-submit issue (int format expected for memory parameter) > - > > Key: SPARK-2151 > URL: https://issues.apache.org/jira/browse/SPARK-2151 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Nishkam Ravi >Assignee: Nishkam Ravi > Fix For: 1.0.1, 1.1.0 > > > Get this exception when invoking spark-submit in standalone cluster mode: > {code} > Exception in thread "main" java.lang.NumberFormatException: For input string: > "38g" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:492) > at java.lang.Integer.parseInt(Integer.java:527) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) > at > org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55) > at > org.apache.spark.deploy.ClientArguments.(ClientArguments.scala:47) > at org.apache.spark.deploy.Client$.main(Client.scala:148) > at org.apache.spark.deploy.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038101#comment-14038101 ] Patrick Wendell commented on SPARK-2202: Yes, please do! > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. > My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at 
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastien Rainville updated SPARK-2204: --- Summary: Scheduler for Mesos in fine-grained mode launches tasks on wrong executors (was: Scheduler for Mesos in fine-grained mode launches tasks on random executors) > Scheduler for Mesos in fine-grained mode launches tasks on wrong executors > -- > > Key: SPARK-2204 > URL: https://issues.apache.org/jira/browse/SPARK-2204 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Sebastien Rainville >Priority: Blocker > > MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is > assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning > task lists in the same order as the offers it was passed, but in the current > implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid > assigning the tasks always to the same executors. The result is that the > tasks are launched on random executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
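The underlying bug is positional pairing of results with a list that another component has shuffled. A small Java sketch of the order-independent alternative, keying assignments by offer id (illustrative only, not the Mesos scheduler backend code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class OfferOrdering {
    // Assigns one task per offer. Pairing results back to offers by list
    // position is wrong once the scheduler shuffles the offers (as
    // TaskSchedulerImpl.resourceOffers does); keying by offer id is
    // immune to reordering.
    static Map<String, String> assignById(List<String> offerIds) {
        List<String> shuffled = new ArrayList<>(offerIds);
        Collections.shuffle(shuffled, new Random(42)); // scheduler randomizes order
        Map<String, String> taskByOffer = new HashMap<>();
        for (String id : shuffled) {
            taskByOffer.put(id, "task-for-" + id); // keyed, not positional
        }
        return taskByOffer;
    }

    public static void main(String[] args) {
        Map<String, String> m = assignById(Arrays.asList("o1", "o2", "o3"));
        System.out.println(m.get("o2")); // correct executor regardless of shuffle order
    }
}
```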
[jira] [Updated] (SPARK-1545) Add Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-1545: --- Target Version/s: 1.1.0 > Add Random Forest algorithm to MLlib > > > Key: SPARK-1545 > URL: https://issues.apache.org/jira/browse/SPARK-1545 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > > This task requires adding Random Forest support to Spark MLlib. The > implementation needs to adapt the classic algorithm to the scalable tree > implementation. > The tasks involve: > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-1536: --- Target Version/s: 1.1.0 > Add multiclass classification support to MLlib > -- > > Key: SPARK-1536 > URL: https://issues.apache.org/jira/browse/SPARK-1536 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 0.9.0 >Reporter: Manish Amde >Assignee: Manish Amde > > The current decision tree implementation in MLlib only supports binary > classification. This task involves adding multiclass classification support > to the decision tree implementation. > The tasks involve: > - Choosing a good strategy for multiclass classification among multiple > options: > -- add multiclass support to impurity, but it won't work well with the > categorical features since the centroid-based ordering assumptions won't hold > true > -- error-correcting output codes > -- one-vs-all > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1546) Add AdaBoost algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-1546: --- Affects Version/s: (was: 1.0.0) 1.1.0 > Add AdaBoost algorithm to Spark MLlib > - > > Key: SPARK-1546 > URL: https://issues.apache.org/jira/browse/SPARK-1546 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Manish Amde >Assignee: Manish Amde > > This task requires adding the AdaBoost algorithm to Spark MLlib. The > implementation needs to adapt the classic AdaBoost algorithm to the scalable > tree implementation. > The tasks involve: > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-1547: --- Target Version/s: 1.1.0 > Add gradient boosting algorithm to MLlib > > > Key: SPARK-1547 > URL: https://issues.apache.org/jira/browse/SPARK-1547 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > > This task requires adding the gradient boosting algorithm to Spark MLlib. The > implementation needs to adapt the gradient boosting algorithm to the scalable > tree implementation. > The tasks involve: > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2207) Add minimum information gain and minimum instances per node as training parameters for decision tree.
[ https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2207: - Assignee: Manish Amde > Add minimum information gain and minimum instances per node as training > parameters for decision tree. > - > > Key: SPARK-2207 > URL: https://issues.apache.org/jira/browse/SPARK-2207 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification
[ https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2206: - Assignee: Manish Amde > Automatically infer the number of classification classes in multiclass > classification > - > > Key: SPARK-2206 > URL: https://issues.apache.org/jira/browse/SPARK-2206 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > > Currently, the user needs to specify the numClassesForClassification > parameter explicitly during multiclass classification for decision trees. > This feature will automatically infer this information (and possibly class > histograms) from the training data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037979#comment-14037979 ] Suren Hiraman commented on SPARK-2202: -- So it turns out that when we remove all of our custom settings (leaving only executor memory and default parallelism), the flow completes. Would you like me to re-run with the above settings and provide you with JStack output? > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. 
> My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2207) Add minimum information gain and minimum instances per node as training parameters for decision tree.
[ https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-2207: --- Summary: Add minimum information gain and minimum instances per node as training parameters for decision tree. (was: Add minimum info gain and min instances per node as training parameters for decision tree) > Add minimum information gain and minimum instances per node as training > parameters for decision tree. > - > > Key: SPARK-2207 > URL: https://issues.apache.org/jira/browse/SPARK-2207 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2207) Add minimum info gain and min instances per node as training parameters for decision tree
[ https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-2207: --- Target Version/s: 1.1.0 > Add minimum info gain and min instances per node as training parameters for > decision tree > - > > Key: SPARK-2207 > URL: https://issues.apache.org/jira/browse/SPARK-2207 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification
[ https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde updated SPARK-2206: --- Target Version/s: 1.1.0 Affects Version/s: 1.0.0 > Automatically infer the number of classification classes in multiclass > classification > - > > Key: SPARK-2206 > URL: https://issues.apache.org/jira/browse/SPARK-2206 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde > > Currently, the user needs to specify the numClassesForClassification > parameter explicitly during multiclass classification for decision trees. > This feature will automatically infer this information (and possibly class > histograms) from the training data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2207) Add minimum info gain and min instances per node as training parameters for decision tree
Manish Amde created SPARK-2207: -- Summary: Add minimum info gain and min instances per node as training parameters for decision tree Key: SPARK-2207 URL: https://issues.apache.org/jira/browse/SPARK-2207 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification
Manish Amde created SPARK-2206: -- Summary: Automatically infer the number of classification classes in multiclass classification Key: SPARK-2206 URL: https://issues.apache.org/jira/browse/SPARK-2206 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Manish Amde Currently, the user needs to specify the numClassesForClassification parameter explicitly during multiclass classification for decision trees. This feature will automatically infer this information (and possibly class histograms) from the training data. -- This message was sent by Atlassian JIRA (v6.2#6252)
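Inferring the class count amounts to a single pass over the training labels. A minimal Java sketch of the proposed behavior, including the class histogram mentioned above (illustrative only; the eventual MLlib API may differ):

```java
import java.util.Map;
import java.util.TreeMap;

public class InferNumClasses {
    // Builds a class histogram from training labels; its size is the
    // inferred number of classes, replacing an explicit
    // numClassesForClassification parameter.
    static Map<Double, Integer> classHistogram(double[] labels) {
        Map<Double, Integer> hist = new TreeMap<>();
        for (double l : labels) hist.merge(l, 1, Integer::sum);
        return hist;
    }

    public static void main(String[] args) {
        double[] labels = {0.0, 1.0, 2.0, 1.0, 0.0, 2.0, 2.0};
        Map<Double, Integer> hist = classHistogram(labels);
        System.out.println(hist.size()); // inferred class count
    }
}
```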
[jira] [Closed] (SPARK-1544) Add support for deep decision trees.
[ https://issues.apache.org/jira/browse/SPARK-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish Amde closed SPARK-1544. -- The PR has been accepted. > Add support for deep decision trees. > > > Key: SPARK-1544 > URL: https://issues.apache.org/jira/browse/SPARK-1544 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > Fix For: 1.0.0 > > > The current tree implementation stores an Array[Double] of size O(#features * > #splits * 2^maxDepth) in memory for aggregating histograms over partitions. > The current implementation might not scale to very deep trees since the > memory requirement grows exponentially with tree depth. > This task enables construction of arbitrarily deep trees. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037909#comment-14037909 ] Yin Huai commented on SPARK-2205: - The cause of this bug is that in HashJoin, outputPartitioning returns the output partitioning of its left child. > Unnecessary exchange operators in a join on multiple tables with the same > join key. > --- > > Key: SPARK-2205 > URL: https://issues.apache.org/jira/browse/SPARK-2205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (y.key=z.key)") > SchemaRDD[1] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] > HashJoin [key#6], [key#8], BuildRight > Exchange (HashPartitioning [key#6], 200) >HashJoin [key#4], [key#6], BuildRight > Exchange (HashPartitioning [key#4], 200) > HiveTableScan [key#4,value#5], (MetastoreRelation default, src, > Some(x)), None > Exchange (HashPartitioning [key#6], 200) > HiveTableScan [key#6,value#7], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#8], 200) >HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), > None > {code} > However, this is fine... 
> {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (x.key=z.key)") > res5: org.apache.spark.sql.SchemaRDD = > SchemaRDD[5] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] > HashJoin [key#26], [key#30], BuildRight > HashJoin [key#26], [key#28], BuildRight >Exchange (HashPartitioning [key#26], 200) > HiveTableScan [key#26,value#27], (MetastoreRelation default, src, > Some(x)), None >Exchange (HashPartitioning [key#28], 200) > HiveTableScan [key#28,value#29], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#30], 200) >HiveTableScan [key#30,value#31], (MetastoreRelation default, src, > Some(z)), None > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
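The planner behavior behind the two plans above can be sketched with a toy model (not Catalyst code; names are illustrative). An Exchange is inserted whenever a child's reported output partitioning does not match the keys the parent join requires, so a HashJoin that reports only its left child's keys forces a redundant shuffle when the next join keys on the right side's equivalent column:

```python
def needs_exchange(required_keys, child_partitioning):
    """Toy planner rule: insert an Exchange unless the child is already
    hash-partitioned on exactly the required join keys."""
    return set(child_partitioning) != set(required_keys)

# HashJoin(x.key = y.key) reports only its left child's partitioning:
join_xy_partitioning = {"x.key"}

# Joining the result on y.key (same values as x.key after the equi-join)
# still triggers a shuffle, because the reported keys don't match:
redundant = needs_exchange({"y.key"}, join_xy_partitioning)

# Joining on x.key matches the reported partitioning, so no shuffle:
fine = needs_exchange({"x.key"}, join_xy_partitioning)
```

This matches the observation in the comment: the extra Exchange in the first plan is unnecessary only because the planner does not know x.key and y.key are interchangeable after the first join.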
[jira] [Commented] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections
[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037904#comment-14037904 ] Henry Saputra commented on SPARK-704: - Trying to reproduce and understand the issue. After a new SendingConnection is created, it creates its own channel and then registers with the ConnectionManager#selector to listen for state changes. When a SendingConnection is asked to send a message, it calls Connection#registerInterest to become ready for writing. Detecting whether a SendingConnection is disconnected will then be done when there is an attempt to write to the channel, which will throw an exception; I believe that should be sufficient for the purpose of the issue? Just want to clarify if I understand the problem correctly. > ConnectionManager sometimes cannot detect loss of sending connections > - > > Key: SPARK-704 > URL: https://issues.apache.org/jira/browse/SPARK-704 > Project: Spark > Issue Type: Bug >Reporter: Charles Reiss >Assignee: Henry Saputra > > ConnectionManager currently does not detect when SendingConnections > disconnect except if it is trying to send through them. As a result, a node > failure just after a connection is initiated but before any acknowledgement > messages can be sent may result in a hang. > ConnectionManager has code intended to detect this case by detecting the > failure of a corresponding ReceivingConnection, but this code assumes that > the remote host:port of the ReceivingConnection is the same as the > ConnectionManagerId, which is almost never true. Additionally, there does not > appear to be any reason to assume a corresponding ReceivingConnection will > exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037903#comment-14037903 ] Sebastien Rainville commented on SPARK-2204: Created PR: https://github.com/apache/spark/pull/1140 > Scheduler for Mesos in fine-grained mode launches tasks on random executors > --- > > Key: SPARK-2204 > URL: https://issues.apache.org/jira/browse/SPARK-2204 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Sebastien Rainville >Priority: Blocker > > MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is > assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning > task lists in the same order as the offers it was passed, but in the current > implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid > assigning the tasks always to the same executors. The result is that the > tasks are launched on random executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2191) Double execution with CREATE TABLE AS SELECT
[ https://issues.apache.org/jira/browse/SPARK-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2191. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Michael Armbrust > Double execution with CREATE TABLE AS SELECT > > > Key: SPARK-2191 > URL: https://issues.apache.org/jira/browse/SPARK-2191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.0.1, 1.1.0 > > > Reproduction: > {code} > scala> hql("CREATE TABLE foo AS select unix_timestamp() from src limit > 1").collect() > res5: Array[org.apache.spark.sql.Row] = Array() > scala> hql("SELECT * FROM foo").collect() > res6: Array[org.apache.spark.sql.Row] = Array([1403159129], [1403159130]) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
Yin Huai created SPARK-2205: --- Summary: Unnecessary exchange operators in a join on multiple tables with the same join key. Key: SPARK-2205 URL: https://issues.apache.org/jira/browse/SPARK-2205 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai {code} hql("select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)") SchemaRDD[1] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] HashJoin [key#6], [key#8], BuildRight Exchange (HashPartitioning [key#6], 200) HashJoin [key#4], [key#6], BuildRight Exchange (HashPartitioning [key#4], 200) HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#6], 200) HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#8], 200) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None {code} However, this is fine... {code} hql("select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)") res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[5] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] HashJoin [key#26], [key#30], BuildRight HashJoin [key#26], [key#28], BuildRight Exchange (HashPartitioning [key#26], 200) HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#28], 200) HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#30], 200) HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastien Rainville updated SPARK-2204: --- Description: MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on random executors. (was: MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on random executors.6) > Scheduler for Mesos in fine-grained mode launches tasks on random executors > --- > > Key: SPARK-2204 > URL: https://issues.apache.org/jira/browse/SPARK-2204 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Sebastien Rainville >Priority: Blocker > > MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is > assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning > task lists in the same order as the offers it was passed, but in the current > implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid > assigning the tasks always to the same executors. The result is that the > tasks are launched on random executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastien Rainville updated SPARK-2204: --- Fix Version/s: (was: 1.0.1) > Scheduler for Mesos in fine-grained mode launches tasks on random executors > --- > > Key: SPARK-2204 > URL: https://issues.apache.org/jira/browse/SPARK-2204 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Sebastien Rainville >Priority: Blocker > > MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is > assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning > task lists in the same order as the offers it was passed, but in the current > implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid > assigning the tasks always to the same executors. The result is that the > tasks are launched on random executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1800) Add broadcast hash join operator
[ https://issues.apache.org/jira/browse/SPARK-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037880#comment-14037880 ] Yin Huai commented on SPARK-1800: - Maybe add an improvement in the future so that tasks on the same node can share those hashtables. Also, if we have a star join, maybe we want to limit the total size of those hashtables, so they do not occupy too much space? > Add broadcast hash join operator > > > Key: SPARK-1800 > URL: https://issues.apache.org/jira/browse/SPARK-1800 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors
Sebastien Rainville created SPARK-2204: -- Summary: Scheduler for Mesos in fine-grained mode launches tasks on random executors Key: SPARK-2204 URL: https://issues.apache.org/jira/browse/SPARK-2204 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Sebastien Rainville Priority: Blocker Fix For: 1.0.1 MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on random executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
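The positional-matching bug can be illustrated with a toy scheduler (hypothetical names; not the actual Spark or Mesos API). If the inner scheduler shuffles offers before building task lists, the caller must map results back by offer id rather than by list position:

```python
import random

def inner_resource_offers(offers, seed=0):
    """Stand-in for TaskSchedulerImpl.resourceOffers: shuffles the
    offers to spread load across executors, then returns
    (offer_id, task) pairs in the *shuffled* order."""
    shuffled = list(offers)
    random.Random(seed).shuffle(shuffled)
    return [(offer["id"], "task-for-" + offer["id"]) for offer in shuffled]

def launch_tasks(offers):
    """Fixed backend behavior: match each task back to its offer by id,
    never by position, so shuffling inside the scheduler is harmless."""
    assignments = dict(inner_resource_offers(offers))
    return {offer["id"]: assignments[offer["id"]] for offer in offers}

offers = [{"id": "o1"}, {"id": "o2"}, {"id": "o3"}]
launched = launch_tasks(offers)
```

Relying on position, as the buggy backend did, would pair a task built for one shuffled offer with a different executor's offer.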
[jira] [Commented] (SPARK-2177) describe table result contains only one column
[ https://issues.apache.org/jira/browse/SPARK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037809#comment-14037809 ] Yin Huai commented on SPARK-2177: - Generally, Hive generates the results of DDL statements as plain text (unless we use "set hive.ddl.output.format=json"). It is not easy to parse those plain strings, and I do not think it is a good idea to reverse-engineer how Hive works for every describe command and write our own code to generate exactly the same output. With the changes made in this PR, Spark SQL can support a subset of describe commands which are commonly used. This subset is defined by {code} DESCRIBE [EXTENDED] [db_name.]table_name {code} All other cases are still treated as native commands. > describe table result contains only one column > -- > > Key: SPARK-2177 > URL: https://issues.apache.org/jira/browse/SPARK-2177 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Yin Huai > > {code} > scala> hql("describe src").collect().foreach(println) > [key string None] > [valuestring None] > {code} > The result should contain 3 columns instead of one. This screws up JDBC or > even the downstream consumer of the Scala/Java/Python APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2177) describe table result contains only one column
[ https://issues.apache.org/jira/browse/SPARK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037810#comment-14037810 ] Yin Huai commented on SPARK-2177: - We should also put what cases we support in the release note. But, where is that field? > describe table result contains only one column > -- > > Key: SPARK-2177 > URL: https://issues.apache.org/jira/browse/SPARK-2177 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Yin Huai > > {code} > scala> hql("describe src").collect().foreach(println) > [key string None] > [valuestring None] > {code} > The result should contain 3 columns instead of one. This screws up JDBC or > even the downstream consumer of the Scala/Java/Python APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2203) PySpark does not infer default numPartitions in same way as Spark
Aaron Davidson created SPARK-2203: - Summary: PySpark does not infer default numPartitions in same way as Spark Key: SPARK-2203 URL: https://issues.apache.org/jira/browse/SPARK-2203 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster. In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark. -- This message was sent by Atlassian JIRA (v6.2#6252)
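The difference between the two defaults can be sketched as follows (simplified logic with hypothetical helper names; the real code lives in Scala's Partitioner#defaultPartitioner and PySpark's rdd.py):

```python
def scala_default_partitions(parent_partition_counts, default_parallelism=None):
    """Mirror of Spark's Partitioner#defaultPartitioner heuristic:
    reuse the largest parent RDD's partition count unless
    spark.default.parallelism is explicitly set."""
    if default_parallelism is not None:
        return default_parallelism
    return max(parent_partition_counts)

def pyspark_default_partitions(parent_partition_counts, ctx_default_parallelism):
    """PySpark's behavior at the time of this issue: always use
    ctx.defaultParallelism, ignoring the parents' partitioning."""
    return ctx_default_parallelism

# A 1000-partition map side on an 8-core cluster with no explicit setting:
scala = scala_default_partitions([1000], default_parallelism=None)
py = pyspark_default_partitions([1000], ctx_default_parallelism=8)
```

With only 8 reduce partitions for 1000 map partitions, each reducer receives roughly 125 partitions' worth of data, which is the OOM risk the issue describes.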
[jira] [Commented] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface
[ https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037746#comment-14037746 ] Nan Zhu commented on SPARK-2126: [~pwendell] Yes, [~markhamstra] just emailed me. Yes, I have been working on it for two evenings; it's a big change and I haven't made any significant changes yet, so I don't mind if a core developer comes in to lead this, and I'm still willing to contribute anything I can. > Move MapOutputTracker behind ShuffleManager interface > - > > Key: SPARK-2126 > URL: https://issues.apache.org/jira/browse/SPARK-2126 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Matei Zaharia >Assignee: Nan Zhu > > This will require changing the interface between the DAGScheduler and > MapOutputTracker to be method calls on the ShuffleManager instead. However, > it will make it easier to do push-based shuffle and other ideas requiring > changes to map output tracking. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2038) Don't shadow "conf" variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037739#comment-14037739 ] Nan Zhu commented on SPARK-2038: [~pwendell] Yeah, it's a good idea, just submit a new PR: https://github.com/apache/spark/pull/1137 > Don't shadow "conf" variable in saveAsHadoop functions > -- > > Key: SPARK-2038 > URL: https://issues.apache.org/jira/browse/SPARK-2038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Patrick Wendell >Assignee: Nan Zhu >Priority: Minor > Labels: api-breaking > Fix For: 1.1.0 > > > This could lead to a lot of bugs. We should just change it to hadoopConf. I > noticed this when reviewing SPARK-1677. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-2038) Don't shadow "conf" variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-2038: > Don't shadow "conf" variable in saveAsHadoop functions > -- > > Key: SPARK-2038 > URL: https://issues.apache.org/jira/browse/SPARK-2038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Patrick Wendell >Assignee: Nan Zhu >Priority: Minor > Labels: api-breaking > Fix For: 1.1.0 > > > This could lead to a lot of bugs. We should just change it to hadoopConf. I > noticed this when reviewing SPARK-1677. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2038) Don't shadow "conf" variable in saveAsHadoop functions
[ https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037703#comment-14037703 ] Patrick Wendell commented on SPARK-2038: Hey [~CodingCat] - I realized there is actually an intermediate fix. Don't change the name of the method argument, but inside of the method immediately do `val hadoopConf = conf` then add a comment that it's to avoid naming collision. So I think you could still submit your patch with that change. Does that make sense? > Don't shadow "conf" variable in saveAsHadoop functions > -- > > Key: SPARK-2038 > URL: https://issues.apache.org/jira/browse/SPARK-2038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Patrick Wendell >Assignee: Nan Zhu >Priority: Minor > Labels: api-breaking > Fix For: 1.1.0 > > > This could lead to a lot of bugs. We should just change it to hadoopConf. I > noticed this when reviewing SPARK-1677. -- This message was sent by Atlassian JIRA (v6.2#6252)
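The intermediate fix Patrick describes can be rendered in Python for illustration (the actual change is in Scala, and these names are hypothetical): keep the public parameter name `conf` so the API doesn't break, but immediately alias it inside the method so nested scopes cannot shadow it.

```python
def save_as_hadoop_file(path, conf=None):
    # Alias the argument immediately to avoid a naming collision with
    # inner-scope variables also called `conf` (mirrors the suggested
    # Scala fix `val hadoopConf = conf`).
    hadoop_conf = dict(conf or {})

    def write_partition(records):
        # Inner code refers to hadoop_conf; a local variable named
        # `conf` here can no longer silently shadow the argument.
        return (path, hadoop_conf.get("codec", "none"), list(records))

    return write_partition(["a", "b"])

result = save_as_hadoop_file("/tmp/out", {"codec": "gzip"})
```

The design point is the same in both languages: an API-preserving rename inside the method body, plus a comment explaining why, is enough to eliminate the shadowing bug class without the `api-breaking` label.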
[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037696#comment-14037696 ] Patrick Wendell commented on SPARK-2202: When the tasks are hanging, could you go to the individual node and run `jstack` on the Executor process? It's possible there is a bug in the HDFS client library, in Spark, or somewhere else. > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman >Priority: Blocker > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. 
> My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2202: --- Priority: Major (was: Blocker) > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. > My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at 
sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037698#comment-14037698 ] Patrick Wendell commented on SPARK-2202: I changed the priority because we usually wait until we've diagnosed the exact issue to assign something as a blocker. > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. 
> My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2180) HiveQL doesn't support GROUP BY with HAVING clauses
[ https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037697#comment-14037697 ] William Benton commented on SPARK-2180: --- PR is here: https://github.com/apache/spark/pull/1136 > HiveQL doesn't support GROUP BY with HAVING clauses > --- > > Key: SPARK-2180 > URL: https://issues.apache.org/jira/browse/SPARK-2180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: William Benton >Priority: Minor > > The HiveQL implementation doesn't support HAVING clauses for aggregations. > This prevents some of the TPCDS benchmarks from running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface
[ https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037692#comment-14037692 ] Patrick Wendell commented on SPARK-2126: Hey All, This proposal is a fairly hairy refactoring of Spark internals. It might not be the best candidate for an external contribution. [~CodingCat] if you wanted to take an initial attempt at this, go right ahead! Just a warning though: it might be that we use your code as a starting point for the design. The final version of this patch will probably need to be written by someone who has worked a lot on these internals ([~markhamstra] you'd actually be a good candidate yourself! but not sure you have the cycles). > Move MapOutputTracker behind ShuffleManager interface > - > > Key: SPARK-2126 > URL: https://issues.apache.org/jira/browse/SPARK-2126 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Matei Zaharia >Assignee: Nan Zhu > > This will require changing the interface between the DAGScheduler and > MapOutputTracker to be method calls on the ShuffleManager instead. However, > it will make it easier to do push-based shuffle and other ideas requiring > changes to map output tracking. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037659#comment-14037659 ] Valeriy Avanesov commented on SPARK-2199: - Here is the implementation we currently have: https://github.com/akopich/dplsa Robust and non-robust PLSA are implemented, but no regularizers are currently supported. > Distributed probabilistic latent semantic analysis in MLlib > --- > > Key: SPARK-2199 > URL: https://issues.apache.org/jira/browse/SPARK-2199 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Denis Turdakov > Labels: features > > Probabilistic latent semantic analysis (PLSA) is a topic model which extracts > topics from a text corpus. PLSA was historically a predecessor of LDA. However, > recent research shows that modifications of PLSA sometimes perform better > than LDA[1]. Furthermore, the most recent paper by the same authors shows that > there is a clear way to extend PLSA to LDA and beyond[2]. > We should implement distributed versions of PLSA. In addition, it should be > possible to easily add user-defined regularizers or combinations of them. We > will implement regularizers that allow us to > * extract sparse topics > * extract human-interpretable topics > * perform semi-supervised training > * sort out non-topic-specific terms. > [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In > Proceedings of ECIR'13. > [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive > Regularization for Stochastic Matrix Factorization. > http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
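For readers unfamiliar with PLSA, the core EM procedure the issue wants to distribute can be sketched on a single machine (this is an illustrative toy, not the distributed implementation in the linked repository, and it implements only plain unregularized PLSA):

```python
import random

def plsa(counts, num_topics, iters=50, seed=0):
    """Plain (non-robust, unregularized) PLSA fit with EM.

    counts maps (doc, word) -> occurrence count. Returns the
    topic-word distributions p(w|z) and doc-topic distributions p(z|d).
    """
    rng = random.Random(seed)
    docs = sorted({d for d, _ in counts})
    words = sorted({w for _, w in counts})
    topics = range(num_topics)

    # Random positive initialization, normalized into distributions.
    p_wz = {z: {w: rng.random() + 0.1 for w in words} for z in topics}
    p_zd = {d: {z: rng.random() + 0.1 for z in topics} for d in docs}
    for z in topics:
        total = sum(p_wz[z].values())
        p_wz[z] = {w: v / total for w, v in p_wz[z].items()}
    for d in docs:
        total = sum(p_zd[d].values())
        p_zd[d] = {z: v / total for z, v in p_zd[d].items()}

    for _ in range(iters):
        nwz = {z: {w: 0.0 for w in words} for z in topics}
        nzd = {d: {z: 0.0 for z in topics} for d in docs}
        for (d, w), n in counts.items():
            # E-step: topic posterior p(z|d,w) proportional to p(w|z)*p(z|d).
            post = [p_wz[z][w] * p_zd[d][z] for z in topics]
            norm = sum(post)
            for z in topics:
                r = n * post[z] / norm
                nwz[z][w] += r
                nzd[d][z] += r
        # M-step: renormalize the expected counts into distributions.
        for z in topics:
            total = sum(nwz[z].values())
            p_wz[z] = {w: v / total for w, v in nwz[z].items()}
        for d in docs:
            total = sum(nzd[d].values())
            p_zd[d] = {z: v / total for z, v in nzd[d].items()}
    return p_wz, p_zd

# Tiny hypothetical corpus: two documents, two evident topics.
topic_word, doc_topic = plsa(
    {("d1", "apple"): 4, ("d1", "fruit"): 3,
     ("d2", "cpu"): 5, ("d2", "chip"): 2},
    num_topics=2, iters=30)
```

A user-defined regularizer, as proposed in the issue, would plug into the M-step by adding a term to the expected counts before renormalization.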
[jira] [Updated] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface
[ https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-2126: Assignee: Nan Zhu > Move MapOutputTracker behind ShuffleManager interface > - > > Key: SPARK-2126 > URL: https://issues.apache.org/jira/browse/SPARK-2126 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Matei Zaharia >Assignee: Nan Zhu > > This will require changing the interface between the DAGScheduler and > MapOutputTracker to be method calls on the ShuffleManager instead. However, > it will make it easier to do push-based shuffle and other ideas requiring > changes to map output tracking. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037637#comment-14037637 ] Xiangrui Meng commented on SPARK-2200: -- [~neville] Do you know the root cause and how this is fixed in breeze 0.8.1? You disabled reference tracking, which may be the reason. > breeze DenseVector not serializable with KryoSerializer > --- > > Key: SPARK-2200 > URL: https://issues.apache.org/jira/browse/SPARK-2200 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector > with KryoSerializer throws the following stack trace. Looks like some > recursive field in the object. Upgrading to 0.8.1 solved this. > {code} > java.lang.StackOverflowError > at java.lang.reflect.Field.getDeclaringClass(Field.java:154) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > ... 
> {code} > Code to reproduce: > {code} > import breeze.linalg.DenseVector > import org.apache.spark.SparkConf > import org.apache.spark.serializer.KryoSerializer > object SerializerTest { > def main(args: Array[String]) { > val conf = new SparkConf() > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.kryo.registrator", classOf[MyRegistrator].getName) > .set("spark.kryo.referenceTracking", "false") > .set("spark.kryoserializer.buffer.mb", "8") > val serializer = new KryoSerializer(conf).newInstance() > serializer.serialize(DenseVector.rand(10)) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
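The reproduction code above references a MyRegistrator class that is not shown in the report. A minimal sketch of what such a registrator might look like follows; the class name comes from the snippet, but which classes it registers is an assumption:

```scala
import breeze.linalg.DenseVector
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: registers the concrete vector class with Kryo so it is
// serialized via a registered class id rather than by class name.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[DenseVector[Double]])
  }
}
```

Note that the snippet also sets spark.kryo.referenceTracking to false; as [~mengxr] points out, disabling reference tracking removes Kryo's cycle detection, so a self-referential field in the object graph can recurse until the stack overflows.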
[jira] [Created] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
Suren Hiraman created SPARK-2202: Summary: saveAsTextFile hangs on final 2 tasks Key: SPARK-2202 URL: https://issues.apache.org/jira/browse/SPARK-2202 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: CentOS 5.7 16 nodes, 24 cores per node, 14g RAM per executor Reporter: Suren Hiraman Priority: Blocker I have a flow that takes in about 10 GB of data and writes out about 10 GB of data. The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining tasks, always on the same node. It seems that the 2 tasks are waiting for data from a remote task/RDD partition. After about 2 hours or so, the stuck tasks get a closed connection exception and you can see the remote side logging that as well. Log lines are below. My custom settings are: conf.set("spark.executor.memory", "14g") // TODO make this configurable // shuffle configs conf.set("spark.default.parallelism", "320") conf.set("spark.shuffle.file.buffer.kb", "200") conf.set("spark.reducer.maxMbInFlight", "96") conf.set("spark.rdd.compress","true") conf.set("spark.worker.timeout","180") // akka settings conf.set("spark.akka.threads", "300") conf.set("spark.akka.timeout", "180") conf.set("spark.akka.frameSize", "100") conf.set("spark.akka.batchSize", "30") conf.set("spark.akka.askTimeout", "30") // block manager conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") conf.set("spark.blockManagerHeartBeatMs", "8") "STUCK" WORKER 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.103,57626) java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) at sun.nio.ch.IOUtil.read(IOUtil.java:224) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) REMOTE WORKER 
14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2051) spark.yarn.dist.* configs are not supported in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-2051. -- Resolution: Fixed Fix Version/s: 1.1.0 > spark.yarn.dist.* configs are not supported in yarn-cluster mode > > > Key: SPARK-2051 > URL: https://issues.apache.org/jira/browse/SPARK-2051 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Guoqiang Li >Assignee: Guoqiang Li > Fix For: 1.1.0 > > > Spark configuration > {{conf/spark-defaults.conf}}: > {quote} > spark.yarn.dist.archives /toona/conf > spark.executor.extraClassPath ./conf > spark.driver.extraClassPath ./conf > {quote} > > HDFS directory > {{hadoop dfs -cat /toona/conf/toona.conf}} : > {quote} > redis.num=4 > {quote} > > The following command execution fails > {code} > YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --num-executors 2 > --driver-memory 2g --executor-memory 2g --master yarn-cluster --class > toona.DeployTest toona-assembly.jar > {code} > > The following is the test code > {code} > package toona > import com.typesafe.config.Config > import com.typesafe.config.ConfigFactory > object DeployTest { > def main(args: Array[String]) { > val conf = ConfigFactory.load("toona.conf") > val redisNum = conf.getInt("redis.num") // Here will throw an > `ConfigException` exception > assert(redisNum == 4) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037467#comment-14037467 ] Helena Edelson commented on SPARK-2198: --- I am sad to hear that the Maven POMs will be primary (vs scala SBT) and staying. It was very odd to see the SBT/Maven redundancies however. > Partition the scala build file so that it is easier to maintain > --- > > Key: SPARK-2198 > URL: https://issues.apache.org/jira/browse/SPARK-2198 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Helena Edelson >Priority: Minor > Original Estimate: 3h > Remaining Estimate: 3h > > Partition to standard Dependencies, Version, Settings, Publish.scala. keeping > the SparkBuild clean to describe the modules and their deps so that changes > in versions, for example, need only be made in Version.scala, settings > changes such as in scalac in Settings.scala, etc. > I'd be happy to do this ([~helena_e]) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2201) Improve FlumeInputDStream
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sunshangchun updated SPARK-2201: Description: Currently only one flume receiver can work with FlumeInputDStream, and I am willing to do some work to improve it. My ideas are described as follows: an IP and port denote a physical host, and a logical host consists of one or more physical hosts. In our case, Spark flume receivers bind themselves to a logical host when started, and a flume agent gets the physical hosts and pushes events to them. Two classes are introduced: LogicalHostRouter supplies a map between logical hosts and physical hosts, and LogicalHostRouterListener makes relation changes watchable. Some work needs to be done here: 1. LogicalHostRouter and LogicalHostRouterListener can be implemented with ZooKeeper: when a physical host starts, it creates a temporary node in ZooKeeper, and listeners just watch those temporary nodes. 2. When Spark FlumeReceivers start, they acquire a physical host (localhost's IP and an idle port) and register themselves with ZooKeeper. 3. A new flume sink: in its appendEvents method, it gets the physical hosts and pushes data to them in a round-robin manner. Is this a feasible plan? Thanks. > Improve FlumeInputDStream > - > > Key: SPARK-2201 > URL: https://issues.apache.org/jira/browse/SPARK-2201 > Project: Spark > Issue Type: Improvement >Reporter: sunshangchun > > Currently only one flume receiver can work with FlumeInputDStream, and I am > willing to do some work to improve it. My ideas are described as follows: > an IP and port denote a physical host, and a logical host consists of one or > more physical hosts. > In our case, Spark flume receivers bind themselves to a logical host when > started, and a flume agent gets the physical hosts and pushes events to them. > Two classes are introduced: LogicalHostRouter supplies a map between logical > hosts and physical hosts, and LogicalHostRouterListener makes relation changes > watchable. 
> Some work needs to be done here: > 1. LogicalHostRouter and LogicalHostRouterListener can be implemented with > ZooKeeper: when a physical host starts, it creates a temporary node in ZooKeeper; listeners just > watch those temporary nodes. > 2. When Spark FlumeReceivers start, they acquire a physical host > (localhost's IP and an idle port) and register themselves with ZooKeeper. > 3. A new flume sink: in its appendEvents method, it gets the physical hosts > and pushes data to them in a round-robin manner. > Is this a feasible plan? Thanks. -- This message was sent by Atlassian JIRA (v6.2#6252)
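Step 2 of the plan above could be sketched with the plain ZooKeeper client API as follows; the znode layout (/spark-flume/<logicalHost>/...) and the method name are illustrative assumptions, not part of the proposal:

```scala
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

// Hypothetical self-registration for a FlumeReceiver: publish this
// receiver's ip:port under its logical host as an ephemeral-sequential
// znode. ZooKeeper removes the node automatically when the receiver's
// session dies, so a LogicalHostRouterListener (step 1) only needs to
// watch the children of /spark-flume/<logicalHost> to track membership.
def registerPhysicalHost(zk: ZooKeeper, logicalHost: String,
                         ip: String, port: Int): String = {
  zk.create(s"/spark-flume/$logicalHost/host-",
    s"$ip:$port".getBytes("UTF-8"),
    ZooDefs.Ids.OPEN_ACL_UNSAFE,
    CreateMode.EPHEMERAL_SEQUENTIAL)
}
```

The returned path includes the sequence number ZooKeeper appends, which gives each physical host a stable, unique identifier for the router's round-robin step 3.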
[jira] [Created] (SPARK-2201) Improve FlumeInputDStream
sunshangchun created SPARK-2201: --- Summary: Improve FlumeInputDStream Key: SPARK-2201 URL: https://issues.apache.org/jira/browse/SPARK-2201 Project: Spark Issue Type: Improvement Reporter: sunshangchun -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037431#comment-14037431 ] Mark Hamstra commented on SPARK-2198: - While this is an admirable goal, I'm afraid that hand editing the SBT build files won't be a very durable solution. That is because it is currently our goal to consolidate the Maven and SBT builds by deriving the SBT build configuration from the Maven POMs: https://issues.apache.org/jira/browse/SPARK-1776. As such, any partitioning of the SBT build file will really need to be incorporated into the code that is generating that file from the Maven input. > Partition the scala build file so that it is easier to maintain > --- > > Key: SPARK-2198 > URL: https://issues.apache.org/jira/browse/SPARK-2198 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Helena Edelson >Priority: Minor > Original Estimate: 3h > Remaining Estimate: 3h > > Partition to standard Dependencies, Version, Settings, Publish.scala. keeping > the SparkBuild clean to describe the modules and their deps so that changes > in versions, for example, need only be made in Version.scala, settings > changes such as in scalac in Settings.scala, etc. > I'd be happy to do this ([~helena_e]) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer
Neville Li created SPARK-2200: - Summary: breeze DenseVector not serializable with KryoSerializer Key: SPARK-2200 URL: https://issues.apache.org/jira/browse/SPARK-2200 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector with KryoSerializer throws the following stack trace. Looks like some recursive field in the object. Upgrading to 0.8.1 solved this. {code} java.lang.StackOverflowError at java.lang.reflect.Field.getDeclaringClass(Field.java:154) at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) at java.lang.reflect.Field.get(Field.java:379) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) ... {code} Code to reproduce: {code} import breeze.linalg.DenseVector import org.apache.spark.SparkConf import org.apache.spark.serializer.KryoSerializer object SerializerTest { def main(args: Array[String]) { val conf = new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrator", classOf[MyRegistrator].getName) .set("spark.kryo.referenceTracking", "false") .set("spark.kryoserializer.buffer.mb", "8") val serializer = new KryoSerializer(conf).newInstance() serializer.serialize(DenseVector.rand(10)) } } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037424#comment-14037424 ] Neville Li commented on SPARK-2200: --- https://github.com/apache/spark/pull/940 addresses this. > breeze DenseVector not serializable with KryoSerializer > --- > > Key: SPARK-2200 > URL: https://issues.apache.org/jira/browse/SPARK-2200 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector > with KryoSerializer throws the following stack trace. Looks like some > recursive field in the object. Upgrading to 0.8.1 solved this. > {code} > java.lang.StackOverflowError > at java.lang.reflect.Field.getDeclaringClass(Field.java:154) > at > sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) > at > sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) > at java.lang.reflect.Field.get(Field.java:379) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) > ... 
> {code} > Code to reproduce: > {code} > import breeze.linalg.DenseVector > import org.apache.spark.SparkConf > import org.apache.spark.serializer.KryoSerializer > object SerializerTest { > def main(args: Array[String]) { > val conf = new SparkConf() > .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > .set("spark.kryo.registrator", classOf[MyRegistrator].getName) > .set("spark.kryo.referenceTracking", "false") > .set("spark.kryoserializer.buffer.mb", "8") > val serializer = new KryoSerializer(conf).newInstance() > serializer.serialize(DenseVector.rand(10)) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2181) The keys for sorting the columns of Executor page in SparkUI are incorrect
[ https://issues.apache.org/jira/browse/SPARK-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037420#comment-14037420 ] Guoqiang Li commented on SPARK-2181: PR: https://github.com/apache/spark/pull/1135 > The keys for sorting the columns of Executor page in SparkUI are incorrect > -- > > Key: SPARK-2181 > URL: https://issues.apache.org/jira/browse/SPARK-2181 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shuo Xiang >Assignee: Guoqiang Li >Priority: Minor > > Under the Executor page of SparkUI, each column is sorted alphabetically > (after clicking). However, it should be sorted by the value, not the string. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Turdakov updated SPARK-2199: -- Description: Probabilistic latent semantic analysis (PLSA) is a topic model which extracts topics from text corpus. PLSA was historically a predecessor of LDA. However recent research shows that modifications of PLSA sometimes performs better then LDA[1]. Furthermore, the most recent paper by same authors shows that there is a clear way to extend PLSA to LDA and beyond[2]. We should implement distributed versions of PLSA. In addition it should be possible to easily add user defined regularizers or combination of them. We will implement regularizers that allows * extract sparse topics * extract human interpretable topics * perform semi-supervised training * sort out non-topic specific terms. [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13. [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf was: Probabilistic latent semantic analysis (PLSA) is a topic model which extracts topics from text corpus. PLSA was historically a predecessor of LDA. However recent research shows that modifications of PLSA sometimes performs better then LDA[1]. Furthermore, the most recent paper by same authors shows that there is a clear way to extend PLSA to LDA and beyond[2]. (empty line) We should implement distributed versions of PLSA. In addition it should be possible to easily add user defined regularizers or combination of them. We will implement regularizers that allows * extract sparse topics * extract human interpretable topics * perform semi-supervised training * sort out non-topic specific terms. [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13. [2] Vorontsov, Potapenko. 
Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf > Distributed probabilistic latent semantic analysis in MLlib > --- > > Key: SPARK-2199 > URL: https://issues.apache.org/jira/browse/SPARK-2199 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Denis Turdakov > Labels: features > > Probabilistic latent semantic analysis (PLSA) is a topic model which extracts > topics from text corpus. PLSA was historically a predecessor of LDA. However > recent research shows that modifications of PLSA sometimes performs better > then LDA[1]. Furthermore, the most recent paper by same authors shows that > there is a clear way to extend PLSA to LDA and beyond[2]. > We should implement distributed versions of PLSA. In addition it should be > possible to easily add user defined regularizers or combination of them. We > will implement regularizers that allows > * extract sparse topics > * extract human interpretable topics > * perform semi-supervised training > * sort out non-topic specific terms. > [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In > Proceedings of ECIR'13. > [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive > Regularization for Stochastic Matrix Factorization. > http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Turdakov updated SPARK-2199: -- Description: Probabilistic latent semantic analysis (PLSA) is a topic model which extracts topics from text corpus. PLSA was historically a predecessor of LDA. However recent research shows that modifications of PLSA sometimes performs better then LDA[1]. Furthermore, the most recent paper by same authors shows that there is a clear way to extend PLSA to LDA and beyond[2]. (empty line) We should implement distributed versions of PLSA. In addition it should be possible to easily add user defined regularizers or combination of them. We will implement regularizers that allows * extract sparse topics * extract human interpretable topics * perform semi-supervised training * sort out non-topic specific terms. [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13. [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf was: Probabilistic latent semantic analysis (PLSA) is a topic model which extracts topics from text corpus. PLSA was historically a predecessor of LDA. However recent research shows that modifications of PLSA sometimes performs better then LDA[1]. Furthermore, the most recent paper by same authors shows that there is a clear way to extend PLSA to LDA and beyond[2]. We should implement distributed versions of PLSA. In addition it should be possible to easily add user defined regularizers or combination of them. We will implement regularizers that allows • extract sparse topics • extract human interpretable topics • perform semi-supervised training • sort out non-topic specific terms. [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13. [2] Vorontsov, Potapenko. 
Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf > Distributed probabilistic latent semantic analysis in MLlib > --- > > Key: SPARK-2199 > URL: https://issues.apache.org/jira/browse/SPARK-2199 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Denis Turdakov > Labels: features > > Probabilistic latent semantic analysis (PLSA) is a topic model which extracts > topics from text corpus. PLSA was historically a predecessor of LDA. However > recent research shows that modifications of PLSA sometimes performs better > then LDA[1]. Furthermore, the most recent paper by same authors shows that > there is a clear way to extend PLSA to LDA and beyond[2]. > (empty line) > We should implement distributed versions of PLSA. In addition it should be > possible to easily add user defined regularizers or combination of them. We > will implement regularizers that allows > * extract sparse topics > * extract human interpretable topics > * perform semi-supervised training > * sort out non-topic specific terms. > [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In > Proceedings of ECIR'13. > [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive > Regularization for Stochastic Matrix Factorization. > http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
Denis Turdakov created SPARK-2199: - Summary: Distributed probabilistic latent semantic analysis in MLlib Key: SPARK-2199 URL: https://issues.apache.org/jira/browse/SPARK-2199 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Denis Turdakov Probabilistic latent semantic analysis (PLSA) is a topic model which extracts topics from a text corpus. PLSA was historically a predecessor of LDA. However, recent research shows that modifications of PLSA sometimes perform better than LDA[1]. Furthermore, the most recent paper by the same authors shows that there is a clear way to extend PLSA to LDA and beyond[2]. We should implement distributed versions of PLSA. In addition, it should be possible to easily add user-defined regularizers or combinations of them. We will implement regularizers that allow one to • extract sparse topics • extract human-interpretable topics • perform semi-supervised training • sort out non-topic-specific terms. [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13. [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2194) EC2 Scripts don't work in Europe
[ https://issues.apache.org/jira/browse/SPARK-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2194. - Resolution: Cannot Reproduce After waiting a few hours, the error message went away. > EC2 Scripts don't work in Europe > > > Key: SPARK-2194 > URL: https://issues.apache.org/jira/browse/SPARK-2194 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.0.0 >Reporter: Michael Armbrust > > When I tried to create a cluster I got: > {code} > Setting up security groups... > ERROR:boto:400 Bad Request > ERROR:boto: > InvalidParameterValueInvalid > value 'null' for protocol. VPC security group rules must specify protocols > explicitly.a9a2a9b3-bcc4-443b-889b-61b0e459f54d > {code} > Switching back to US-EAST fixed the issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Helena Edelson updated SPARK-2198: -- Remaining Estimate: 3h (was: 2h) Original Estimate: 3h (was: 2h) > Partition the scala build file so that it is easier to maintain > --- > > Key: SPARK-2198 > URL: https://issues.apache.org/jira/browse/SPARK-2198 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Helena Edelson >Priority: Minor > Original Estimate: 3h > Remaining Estimate: 3h > > Partition to standard Dependencies, Version, Settings, Publish.scala. keeping > the SparkBuild clean to describe the modules and their deps so that changes > in versions, for example, need only be made in Version.scala, settings > changes such as in scalac in Settings.scala, etc. > I'd be happy to do this ([~helena_e] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Helena Edelson updated SPARK-2198: -- Remaining Estimate: 2h (was: 1m) Original Estimate: 2h (was: 1m) > Partition the scala build file so that it is easier to maintain > --- > > Key: SPARK-2198 > URL: https://issues.apache.org/jira/browse/SPARK-2198 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Helena Edelson >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > Partition to standard Dependencies, Version, Settings, Publish.scala. keeping > the SparkBuild clean to describe the modules and their deps so that changes > in versions, for example, need only be made in Version.scala, settings > changes such as in scalac in Settings.scala, etc. > I'd be happy to do this ([~helena_e] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Helena Edelson updated SPARK-2198: -- Description: Partition to standard Dependencies, Version, Settings, Publish.scala. keeping the SparkBuild clean to describe the modules and their deps so that changes in versions, for example, need only be made in Version.scala, settings changes such as in scalac in Settings.scala, etc. I'd be happy to do this ([~helena_e]) was: Partition to standard Dependencies, Version, Settings, Publish.scala. keeping the SparkBuild clean to describe the modules and their deps so that changes in versions, for example, need only be made in Version.scala, settings changes such as in scalac in Settings.scala, etc. I'd be happy to do this ([~helena_e] > Partition the scala build file so that it is easier to maintain > --- > > Key: SPARK-2198 > URL: https://issues.apache.org/jira/browse/SPARK-2198 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Helena Edelson >Priority: Minor > Original Estimate: 3h > Remaining Estimate: 3h > > Partition to standard Dependencies, Version, Settings, Publish.scala. keeping > the SparkBuild clean to describe the modules and their deps so that changes > in versions, for example, need only be made in Version.scala, settings > changes such as in scalac in Settings.scala, etc. > I'd be happy to do this ([~helena_e]) -- This message was sent by Atlassian JIRA (v6.2#6252)