[jira] [Commented] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063708#comment-15063708 ] Apache Spark commented on SPARK-12409: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/10369 > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Priority: Minor > > For a simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > the filters pushed down to the JDBC layer are EqualTo(THEID,1) and > EqualTo(NAME,fred). These are handled correctly by the current code. > For a query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))), > so we need to add an And filter in the JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
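The point of the description — that an And node can appear nested inside an Or, so the JDBC layer must be able to compile it — can be sketched as follows. This is a hedged illustration in Python, not Spark's actual Scala code; the tuple encoding and function name are made up for the example.

```python
# Illustrative sketch (not Spark's JDBCRDD) of compiling a pushed-down
# filter tree into a SQL WHERE fragment once And is handled alongside
# Or and EqualTo.
def compile_filter(f):
    kind = f[0]
    if kind == "EqualTo":
        _, attr, value = f
        lit = "'%s'" % value if isinstance(value, str) else str(value)
        return "%s = %s" % (attr, lit)
    if kind in ("And", "Or"):
        _, left, right = f
        joiner = " AND " if kind == "And" else " OR "
        return "(" + compile_filter(left) + joiner + compile_filter(right) + ")"
    return None  # unsupported filter: leave it for Spark to evaluate post-scan

# The filter tree from the description:
# Or(EqualTo(THEID,1), And(EqualTo(NAME,mary), EqualTo(THEID,2)))
tree = ("Or",
        ("EqualTo", "THEID", 1),
        ("And", ("EqualTo", "NAME", "mary"), ("EqualTo", "THEID", 2)))
print(compile_filter(tree))
# (THEID = 1 OR (NAME = 'mary' AND THEID = 2))
```

Without an And case, the compiler would have to give up on the whole Or subtree, which is why the simple top-level AND works today but the nested form does not.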
[jira] [Commented] (SPARK-11148) Unable to create views
[ https://issues.apache.org/jira/browse/SPARK-11148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063721#comment-15063721 ] Cheng Lian commented on SPARK-11148: Did you mean the Windows ODBC driver provided by Simba? AFAIK Databricks only provides download links to Simba's Spark ODBC drivers. If that's the case, you might want to check with Simba, since these drivers are not open source. > Unable to create views > -- > > Key: SPARK-11148 > URL: https://issues.apache.org/jira/browse/SPARK-11148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Ubuntu 14.04 > Spark-1.5.1-bin-hadoop2.6 > (I don't have Hadoop or Hive installed) > Start spark-all.sh and thriftserver with mysql jar driver >Reporter: Lunen >Priority: Critical > > I am unable to create views within Spark SQL. > Creating tables without specifying the column names works, e.g. > CREATE TABLE trade2 > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:mysql://192.168.30.191:3318/?user=root", > dbtable "database.trade", > driver "com.mysql.jdbc.Driver" > ); > Creating tables with data types gives an error: > CREATE TABLE trade2( > COL1 timestamp, > COL2 STRING, > COL3 STRING) > USING org.apache.spark.sql.jdbc > OPTIONS ( > url "jdbc:mysql://192.168.30.191:3318/?user=root", > dbtable "database.trade", > driver "com.mysql.jdbc.Driver" > ); > Error: org.apache.spark.sql.AnalysisException: > org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow > user-specified schemas.; SQLState: null ErrorCode: 0 > Trying to create a VIEW from the table that was created (the select statement > below returns data): > CREATE VIEW viewtrade as Select Col1 from trade2; > Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10004]: Line 1:30 Invalid table alias or column > reference 'Col1': (possible column names are: col) > SQLState: null > ErrorCode: 0
[jira] [Created] (SPARK-12421) Fix copy() method of GenericRow
Burkard Doepfner created SPARK-12421: Summary: Fix copy() method of GenericRow Key: SPARK-12421 URL: https://issues.apache.org/jira/browse/SPARK-12421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Burkard Doepfner Priority: Minor The copy() method of the GenericRow class does not actually copy the row; the method just returns itself. Simple code to reproduce the issue: import org.apache.spark.sql.Row val row = Row.fromSeq(Array(1,2,3,4,5)) val arr = row.toSeq.toArray arr(0) = 6 row // first value changed to 6 val rowCopied = row.copy() val arrCopied = rowCopied.toSeq.toArray arrCopied(0) = 7 row // first value still changed (to 7)
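The failure mode reported above — a copy() that returns `this`, so the "copy" shares the original's backing values — can be sketched in a few lines. This is a hedged Python analogy, not Spark's Scala implementation; the class and method names are illustrative.

```python
# Illustrative sketch of the bug: copy_buggy() returns the same object,
# so mutating the "copy" mutates the original; copy_fixed() duplicates
# the backing values first.
class GenericRow:
    def __init__(self, values):
        self.values = values

    def copy_buggy(self):
        return self  # bug: no new backing array is created

    def copy_fixed(self):
        return GenericRow(list(self.values))  # fix: duplicate the values

row = GenericRow([1, 2, 3, 4, 5])
row.copy_buggy().values[0] = 7
print(row.values[0])  # 7 -- the original was mutated through the "copy"

row = GenericRow([1, 2, 3, 4, 5])
row.copy_fixed().values[0] = 7
print(row.values[0])  # 1 -- the original is untouched
```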
[jira] [Updated] (SPARK-12420) Have a built-in CSV data source implementation
[ https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12420: Target Version/s: 2.0.0 > Have a built-in CSV data source implementation > -- > > Key: SPARK-12420 > URL: https://issues.apache.org/jira/browse/SPARK-12420 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > CSV is the most common data format in the "small data" world. It is often the > first format people want to try when they see Spark on a single node. Having > to rely on a 3rd party component for this is a very bad user experience for > new users. > We should consider inlining https://github.com/databricks/spark-csv -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12420) Have a built-in CSV data source implementation
Reynold Xin created SPARK-12420: --- Summary: Have a built-in CSV data source implementation Key: SPARK-12420 URL: https://issues.apache.org/jira/browse/SPARK-12420 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this is a very bad user experience for new users. We should consider inlining https://github.com/databricks/spark-csv -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12421) Fix copy() method of GenericRow
[ https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12421: Assignee: (was: Apache Spark) > Fix copy() method of GenericRow > > > Key: SPARK-12421 > URL: https://issues.apache.org/jira/browse/SPARK-12421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Burkard Doepfner >Priority: Minor > > The copy() method of the GenericRow class does actually not copy itself. The > method just returns itself. > Simple reproduction code of the issue: > import org.apache.spark.sql.Row; > val row = Row.fromSeq(Array(1,2,3,4,5)) > val arr = row.toSeq.toArray > arr(0) = 6 > row // first value changed to 6 > val rowCopied = row.copy() > val arrCopied = rowCopied.toSeq.toArray > arrCopied(0) = 7 > row // first value still changed (to 7) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12421) Fix copy() method of GenericRow
[ https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12421: Assignee: Apache Spark > Fix copy() method of GenericRow > > > Key: SPARK-12421 > URL: https://issues.apache.org/jira/browse/SPARK-12421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Burkard Doepfner >Assignee: Apache Spark >Priority: Minor > > The copy() method of the GenericRow class does actually not copy itself. The > method just returns itself. > Simple reproduction code of the issue: > import org.apache.spark.sql.Row; > val row = Row.fromSeq(Array(1,2,3,4,5)) > val arr = row.toSeq.toArray > arr(0) = 6 > row // first value changed to 6 > val rowCopied = row.copy() > val arrCopied = rowCopied.toSeq.toArray > arrCopied(0) = 7 > row // first value still changed (to 7) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lunen updated SPARK-12403: -- Affects Version/s: 1.5.0 Fix Version/s: 1.4.1 > "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore > > > Key: SPARK-12403 > URL: https://issues.apache.org/jira/browse/SPARK-12403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: ODBC connector query >Reporter: Lunen > Fix For: 1.3.1, 1.4.1 > > > We are unable to query the SPARK tables using the ODBC driver from Simba > Spark(Databricks - "Simba Spark ODBC Driver 1.0") We are able to do a show > databases and show tables, but not any queries. eg. > Working: > Select * from openquery(SPARK,'SHOW DATABASES') > Select * from openquery(SPARK,'SHOW TABLES') > Not working: > Select * from openquery(SPARK,'Select * from lunentest') > The error I get is: > OLE DB provider "MSDASQL" for linked server "SPARK" returned message > "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest". > Msg 7321, Level 16, State 2, Line 2 > An error occurred while preparing the query "Select * from lunentest" for > execution against OLE DB provider "MSDASQL" for linked server "SPARK" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12421) Fix copy() method of GenericRow
[ https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063730#comment-15063730 ] Apache Spark commented on SPARK-12421: -- User 'Apo1' has created a pull request for this issue: https://github.com/apache/spark/pull/10374 > Fix copy() method of GenericRow > > > Key: SPARK-12421 > URL: https://issues.apache.org/jira/browse/SPARK-12421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Burkard Doepfner >Priority: Minor > > The copy() method of the GenericRow class does actually not copy itself. The > method just returns itself. > Simple reproduction code of the issue: > import org.apache.spark.sql.Row; > val row = Row.fromSeq(Array(1,2,3,4,5)) > val arr = row.toSeq.toArray > arr(0) = 6 > row // first value changed to 6 > val rowCopied = row.copy() > val arrCopied = rowCopied.toSeq.toArray > arrCopied(0) = 7 > row // first value still changed (to 7) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation
[ https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063736#comment-15063736 ] Jeff Zhang commented on SPARK-12420: +1, this is a very common data format. Not sure why it was not built in from the beginning. If there's no license issue, we should definitely make it built-in. > Have a built-in CSV data source implementation > -- > > Key: SPARK-12420 > URL: https://issues.apache.org/jira/browse/SPARK-12420 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > CSV is the most common data format in the "small data" world. It is often the > first format people want to try when they see Spark on a single node. Having > to rely on a 3rd party component for this is a very bad user experience for > new users. > We should consider inlining https://github.com/databricks/spark-csv
[jira] [Comment Edited] (SPARK-12420) Have a built-in CSV data source implementation
[ https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063736#comment-15063736 ] Jeff Zhang edited comment on SPARK-12420 at 12/18/15 9:15 AM: -- +1, this is very common data format. Not sure why it is not built in at the beginning. If there's no license issue, then definitely should make it built-in was (Author: zjffdu): +1, this is very common use data format. Not sure why it is not built in at the beginning. If there's no license issue, then definitely should make it built-in > Have a built-in CSV data source implementation > -- > > Key: SPARK-12420 > URL: https://issues.apache.org/jira/browse/SPARK-12420 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > CSV is the most common data format in the "small data" world. It is often the > first format people want to try when they see Spark on a single node. Having > to rely on a 3rd party component for this is a very bad user experience for > new users. > We should consider inlining https://github.com/databricks/spark-csv -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark
[ https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063778#comment-15063778 ] Apache Spark commented on SPARK-12417: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/10375 > Orc bloom filter options are not propagated during file write in spark > -- > > Key: SPARK-12417 > URL: https://issues.apache.org/jira/browse/SPARK-12417 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan > Attachments: SPARK-12417.1.patch > > > The ORC bloom filter is supported by the version of Hive used in Spark 1.5.2. > However, when trying to create an ORC file with the bloom filter option, Spark does not > make use of it. > E.g., the following ORC write does not create the bloom filter even though the > options are specified: > {noformat} > Map orcOption = new HashMap(); > orcOption.put("orc.bloom.filter.columns", "*"); > hiveContext.sql("select * from accounts where > effective_date='2015-12-30'").write(). > format("orc").options(orcOption).save("/tmp/accounts"); > {noformat}
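The fix the issue asks for amounts to one invariant: options supplied via the writer's options() call must survive into the configuration handed to the underlying ORC writer, instead of being dropped. A minimal hedged sketch of that merge (function name and default keys are illustrative, not Spark internals):

```python
# Sketch: user-supplied writer options merged over the defaults, so keys
# like orc.bloom.filter.columns reach the file writer.
def writer_config(defaults, user_options):
    conf = dict(defaults)
    conf.update(user_options)  # user options take precedence and must survive
    return conf

conf = writer_config({"orc.compress": "ZLIB"},
                     {"orc.bloom.filter.columns": "*"})
print(conf["orc.bloom.filter.columns"])  # *
```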
[jira] [Assigned] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark
[ https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12417: Assignee: Apache Spark > Orc bloom filter options are not propagated during file write in spark > -- > > Key: SPARK-12417 > URL: https://issues.apache.org/jira/browse/SPARK-12417 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan >Assignee: Apache Spark > Attachments: SPARK-12417.1.patch > > > ORC bloom filter is supported by the version of hive used in Spark 1.5.2. > However, when trying to create orc file with bloom filter option, it does not > make use of it. > E.g, following orc output does not create the bloom filter even though the > options are specified. > {noformat} > Map orcOption = new HashMap(); > orcOption.put("orc.bloom.filter.columns", "*"); > hiveContext.sql("select * from accounts where > effective_date='2015-12-30'").write(). > format("orc").options(orcOption).save("/tmp/accounts"); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns
[ https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gobinathan SP updated SPARK-12313: -- Description: When spark.sql.hive.metastorePartitionPruning is enabled, getPartitionsByFilter is used. For a table partitioned by p1 and p2, when hc.sql("select col from tabl1 where p1='p1V' and p2= 'p2V' ") is triggered, the HiveShim identifies the predicates and convertFilters returns p1='p1V' and col2= 'p2V'. The same is passed to the getPartitionsByFilter method as the filter string. In these cases the partitions are not returned from Hive's getPartitionsByFilter method. As a result, for the SQL, the number of returned rows is always zero. However, a filter on a single column always works. Probably it doesn't come through this route. I'm using Oracle for the metastore, v0.13.1. was: When enabled spark.sql.hive.metastorePartitionPruning, the getPartitionsByFilter is used For a table partitioned by p1 and p2, when triggered hc.sql("select col from tabl1 where p1='p1V' and p2= 'p2V' ") The HiveShim identifies the Predicates and ConvertFilters returns p1='p1V' and col2= 'p2V' . On these cases the partitions are not returned from Hive's getPartitionsByFilter method. As a result, for the sql, the number of returned rows is always zero. However, filter on a single column always works. Probalbly it doesn't come through this route I'm using Oracle for Metstore V0.13.1 > getPartitionsByFilter doesnt handle predicates on all / multiple Partition > Columns > -- > > Key: SPARK-12313 > URL: https://issues.apache.org/jira/browse/SPARK-12313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Gobinathan SP >Priority: Critical > > When spark.sql.hive.metastorePartitionPruning is enabled, > getPartitionsByFilter is used. > For a table partitioned by p1 and p2, when hc.sql("select col > from tabl1 where p1='p1V' and p2= 'p2V' ") is triggered, > the HiveShim identifies the predicates and convertFilters returns p1='p1V' > and col2= 'p2V'. The same is passed to the getPartitionsByFilter method as > the filter string. > In these cases the partitions are not returned from Hive's > getPartitionsByFilter method. As a result, for the SQL, the number of > returned rows is always zero. > However, a filter on a single column always works. Probably it doesn't come > through this route. > I'm using Oracle for the metastore, v0.13.1.
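For reference, the filter string at the center of the report is just several equality predicates on partition columns joined into one expression before being handed to the metastore. A hedged sketch of that folding step (not Spark's HiveShim; the quoting convention and function name are illustrative):

```python
# Sketch: fold equality predicates on multiple partition columns into the
# single filter string passed to the metastore's getPartitionsByFilter.
def to_metastore_filter(predicates):
    # predicates: list of (column, value) equality predicates
    return " and ".join('%s = "%s"' % (col, val) for col, val in predicates)

print(to_metastore_filter([("p1", "p1V"), ("p2", "p2V")]))
# p1 = "p1V" and p2 = "p2V"
```

The report says a single-column filter works but the multi-column form returns no partitions, which points at how this combined string is built or parsed rather than at the predicates themselves.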
[jira] [Assigned] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)
[ https://issues.apache.org/jira/browse/SPARK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12400: Assignee: (was: Apache Spark) > Avoid writing a shuffle file if a partition has no output (empty) > - > > Key: SPARK-12400 > URL: https://issues.apache.org/jira/browse/SPARK-12400 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Reynold Xin > > A Spark user was asking for automatic setting of # reducers. When I pushed > for more, it turned out the problem for them is that 200 creates too many > files, when most partitions are empty. > It seems like a simple thing we can do is to avoid creating shuffle files if > a partition is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
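The proposed improvement is simple to state: skip file creation entirely for partitions that produced no records. A hedged sketch of that behavior (file naming and layout here are illustrative, not Spark's shuffle format):

```python
# Sketch: only create a shuffle output file when the partition has records.
import os
import tempfile

def write_shuffle_partitions(partitions, out_dir):
    written = []
    for i, records in enumerate(partitions):
        if not records:
            continue  # empty partition: no file at all
        path = os.path.join(out_dir, "shuffle_%d.data" % i)
        with open(path, "w") as f:
            f.writelines(str(r) + "\n" for r in records)
        written.append(path)
    return written

out_dir = tempfile.mkdtemp()
files = write_shuffle_partitions([[1, 2], [], [], [3]], out_dir)
print(len(files))  # 2 files for 4 partitions, since 2 are empty
```

With the default of 200 shuffle partitions and mostly empty output, this avoids creating the large number of tiny files the user complained about.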
[jira] [Assigned] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)
[ https://issues.apache.org/jira/browse/SPARK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12400: Assignee: Apache Spark > Avoid writing a shuffle file if a partition has no output (empty) > - > > Key: SPARK-12400 > URL: https://issues.apache.org/jira/browse/SPARK-12400 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Reynold Xin >Assignee: Apache Spark > > A Spark user was asking for automatic setting of # reducers. When I pushed > for more, it turned out the problem for them is that 200 creates too many > files, when most partitions are empty. > It seems like a simple thing we can do is to avoid creating shuffle files if > a partition is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-12413: --- Assignee: Michael Gummelt > Mesos ZK persistence throws a NotSerializableException > -- > > Key: SPARK-12413 > URL: https://issues.apache.org/jira/browse/SPARK-12413 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt > > https://github.com/apache/spark/pull/10359 breaks ZK persistence due to > https://issues.scala-lang.org/browse/SI-6654 > This line throws a NotSerializable exception: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster > The MesosClusterDispatcher attempts to serialize MesosDriverDescription > objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the > {{command}} property is unserializable > Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0 > 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: > 0x151b1d1567e0002 after 0ms > 15/12/17 21:52:44 DEBUG nio: created > SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} > 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591 > 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on > AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1 > 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null > -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091 > 15/12/17 
21:52:44 DEBUG ServletHandler: chain=null > 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create > java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1 > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at org.apache.spark.util.Utils$.serialize(Utils.scala:83) > at > org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166) > at > org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132) > at > org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-12413. Resolution: Fixed > Mesos ZK persistence throws a NotSerializableException > -- > > Key: SPARK-12413 > URL: https://issues.apache.org/jira/browse/SPARK-12413 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt > Fix For: 1.6.0, 2.0.0 > > > https://github.com/apache/spark/pull/10359 breaks ZK persistence due to > https://issues.scala-lang.org/browse/SI-6654 > This line throws a NotSerializable exception: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster > The MesosClusterDispatcher attempts to serialize MesosDriverDescription > objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the > {{command}} property is unserializable > Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0 > 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: > 0x151b1d1567e0002 after 0ms > 15/12/17 21:52:44 DEBUG nio: created > SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} > 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591 > 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on > AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1 > 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null > -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091 
> 15/12/17 21:52:44 DEBUG ServletHandler: chain=null > 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create > java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1 > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at org.apache.spark.util.Utils$.serialize(Utils.scala:83) > at > org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166) > at > org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132) > at > org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-12413: --- Fix Version/s: 2.0.0 1.6.0 > Mesos ZK persistence throws a NotSerializableException > -- > > Key: SPARK-12413 > URL: https://issues.apache.org/jira/browse/SPARK-12413 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt > Fix For: 1.6.0, 2.0.0 > > > https://github.com/apache/spark/pull/10359 breaks ZK persistence due to > https://issues.scala-lang.org/browse/SI-6654 > This line throws a NotSerializable exception: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster > The MesosClusterDispatcher attempts to serialize MesosDriverDescription > objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the > {{command}} property is unserializable > Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0 > 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: > 0x151b1d1567e0002 after 0ms > 15/12/17 21:52:44 DEBUG nio: created > SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} > 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591 > 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on > AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1 > 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null > -> 
org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091 > 15/12/17 21:52:44 DEBUG ServletHandler: chain=null > 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create > java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1 > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at org.apache.spark.util.Utils$.serialize(Utils.scala:83) > at > org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166) > at > org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132) > at > org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063845#comment-15063845 ] Kousuke Saruta commented on SPARK-12413: Memorandum: If 1.6.0-RC4 is not cut, we should modify Fix Versions from 1.6.0 to 1.6.1. > Mesos ZK persistence throws a NotSerializableException > -- > > Key: SPARK-12413 > URL: https://issues.apache.org/jira/browse/SPARK-12413 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.6.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt > Fix For: 1.6.0, 2.0.0 > > > https://github.com/apache/spark/pull/10359 breaks ZK persistence due to > https://issues.scala-lang.org/browse/SI-6654 > This line throws a NotSerializable exception: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster > The MesosClusterDispatcher attempts to serialize MesosDriverDescription > objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the > {{command}} property is unserializable > Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0 > 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: > 0x151b1d1567e0002 after 0ms > 15/12/17 21:52:44 DEBUG nio: created > SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} > 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591 > 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on > AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1 > 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ > o.s.j.s.ServletContextHandler{/,null} > 15/12/17 21:52:44 DEBUG 
ServletHandler: servlet |/v1/submissions/create|null > -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091 > 15/12/17 21:52:44 DEBUG ServletHandler: chain=null > 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create > java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1 > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at org.apache.spark.util.Utils$.serialize(Utils.scala:83) > at > org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166) > at > org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132) > at > org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12393) Add read.text and write.text for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12393: Assignee: (was: Apache Spark) > Add read.text and write.text for SparkR > --- > > Key: SPARK-12393 > URL: https://issues.apache.org/jira/browse/SPARK-12393 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yanbo Liang > > Add read.text and write.text for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12393) Add read.text and write.text for SparkR
[ https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12393: Assignee: Apache Spark > Add read.text and write.text for SparkR > --- > > Key: SPARK-12393 > URL: https://issues.apache.org/jira/browse/SPARK-12393 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > > Add read.text and write.text for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)
[ https://issues.apache.org/jira/browse/SPARK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063891#comment-15063891 ] Apache Spark commented on SPARK-12400: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/10376 > Avoid writing a shuffle file if a partition has no output (empty) > - > > Key: SPARK-12400 > URL: https://issues.apache.org/jira/browse/SPARK-12400 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Reynold Xin > > A Spark user was asking for automatic setting of # reducers. When I pushed > for more, it turned out the problem for them is that 200 creates too many > files, when most partitions are empty. > It seems like a simple thing we can do is to avoid creating shuffle files if > a partition is empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12422) Binding Spark Standalone Master to public IP fails
Bennet Jeutter created SPARK-12422: -- Summary: Binding Spark Standalone Master to public IP fails Key: SPARK-12422 URL: https://issues.apache.org/jira/browse/SPARK-12422 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.5.2 Environment: Fails on direct deployment on Mac OSX and also in Docker Environment (running on OSX or Ubuntu) Reporter: Bennet Jeutter Priority: Blocker The start of the Spark Standalone Master fails, when the host specified equals the public IP address. For example I created a Docker Machine with public IP 192.168.99.100, then I run: /usr/spark/bin/spark-class org.apache.spark.deploy.master.Master -h 192.168.99.100 It'll fail with: Exception in thread "main" java.net.BindException: Failed to bind to: /192.168.99.100:7093: Service 'sparkMaster' failed after 16 retries! at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393) at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389) at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) at scala.util.Try$.apply(Try.scala:161) at scala.util.Success.map(Try.scala:206) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91) at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90) at 
akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) So I thought, oh well, let's just bind to the local IP and access it via the public IP - this doesn't work either; it gives: dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://sparkMaster@192.168.99.100:7077/]] arriving at [akka.tcp://sparkMaster@192.168.99.100:7077] inbound addresses are [akka.tcp://sparkMaster@spark-master:7077] So there is currently no way to run this setup. Related Stack Overflow issues: * http://stackoverflow.com/questions/31659228/getting-java-net-bindexception-when-attempting-to-start-spark-master-on-ec2-node * http://stackoverflow.com/questions/33768029/access-apache-spark-standalone-master-via-ip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12423) Mesos executor home should not be resolved on the driver's file system
Iulian Dragos created SPARK-12423: - Summary: Mesos executor home should not be resolved on the driver's file system Key: SPARK-12423 URL: https://issues.apache.org/jira/browse/SPARK-12423 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.6.0 Reporter: Iulian Dragos {{spark.mesos.executor.home}} should be an uninterpreted string. It is very possible that this path does not exist on the driver, and if it does, it may be a symlink that should not be resolved. Currently, this leads to failures in client mode. For example, setting it to {{/var/spark/spark-1.6.0-bin-hadoop2.6/}} leads to executors failing: {code} sh: 1: /private/var/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-class: not found {code} {{getCanonicalPath}} transforms {{/var/spark...}} into {{/private/var..}} because on my system there is a symlink from one to the other. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
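The symlink behavior described here is standard `java.io.File` semantics: `getCanonicalPath` resolves `..` segments and symlinks against the filesystem of the machine it runs on, while the raw absolute path is left alone. A small illustration (on macOS, where `/var` is a symlink to `/private/var`, the two results differ):

```scala
import java.io.File

object CanonicalPathDemo extends App {
  val f = new File("/var/tmp/../tmp")

  // getAbsolutePath keeps the path as written; getCanonicalPath resolves
  // ".." segments and any symlinks on the local machine. Canonicalizing
  // spark.mesos.executor.home on the driver therefore bakes the driver's
  // symlink layout into a path that must later work on the executors.
  println(f.getAbsolutePath)   // /var/tmp/../tmp
  println(f.getCanonicalPath)  // e.g. /private/var/tmp on macOS, /var/tmp on Linux
}
```

This is why the issue argues the setting should be passed through uninterpreted rather than canonicalized on the driver.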
[jira] [Updated] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env
[ https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6936: - Assignee: Michael Armbrust > SQLContext.sql() caused deadlock in multi-thread env > > > Key: SPARK-6936 > URL: https://issues.apache.org/jira/browse/SPARK-6936 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: JDK 1.8.x, RedHat > Linux version 2.6.32-431.23.3.el6.x86_64 > (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red > Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014 >Reporter: Paul Wu >Assignee: Michael Armbrust > Labels: deadlock, sql, threading > Fix For: 1.5.0 > > > Running the same query in more than one thread with SQLContext.sql may lead > to deadlock. Here is a way to reproduce it (since this is a multi-thread issue, > the reproduction may or may not be easy). > 1. Register a relatively big table. > 2. Create two different classes and in the classes, do the same query in a > method, put the results in a set, and print out the set size. > 3. Create two threads that each use an object of one class in the run method. > Start the threads. In my tests, a deadlock can occur within just a few runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12396) Once the driver client has registered successfully, it still retries to connect to the master.
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12396: -- Flags: (was: Patch) Target Version/s: (was: 1.5.2) Labels: (was: patch) Fix Version/s: (was: 1.5.2) [~ZhangMei] please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set target/fix version, and there's no 'patch' type or flag used here. I don't see a pull request. > Once the driver client has registered successfully, it still retries to > connect to the master. > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once a driver connects to a master > successfully, all scheduling work and Futures will be cancelled. But > currently, it still tries to connect to the master, which should not happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12387) JDBC IN operator push down
[ https://issues.apache.org/jira/browse/SPARK-12387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12387: -- Fix Version/s: (was: 1.6.0) > JDBC IN operator push down > --- > > Key: SPARK-12387 > URL: https://issues.apache.org/jira/browse/SPARK-12387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > For SQL IN operator such as > SELECT column_name(s) > FROM table_name > WHERE column_name IN (value1,value2,...) > Currently this is not pushed down for JDBC datasource. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
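Pushing IN down means compiling Spark's `In` filter into a SQL fragment that the remote database evaluates itself, analogous to the existing `EqualTo` handling. A hypothetical `compileIn` helper (not Spark's actual JDBC code) sketches the idea:

```scala
object InFilterSketch extends App {
  // Hypothetical compiler for an IN filter: quote string literals, leave
  // numbers as-is, and join the values into a single SQL predicate.
  def compileIn(column: String, values: Seq[Any]): String = {
    val rendered = values.map {
      case s: String => s"'${s.replace("'", "''")}'" // naive quoting, for the sketch only
      case v         => v.toString
    }
    s"$column IN (${rendered.mkString(", ")})"
  }

  println(compileIn("THEID", Seq(1, 2, 3)))       // THEID IN (1, 2, 3)
  println(compileIn("NAME", Seq("mary", "fred"))) // NAME IN ('mary', 'fred')
}
```

A real implementation would additionally have to bail out (return no pushed-down fragment) for value types the dialect cannot render safely.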
[jira] [Updated] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12403: -- Fix Version/s: (was: 1.4.1) (was: 1.3.1) [~lunendl] please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark It doesn't make sense to set fix version, let alone to 1.3.1/1.4.1 > "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore > > > Key: SPARK-12403 > URL: https://issues.apache.org/jira/browse/SPARK-12403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: ODBC connector query >Reporter: Lunen > > We are unable to query the SPARK tables using the ODBC driver from Simba > Spark(Databricks - "Simba Spark ODBC Driver 1.0") We are able to do a show > databases and show tables, but not any queries. eg. > Working: > Select * from openquery(SPARK,'SHOW DATABASES') > Select * from openquery(SPARK,'SHOW TABLES') > Not working: > Select * from openquery(SPARK,'Select * from lunentest') > The error I get is: > OLE DB provider "MSDASQL" for linked server "SPARK" returned message > "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest". > Msg 7321, Level 16, State 2, Line 2 > An error occurred while preparing the query "Select * from lunentest" for > execution against OLE DB provider "MSDASQL" for linked server "SPARK" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12401) Add support for enums in postgres
[ https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12401: -- Priority: Minor (was: Major) Component/s: SQL > Add support for enums in postgres > - > > Key: SPARK-12401 > URL: https://issues.apache.org/jira/browse/SPARK-12401 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar >Priority: Minor > > JSON and JSONB types [are now > converted|https://github.com/apache/spark/pull/8948/files] into strings on > the Spark side instead of throwing. It would be great it [enumerated > types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were > treated similarly instead of failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12346) GLM summary crashes with NoSuchElementException if attributes are missing names
[ https://issues.apache.org/jira/browse/SPARK-12346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12346: -- Component/s: SparkR > GLM summary crashes with NoSuchElementException if attributes are missing > names > --- > > Key: SPARK-12346 > URL: https://issues.apache.org/jira/browse/SPARK-12346 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Eric Liang > > In getModelFeatures() of SparkRWrappers.scala, we call _.name.get on all the > feature column attributes. This fails when the attribute name is not defined. > One way of reproducing this is to perform glm() in R with a vector-type input > feature that lacks ML attrs, then trying to call summary() on it, for example: > {code} > df <- sql(sqlContext, "SELECT * FROM testData") > df2 <- withColumnRenamed(df, "f1", "f2") // This drops the ML attrs from f1 > lrModel <- glm(hours_per_week ~ f2, data = df2, family = "gaussian") > summary(lrModel) // NoSuchElementException > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
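The crash comes from calling `.get` on an empty `Option`: `Option#get` throws `NoSuchElementException` when an attribute has no name. A minimal sketch of the failure mode and the usual positional-fallback fix (the `Attr` case class here is illustrative, not Spark's ML attribute API):

```scala
object MissingNameDemo extends App {
  // Illustrative stand-in for an ML attribute whose name may be undefined.
  case class Attr(name: Option[String])

  val attrs = Seq(Attr(Some("hours_per_week")), Attr(None))

  // _.name.get throws NoSuchElementException on the unnamed attribute...
  val crashed =
    try { attrs.map(_.name.get); false }
    catch { case _: NoSuchElementException => true }

  // ...whereas a generated fallback name keeps summary() usable.
  val safeNames = attrs.zipWithIndex.map { case (a, i) => a.name.getOrElse(s"V$i") }

  println(s"crashed: $crashed, names: ${safeNames.mkString(", ")}")
}
```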
[jira] [Resolved] (SPARK-12418) spark shuffle FAILED_TO_UNCOMPRESS
[ https://issues.apache.org/jira/browse/SPARK-12418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12418. --- Resolution: Duplicate Target Version/s: (was: 1.5.1) Please search JIRA first > spark shuffle FAILED_TO_UNCOMPRESS > -- > > Key: SPARK-12418 > URL: https://issues.apache.org/jira/browse/SPARK-12418 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 > Environment: hadoop 2.3.0 > spark 1.5.1 >Reporter: dirk.zhang > > when use default compression snappy,I get error when spark doing shuffle > Job aborted due to stage failure: Task 19 in stage 2.3 failed 4 times, > most recent failure: Lost task 19.3 in stage 2.3 (TID 10311, 192.168.6.36): > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1179) > at > org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:53) > at > org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:52) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) 
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12370) Documentation should link to examples from its own release version
[ https://issues.apache.org/jira/browse/SPARK-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12370: -- Priority: Minor (was: Major) Component/s: Documentation > Documentation should link to examples from its own release version > -- > > Key: SPARK-12370 > URL: https://issues.apache.org/jira/browse/SPARK-12370 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Brian London >Priority: Minor > > When documentation is built is should reference examples from the same build. > There are times when the docs have links that point to files in the github > head which may not be valid on the current release. > As an example the spark streaming page for 1.5.2 (currently at > http://spark.apache.org/docs/latest/streaming-programming-guide.html) links > to the stateful network word count example (at > https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala). > That example file utilizes a number of 1.6 features that are not available > in 1.5.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12369) DataFrameReader fails on globbing parquet paths
[ https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12369: Assignee: Apache Spark > DataFrameReader fails on globbing parquet paths > --- > > Key: SPARK-12369 > URL: https://issues.apache.org/jira/browse/SPARK-12369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yana Kadiyska >Assignee: Apache Spark > > Start with a list of parquet paths where some or all do not exist: > {noformat} > val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet") > sqlContext.read.parquet(paths:_*) > java.lang.NullPointerException > at org.apache.hadoop.fs.Globber.glob(Globber.java:218) > at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625) > at > org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251) > at > org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258) > at > org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264) > at > org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260) > {noformat} > It would be better to produce a dataframe from the paths that do exist and > log a warning that a path was missing. 
Not sure about the "all paths are missing" > case -- probably return an empty DataFrame with no schema, since that method already > does so on an empty path list. But I would prefer not to have to pre-validate > paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
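The behavior suggested above — expand each glob, keep what matches, and warn instead of hitting an NPE — can be sketched with plain `java.nio.file` globbing. `expandGlob` is a hypothetical helper for illustration, not Spark's (Hadoop-based) implementation:

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object GlobSketch extends App {
  // Hypothetical helper: expand a single-level glob pattern against the
  // local filesystem, returning an empty result when the parent directory
  // is missing rather than failing.
  def expandGlob(pattern: String): Seq[Path] = {
    val p = Paths.get(pattern)
    val dir = Option(p.getParent).getOrElse(Paths.get("."))
    if (!Files.isDirectory(dir)) Seq.empty
    else {
      val stream = Files.newDirectoryStream(dir, p.getFileName.toString)
      try stream.asScala.toVector finally stream.close()
    }
  }

  val patterns = Seq("/tmp/*", "/no/such/dir/month=06/*.parquet")
  val (found, missing) = patterns.partition(expandGlob(_).nonEmpty)
  missing.foreach(p => println(s"WARN: path matched nothing: $p"))
  println(s"patterns with matches: ${found.mkString(", ")}")
}
```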
[jira] [Resolved] (SPARK-9057) Add Scala, Java and Python example to show DStream.transform
[ https://issues.apache.org/jira/browse/SPARK-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9057. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 8431 [https://github.com/apache/spark/pull/8431] > Add Scala, Java and Python example to show DStream.transform > > > Key: SPARK-9057 > URL: https://issues.apache.org/jira/browse/SPARK-9057 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > Labels: starter > Fix For: 2.0.0 > > > Currently there is no example to show the use of transform. Would be good to > add an example, that uses transform to join a static RDD with the RDDs of a > DStream. > Need to be done for all 3 supported languages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12369) DataFrameReader fails on globbing parquet paths
[ https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12369: Assignee: (was: Apache Spark) > DataFrameReader fails on globbing parquet paths > --- > > Key: SPARK-12369 > URL: https://issues.apache.org/jira/browse/SPARK-12369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yana Kadiyska > > Start with a list of parquet paths where some or all do not exist: > {noformat} > val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet") > sqlContext.read.parquet(paths:_*) > java.lang.NullPointerException > at org.apache.hadoop.fs.Globber.glob(Globber.java:218) > at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625) > at > org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251) > at > org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258) > at > org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264) > at > org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260) > {noformat} > It would be better to produce a dataframe from the paths that do exist and > log a warning that a path was missing. 
Not sure about the "all paths are missing" > case -- probably return an empty DataFrame with no schema, since that method already > does so on an empty path list. But I would prefer not to have to pre-validate > paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8318) Spark Streaming Starter JIRAs
[ https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8318. -- Resolution: Implemented > Spark Streaming Starter JIRAs > - > > Key: SPARK-8318 > URL: https://issues.apache.org/jira/browse/SPARK-8318 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Tathagata Das >Priority: Minor > Labels: starter > > This is a master JIRA to collect together all starter tasks related to Spark > Streaming. These are simple tasks that contributors can do to get familiar > with the process of contributing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9057) Add Scala, Java and Python example to show DStream.transform
[ https://issues.apache.org/jira/browse/SPARK-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9057: - Assignee: Jeff Lam > Add Scala, Java and Python example to show DStream.transform > > > Key: SPARK-9057 > URL: https://issues.apache.org/jira/browse/SPARK-9057 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Jeff Lam > Labels: starter > Fix For: 2.0.0 > > > Currently there is no example to show the use of transform. Would be good to > add an example, that uses transform to join a static RDD with the RDDs of a > DStream. > Need to be done for all 3 supported languages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Roberts updated SPARK-12319: - Environment: Problems are evident on BE (was: BE platforms) > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems are evident on BE >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. > https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
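The underlying hazard behind these failures is that raw bytes written under one byte order and reinterpreted under another decode to different values, which is why a serializer with a little-endian assumption misbehaves on big-endian hosts. A small `ByteBuffer` illustration of the mismatch:

```scala
import java.nio.{ByteBuffer, ByteOrder}

object EndianDemo extends App {
  // The same Int is laid out differently under each byte order...
  val be = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(1).array
  val le = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(1).array
  println(s"big-endian:    ${be.mkString(" ")}") // 0 0 0 1
  println(s"little-endian: ${le.mkString(" ")}") // 1 0 0 0

  // ...so decoding LE bytes with BE semantics (what happens when the
  // reader's byte order doesn't match the writer's) yields a wrong value.
  val misread = ByteBuffer.wrap(Array[Byte](1, 0, 0, 0)).order(ByteOrder.BIG_ENDIAN).getInt
  println(s"LE bytes read as BE: $misread") // 16777216, not 1
}
```

Pinning an explicit byte order at both ends (as the `LittleEndianDataInputStream`/`LittleEndianDataOutputStream` experiment above does) removes the platform dependence, at whatever performance cost the extra wrapping introduces.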
[jira] [Commented] (SPARK-12369) DataFrameReader fails on globbing parquet paths
[ https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064059#comment-15064059 ] Apache Spark commented on SPARK-12369: -- User 'yanakad' has created a pull request for this issue: https://github.com/apache/spark/pull/10379 > DataFrameReader fails on globbing parquet paths > --- > > Key: SPARK-12369 > URL: https://issues.apache.org/jira/browse/SPARK-12369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yana Kadiyska > > Start with a list of parquet paths where some or all do not exist: > {noformat} > val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet") > sqlContext.read.parquet(paths:_*) > java.lang.NullPointerException > at org.apache.hadoop.fs.Globber.glob(Globber.java:218) > at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625) > at > org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251) > at > org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258) > at > org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264) > at > org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260) > {noformat} > It would be better to produce a dataframe from the paths that do exist and > log a warning that a path was missing. 
Not sure about the "all paths are missing" > case -- probably return an empty DataFrame with no schema, since that method already > does so on an empty path list. But I would prefer not to have to pre-validate > paths -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
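The behavior the reporter asks for (keep the patterns that match something, warn about the rest) can be sketched with local filesystem globbing. This is only an illustration of the proposed semantics, not the Hadoop `globStatus` code path that actually throws the NPE, and the helper name `existing_glob_paths` is mine:

```python
import glob

def existing_glob_paths(patterns):
    """Return the glob patterns that match at least one path,
    warning (instead of failing) about patterns that match nothing."""
    kept = []
    for pattern in patterns:
        if glob.glob(pattern):
            kept.append(pattern)
        else:
            print("warning: no files match %s" % pattern)
    return kept
```

Applied to the reported case, the read would proceed over the existing month partitions and merely log the missing ones.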
[jira] [Created] (SPARK-12424) The implementation of ParamMap#filter is wrong.
Kousuke Saruta created SPARK-12424: -- Summary: The implementation of ParamMap#filter is wrong. Key: SPARK-12424 URL: https://issues.apache.org/jira/browse/SPARK-12424 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.6.0, 2.0.0 Reporter: Kousuke Saruta ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`. Also, the map returned by `Map#filterKeys` is not Serializable; this is a known Scala issue (https://issues.scala-lang.org/browse/SI-6654). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
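The underlying pitfall is that `filterKeys` returns a lazy, unserializable view whose concrete type differs from what the cast expects. A language-neutral sketch of the safe pattern (materialize a fresh collection rather than casting a view), shown here in Python purely for illustration; the helper name `filter_params` is mine, not the `ParamMap` API:

```python
def filter_params(params, predicate):
    """Filter a parameter map by key, materializing a new dict.

    Building a fresh, concrete collection (instead of casting a lazy
    filtered view to the expected type) avoids the class of failure
    described for ParamMap#filter.
    """
    return {k: v for k, v in params.items() if predicate(k)}
```

The returned dict is an ordinary, independent map, so it can be serialized and mutated without surprising the caller.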
[jira] [Commented] (SPARK-12424) The implementation of ParamMap#filter is wrong.
[ https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064113#comment-15064113 ] Apache Spark commented on SPARK-12424: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/10381 > The implementation of ParamMap#filter is wrong. > --- > > Key: SPARK-12424 > URL: https://issues.apache.org/jira/browse/SPARK-12424 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Kousuke Saruta > > ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` > is collection.Map, not mutable.Map, but the result is cast to mutable.Map > using `asInstanceOf`, so we get a `ClassCastException`. > Also, the map returned by `Map#filterKeys` is not Serializable; this is a known > Scala issue (https://issues.scala-lang.org/browse/SI-6654). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12319) Address endian specific problems surfaced in 1.6
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Roberts updated SPARK-12319: - Environment: Problems apparent on BE, LE could be impacted too (was: Problems are evident on BE) > Address endian specific problems surfaced in 1.6 > > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. > https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.
[ https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12424: Assignee: (was: Apache Spark) > The implementation of ParamMap#filter is wrong. > --- > > Key: SPARK-12424 > URL: https://issues.apache.org/jira/browse/SPARK-12424 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Kousuke Saruta > > ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` > is collection.Map, not mutable.Map, but the result is cast to mutable.Map > using `asInstanceOf`, so we get a `ClassCastException`. > Also, the map returned by `Map#filterKeys` is not Serializable; this is a known > Scala issue (https://issues.scala-lang.org/browse/SI-6654). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.
[ https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12424: Assignee: Apache Spark > The implementation of ParamMap#filter is wrong. > --- > > Key: SPARK-12424 > URL: https://issues.apache.org/jira/browse/SPARK-12424 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` > is collection.Map, not mutable.Map, but the result is cast to mutable.Map > using `asInstanceOf`, so we get a `ClassCastException`. > Also, the map returned by `Map#filterKeys` is not Serializable; this is a known > Scala issue (https://issues.scala-lang.org/browse/SI-6654). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong
[ https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064200#comment-15064200 ] Thomas Graves commented on SPARK-11701: --- I ran into another instance of this: it happens when the job has multiple stages. If it's not the last stage and both speculative tasks finish, they are both marked as success. One of them gets ignored, which can leave the counts wrong and shows that an executor still has a task. 15/12/18 16:01:08 INFO scheduler.TaskSetManager: Ignoring task-finished event for 8.1 in stage 0.0 because task 8 has already completed successfully In this case the TaskCommit code and DAG scheduler won't handle it; TaskSetManager.handleSuccessfulTask needs to handle it. > YARN - dynamic allocation and speculation active task accounting wrong > -- > > Key: SPARK-11701 > URL: https://issues.apache.org/jira/browse/SPARK-11701 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > I am using dynamic container allocation and speculation and am seeing issues > with the active task accounting. The Executor UI still shows active tasks on > an executor but the job/stage is all completed. I think it's also > affecting the dynamic allocation being able to release containers because it > thinks there are still tasks. 
> It's easily reproduced by using spark-shell: turn on dynamic allocation, then > run just a wordcount on a decent-sized file and set the speculation parameters > low: > spark.dynamicAllocation.enabled true > spark.shuffle.service.enabled true > spark.dynamicAllocation.maxExecutors 10 > spark.dynamicAllocation.minExecutors 2 > spark.dynamicAllocation.initialExecutors 10 > spark.dynamicAllocation.executorIdleTimeout 40s > $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf > spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 > --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10291) Add statsByKey method to compute StatCounters for each key in an RDD
[ https://issues.apache.org/jira/browse/SPARK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064266#comment-15064266 ] Sean Owen commented on SPARK-10291: --- My POV is that this isn't likely worth adding a method for. I appreciate the value of utility methods but have to weigh it against adding another item to a core API and how often it'd be used. This is also straightforward to express in Spark SQL on a dataframe, no? > Add statsByKey method to compute StatCounters for each key in an RDD > > > Key: SPARK-10291 > URL: https://issues.apache.org/jira/browse/SPARK-10291 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Erik Shilts >Priority: Minor > > A common task is to summarize numerical data for different groups. Having a > `statsByKey` method would simplify this so the user would not have to write > the aggregators for all the statistics or manage collecting by key and > computing individual StatCounters. > This should be a straightforward addition to PySpark. I can look into adding > to Scala and R if we want to maintain feature parity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
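The proposed `statsByKey` boils down to grouping numeric values by key and computing summary statistics per group. A small pure-Python sketch of that aggregation (not the proposed PySpark API; the helper name `stats_by_key` and the chosen statistics are illustrative):

```python
from collections import defaultdict
import statistics

def stats_by_key(pairs):
    """Compute count/mean/min/max per key from (key, value) pairs,
    a simplified stand-in for per-key StatCounters."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {
            "count": len(vals),
            "mean": statistics.mean(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for key, vals in groups.items()
    }
```

Sean's point above is that the DataFrame API already expresses this directly, roughly `df.groupBy("key").agg(...)` with the desired aggregate functions.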
[jira] [Assigned] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12409: Assignee: Apache Spark > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > For a simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > the filters pushed down to the JDBC layer are EqualTo(THEID,1), > EqualTo(Name,fred). These are handled OK by the current code. > For a query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))) > So we need to add an And filter in the JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
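The pushdown in question turns a filter tree such as Or(EqualTo(THEID,1), And(EqualTo(NAME,mary), EqualTo(THEID,2))) into a WHERE-clause string for the JDBC source. A hedged sketch of that compilation step, with tuples standing in for Catalyst filter nodes (this is not Spark's actual JDBCRDD code, and `compile_filter` is an illustrative name):

```python
def compile_filter(f):
    """Compile an ("op", ...) filter tuple into a SQL predicate string.

    Supported nodes: ("eq", column, value), ("and", left, right),
    ("or", left, right).
    """
    op = f[0]
    if op == "eq":
        _, col, value = f
        # Quote string literals; leave numbers bare.
        lit = "'%s'" % value if isinstance(value, str) else str(value)
        return "%s = %s" % (col, lit)
    if op in ("and", "or"):
        _, left, right = f
        # Parenthesize so nested AND/OR keep the tree's precedence.
        return "(%s %s %s)" % (compile_filter(left), op.upper(), compile_filter(right))
    raise ValueError("unsupported filter: %r" % (op,))
```

Without an And case, a compiler like this (or Spark's real one) must drop the whole Or branch and evaluate it post-scan, which is exactly the gap the issue describes.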
[jira] [Assigned] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12409: Assignee: (was: Apache Spark) > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Priority: Minor > > For simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > The filters pushed down to JDBC layers are EqualTo(THEID,1), > EqualTo(Name,fred). These are handled OK by the current code. > For query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" , > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))) > So need to add And filter in JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12350) VectorAssembler#transform() initially throws an exception
[ https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12350. Resolution: Fixed Assignee: Marcelo Vanzin (was: Apache Spark) Fix Version/s: 2.0.0 > VectorAssembler#transform() initially throws an exception > - > > Key: SPARK-12350 > URL: https://issues.apache.org/jira/browse/SPARK-12350 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell > Environment: sparkShell command from sbt >Reporter: Jakob Odersky >Assignee: Marcelo Vanzin > Fix For: 2.0.0 > > > Calling VectorAssembler.transform() initially throws an exception, subsequent > calls work. > h3. Steps to reproduce > In spark-shell, > 1. Create a dummy dataframe and define an assembler > {code} > import org.apache.spark.ml.feature.VectorAssembler > val df = sc.parallelize(List((1,2), (3,4))).toDF > val assembler = new VectorAssembler().setInputCols(Array("_1", > "_2")).setOutputCol("features") > {code} > 2. Run > {code} > assembler.transform(df).show > {code} > Initially the following exception is thrown: > {code} > 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream > /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request > from /9.72.139.102:60610 > java.lang.IllegalArgumentException: requirement failed: File not found: > /classes/org/apache/spark/sql/catalyst/expressions/Object.class > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60) > at > org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > Subsequent calls work: > {code} > +---+---+-+ > 
| _1| _2| features| > +---+---+-+ > | 1| 2|[1.0,2.0]| > | 3| 4|[3.0,4.0]| > +---+---+-+ > {code} > It seems as though there is some internal state that is not initialized. > [~iyounus] originally found this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr
[ https://issues.apache.org/jira/browse/SPARK-11619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11619. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 9981 [https://github.com/apache/spark/pull/9981] > cannot use UDTF in DataFrame.selectExpr > --- > > Key: SPARK-11619 > URL: https://issues.apache.org/jira/browse/SPARK-11619 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > Fix For: 2.0.0 > > > Currently, if a UDTF like `explode` or `json_tuple` is used in `DataFrame.selectExpr`, > it will be parsed into `UnresolvedFunction` first and then aliased with > `expr.prettyString`. However, a UDTF may need MultiAlias, so we will get an error > if we run: > {code} > val df = Seq((Map("1" -> 1), 1)).toDF("a", "b") > df.selectExpr("explode(a)").show() > {code} > [info] org.apache.spark.sql.AnalysisException: Expect multiple names given > for org.apache.spark.sql.catalyst.expressions.Explode, > [info] but only single name ''explode(a)' specified; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr
[ https://issues.apache.org/jira/browse/SPARK-11619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11619: - Assignee: Dilip Biswal > cannot use UDTF in DataFrame.selectExpr > --- > > Key: SPARK-11619 > URL: https://issues.apache.org/jira/browse/SPARK-11619 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Dilip Biswal >Priority: Minor > Fix For: 2.0.0 > > > Currently if use UDTF like `explode`, `json_tuple` in `DataFrame.selectExpr`, > it will be parsed into `UnresolvedFunction` first, and then alias it with > `expr.prettyString`. However, UDTF may need MultiAlias so we will get error > if we run: > {code} > val df = Seq((Map("1" -> 1), 1)).toDF("a", "b") > df.selectExpr("explode(a)").show() > {code} > [info] org.apache.spark.sql.AnalysisException: Expect multiple names given > for org.apache.spark.sql.catalyst.expressions.Explode, > [info] but only single name ''explode(a)' specified; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12353) wrong output for countByValue and countByValueAndWindow
[ https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12353: Assignee: Apache Spark > wrong output for countByValue and countByValueAndWindow > --- > > Key: SPARK-12353 > URL: https://issues.apache.org/jira/browse/SPARK-12353 > Project: Spark > Issue Type: Bug > Components: Documentation, Input/Output, PySpark, Streaming >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04, Python 2.7.6 >Reporter: Bo Jin >Assignee: Apache Spark > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > http://stackoverflow.com/q/34114585/4698425 > In PySpark Streaming, the functions countByValue and countByValueAndWindow return > a single number (the count of distinct elements) instead of a list > of (k, v) pairs. > This is inconsistent with the documentation: > countByValue: When called on a DStream of elements of type K, return a new > DStream of (K, Long) pairs where the value of each key is its frequency in > each RDD of the source DStream. > countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a > new DStream of (K, Long) pairs where the value of each key is its frequency > within a sliding window. Like in reduceByKeyAndWindow, the number of reduce > tasks is configurable through an optional argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
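The documented contract — (K, Long) frequency pairs per batch, not a single distinct count — is ordinary frequency counting. A minimal Python illustration of the expected semantics using `collections.Counter` (this is not the PySpark Streaming implementation; `count_by_value` is an illustrative name, and the sort is only for deterministic output):

```python
from collections import Counter

def count_by_value(batch):
    """Return (value, frequency) pairs for one batch, matching the
    documented DStream.countByValue contract rather than a single
    distinct-element count."""
    return sorted(Counter(batch).items())
```

For the batch `["a", "b", "a"]` the documented behavior yields pairs like `("a", 2)` and `("b", 1)`, whereas the reported bug would return only the number of distinct values.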
[jira] [Assigned] (SPARK-12353) wrong output for countByValue and countByValueAndWindow
[ https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12353: Assignee: (was: Apache Spark) > wrong output for countByValue and countByValueAndWindow > --- > > Key: SPARK-12353 > URL: https://issues.apache.org/jira/browse/SPARK-12353 > Project: Spark > Issue Type: Bug > Components: Documentation, Input/Output, PySpark, Streaming >Affects Versions: 1.5.2 > Environment: Ubuntu 14.04, Python 2.7.6 >Reporter: Bo Jin > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > http://stackoverflow.com/q/34114585/4698425 > In PySpark Streaming, the functions countByValue and countByValueAndWindow return > a single number (the count of distinct elements) instead of a list > of (k, v) pairs. > This is inconsistent with the documentation: > countByValue: When called on a DStream of elements of type K, return a new > DStream of (K, Long) pairs where the value of each key is its frequency in > each RDD of the source DStream. > countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a > new DStream of (K, Long) pairs where the value of each key is its frequency > within a sliding window. Like in reduceByKeyAndWindow, the number of reduce > tasks is configurable through an optional argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12425) DStream union optimisation
Guillaume Poulin created SPARK-12425: Summary: DStream union optimisation Key: SPARK-12425 URL: https://issues.apache.org/jira/browse/SPARK-12425 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Guillaume Poulin Priority: Minor Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. However, using `PartitionerAwareUnionRDD` when possible would yield better performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
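The optimisation amounts to a dispatch decision: a partitioner-aware union is only valid when every input RDD shares the same, non-null partitioner; otherwise the plain union (which discards partitioning) must be used. A hedged sketch of that decision, with strings standing in for the two RDD classes (not the actual DStream/RDD code; `choose_union_strategy` is an illustrative name):

```python
def choose_union_strategy(partitioners):
    """Pick a union strategy for a list of inputs' partitioners.

    PartitionerAwareUnionRDD preserves the shared partitioner (so the
    result needs no reshuffle); it is only safe when all inputs agree
    on one non-None partitioner.
    """
    first = partitioners[0] if partitioners else None
    if first is not None and all(p == first for p in partitioners):
        return "PartitionerAwareUnionRDD"
    return "UnionRDD"
```

Keeping the partitioner means downstream key-based operations (joins, reduceByKey) can skip a shuffle, which is where the performance win comes from.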
[jira] [Commented] (SPARK-12425) DStream union optimisation
[ https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064376#comment-15064376 ] Apache Spark commented on SPARK-12425: -- User 'gpoulin' has created a pull request for this issue: https://github.com/apache/spark/pull/10382 > DStream union optimisation > -- > > Key: SPARK-12425 > URL: https://issues.apache.org/jira/browse/SPARK-12425 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Guillaume Poulin >Priority: Minor > > Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. > However, using `PartitionerAwareUnionRDD` when possible would yield better > performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12054) Consider nullable in codegen
[ https://issues.apache.org/jira/browse/SPARK-12054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12054. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10333 [https://github.com/apache/spark/pull/10333] > Consider nullable in codegen > > > Key: SPARK-12054 > URL: https://issues.apache.org/jira/browse/SPARK-12054 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Currently, we always check the nullability for results of expressions, we > could skip that if the expression is not nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12391) JDBC OR operator push down
[ https://issues.apache.org/jira/browse/SPARK-12391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12391: - Target Version/s: (was: 1.6.0) > JDBC OR operator push down > -- > > Key: SPARK-12391 > URL: https://issues.apache.org/jira/browse/SPARK-12391 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > For SQL OR operator such as > SELECT * > FROM table_name > WHERE column_name1 = value1 OR column_name2 = value2 > Will push down to JDBC datasource -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12409: - Target Version/s: (was: 1.6.0) > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Priority: Minor > > For simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > The filters pushed down to JDBC layers are EqualTo(THEID,1), > EqualTo(Name,fred). These are handled OK by the current code. > For query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" , > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))) > So need to add And filter in JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12335) CentralMomentAgg should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12335: -- Assignee: Davies Liu (was: Apache Spark) > CentralMomentAgg should be nullable > --- > > Key: SPARK-12335 > URL: https://issues.apache.org/jira/browse/SPARK-12335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > > According to the {{getStatistics}} method overridden in all its subclasses, > {{CentralMomentAgg}} should be nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12335) CentralMomentAgg should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12335. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > CentralMomentAgg should be nullable > --- > > Key: SPARK-12335 > URL: https://issues.apache.org/jira/browse/SPARK-12335 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > > According to the {{getStatistics}} method overridden in all its subclasses, > {{CentralMomentAgg}} should be nullable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12336) Outer join using multiple columns results in wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12336: -- Assignee: Davies Liu (was: Cheng Lian) > Outer join using multiple columns results in wrong nullability > -- > > Key: SPARK-12336 > URL: https://issues.apache.org/jira/browse/SPARK-12336 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > > When joining two DataFrames using multiple columns, a temporary inner join is > used to compute join output. Then a real join operator is created and > projected. However, the final projection list is based on the inner join > rather than real join operator. When the real join operator is an outer join, > nullability of the final projection can be wrong, since outer join may alter > nullability of its child plan(s). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12336) Outer join using multiple columns results in wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12336. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > Outer join using multiple columns results in wrong nullability > -- > > Key: SPARK-12336 > URL: https://issues.apache.org/jira/browse/SPARK-12336 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > > When joining two DataFrames using multiple columns, a temporary inner join is > used to compute join output. Then a real join operator is created and > projected. However, the final projection list is based on the inner join > rather than real join operator. When the real join operator is an outer join, > nullability of the final projection can be wrong, since outer join may alter > nullability of its child plan(s). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
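Editor's note: the nullability point in SPARK-12336 can be seen without Spark at all. In the pure-Python sketch below (not Spark code), a full outer join produces None for keys missing on one side, so a projection whose types were taken from the temporary inner join would wrongly be marked non-nullable.

```python
# Pure-Python illustration of why an outer join changes nullability:
# neither input contains None, yet the joined rows do.
left = {1: "a", 2: "b"}
right = {2: "x", 3: "y"}

def full_outer_join(l, r):
    keys = sorted(set(l) | set(r))
    return [(k, l.get(k), r.get(k)) for k in keys]

rows = full_outer_join(left, right)
print(rows)  # None appears on the unmatched sides
```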
[jira] [Assigned] (SPARK-12342) Corr (Pearson correlation) should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12342: -- Assignee: Davies Liu (was: Cheng Lian) > Corr (Pearson correlation) should be nullable > - > > Key: SPARK-12342 > URL: https://issues.apache.org/jira/browse/SPARK-12342 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12341) The "comment" field of DESCRIBE result set should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12341: -- Assignee: Davies Liu (was: Apache Spark) > The "comment" field of DESCRIBE result set should be nullable > - > > Key: SPARK-12341 > URL: https://issues.apache.org/jira/browse/SPARK-12341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12341) The "comment" field of DESCRIBE result set should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12341. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > The "comment" field of DESCRIBE result set should be nullable > - > > Key: SPARK-12341 > URL: https://issues.apache.org/jira/browse/SPARK-12341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12342) Corr (Pearson correlation) should be nullable
[ https://issues.apache.org/jira/browse/SPARK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12342. Resolution: Fixed Fix Version/s: 2.0.0 https://github.com/apache/spark/pull/10333 > Corr (Pearson correlation) should be nullable > - > > Key: SPARK-12342 > URL: https://issues.apache.org/jira/browse/SPARK-12342 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Davies Liu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12426) Docker JDBC integration tests are failing again
Mark Grover created SPARK-12426: --- Summary: Docker JDBC integration tests are failing again Key: SPARK-12426 URL: https://issues.apache.org/jira/browse/SPARK-12426 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.6.0 Reporter: Mark Grover The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to be failing again on my machine (Ubuntu Precise). This was the same box that I tested my previous commit on. Also, I am not confident this failure has much to do with Spark, since a well known commit where the tests were passing, fails now, in the same environment. [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing on his Ubuntu 15 box as well. Here's the error, fyi: {code} 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 
*** RUN ABORTED *** com.spotify.docker.client.DockerException: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.IOException: No such file or directory at com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141) at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082) at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) ... 
Cause: java.util.concurrent.ExecutionException: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.IOException: No such file or directory at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) at jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) at com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080) at com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) ... 
Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: java.io.IOException: No such file or directory at org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481) at org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110) at jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50) at jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37) at org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:487) 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. at org.glassfish.jersey.client.ClientRuntime$2.run(ClientRuntime.java:177) ... Cause: java.io.IOException: No such file or directory at jnr.unixsocket.UnixSocketChannel.doConnect(UnixSocketChannel.java:94) at jnr.unixsocket.UnixSocketChannel.connect(UnixSocketChannel.java:102) at com.spotify.docker.client.ApacheUnixSocket.connect(ApacheUnixSocket.java:73) at com.spotify.docker.client.UnixConnectionSocketFactory.c
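Editor's note: the root cause buried in the trace above is "No such file or directory" while the Docker client connects to the daemon's unix socket. A cheap pre-flight check along these lines (the socket path shown is the conventional Linux default, an assumption, not something the suite currently does) would turn the opaque RUN ABORTED into an immediate, readable skip:

```python
# Hypothetical pre-flight check before starting Docker-based suites.
import os

def docker_socket_available(path="/var/run/docker.sock"):
    """Return True if the Docker daemon's unix socket exists."""
    return os.path.exists(path)

print(docker_socket_available("/no/such/socket"))  # False
```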
[jira] [Commented] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064447#comment-15064447 ] Apache Spark commented on SPARK-7142: - User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/10383 > Minor enhancement to BooleanSimplification Optimizer rule > - > > Key: SPARK-7142 > URL: https://issues.apache.org/jira/browse/SPARK-7142 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yash Datta >Assignee: Yash Datta >Priority: Minor > Fix For: 1.6.0 > > > Add simplification using these rules : > A and (not(A) or B) => A and B > not(A and B) => not(A) or not(B) > not(A or B) => not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
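Editor's note: the three rewrite rules in SPARK-7142 (an absorption-style rule plus both De Morgan laws) are ordinary logical equivalences, which a brute-force truth-table check confirms. This is an illustration, not Spark's optimizer code:

```python
# Verify the BooleanSimplification rules hold for every truth assignment:
#   A and (not A or B)  =>  A and B
#   not (A and B)       =>  (not A) or (not B)
#   not (A or B)        =>  (not A) and (not B)
from itertools import product

for a, b in product([False, True], repeat=2):
    assert (a and (not a or b)) == (a and b)
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))
print("all three rules hold for every assignment")
```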
[jira] [Assigned] (SPARK-12218) Invalid splitting of nested AND expressions in Data Source filter API
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-12218: Assignee: Yin Huai > Invalid splitting of nested AND expressions in Data Source filter API > - > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.5.3, 1.6.0, 2.0.0 > > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12218) Invalid splitting of nested AND expressions in Data Source filter API
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12218. -- Resolution: Fixed Fix Version/s: 1.6.0 1.5.3 2.0.0 Issue resolved by pull request 10362 [https://github.com/apache/spark/pull/10362] > Invalid splitting of nested AND expressions in Data Source filter API > - > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Priority: Blocker > Fix For: 2.0.0, 1.5.3, 1.6.0 > > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
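Editor's note: the two queries in SPARK-12218 are equivalent by De Morgan, so they must return the same count; the bug class is splitting AND conjuncts that sit *under* a NOT as if they were top-level. A simplified plain-Python model (hypothetical data, not Spark code):

```python
# A data source may only split AND conjuncts at the TOP level of a predicate.
# Splitting the AND inside NOT(a AND b) as NOT(a) AND NOT(b) changes results.
rows = [
    {"p": 0, "r": "PaidOff"},
    {"p": 0, "r": "Other"},
    {"p": 5, "r": "PaidOff"},
    {"p": 5, "r": "Other"},
]
pred_a = lambda row: row["p"] == 0
pred_b = lambda row: row["r"] == "PaidOff"

correct = [r for r in rows if not (pred_a(r) and pred_b(r))]
# Invalid split: NOT distributed over AND without flipping it to OR.
buggy = [r for r in rows if (not pred_a(r)) and (not pred_b(r))]

print(len(correct), len(buggy))  # 3 1 -- different answers, as in the JIRA
```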
[jira] [Commented] (SPARK-12372) Document limitations of MLlib local linear algebra
[ https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064476#comment-15064476 ] Christos Iraklis Tsatsoulis commented on SPARK-12372: - You are very welcome > Document limitations of MLlib local linear algebra > -- > > Key: SPARK-12372 > URL: https://issues.apache.org/jira/browse/SPARK-12372 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Christos Iraklis Tsatsoulis > > This JIRA is now for documenting limitations of MLlib's local linear algebra > types. Basically, we should make it clear in the user guide that they > provide simple functionality but are not a full-fledged local linear library. > We should also recommend libraries for users to use in the meantime: > probably Breeze for Scala (and Java?) and numpy/scipy for Python. > *Original JIRA title*: Unary operator "-" fails for MLlib vectors > *Original JIRA text, as an example of the need for better docs*: > Consider the following snippet in pyspark 1.5.2: > {code:none} > >>> from pyspark.mllib.linalg import Vectors > >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> x > DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> -x > Traceback (most recent call last): > File "", line 1, in > TypeError: func() takes exactly 2 arguments (1 given) > >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> y > DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> x-y > DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0]) > >>> -y+x > Traceback (most recent call last): > File "", line 1, in > TypeError: func() takes exactly 2 arguments (1 given) > >>> -1*x > DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0]) > {code} > Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors > for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} > behaves as expected. 
> The last operation, {{-1*x}}, although mathematically "correct", includes > minus signs for the zero entries, which again is normally not expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
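Editor's note: the failing {{-x}} above is Python's unary minus, which dispatches to {{\_\_neg\_\_}}. The minimal stand-in class below (NOT pyspark's DenseVector; the docs ultimately recommend numpy/scipy for real local linear algebra) shows that once {{\_\_neg\_\_}} is defined, expressions like {{-y+x}} work:

```python
# Minimal pure-Python vector showing the missing operator hook.
class MiniVector:
    def __init__(self, values):
        self.values = list(values)
    def __neg__(self):                 # makes unary "-" work
        return MiniVector([-v for v in self.values])
    def __add__(self, other):
        return MiniVector([a + b for a, b in zip(self.values, other.values)])
    def __sub__(self, other):
        return MiniVector([a - b for a, b in zip(self.values, other.values)])
    def __repr__(self):
        return f"MiniVector({self.values})"

x = MiniVector([0.0, 1.0, 0.0, 7.0, 0.0])
y = MiniVector([2.0, 0.0, 3.0, 4.0, 5.0])
print((-y + x).values)  # [-2.0, 1.0, -3.0, 3.0, -5.0]
```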
[jira] [Created] (SPARK-12427) spark builds filling up jenkins' disk
shane knapp created SPARK-12427: --- Summary: spark builds filling up jenkins' disk Key: SPARK-12427 URL: https://issues.apache.org/jira/browse/SPARK-12427 Project: Spark Issue Type: Bug Components: Build Reporter: shane knapp Priority: Critical problem summary: a few spark builds are filling up the jenkins master's disk with millions of little log files as build artifacts. currently, we have a raid10 array set up with 5.4T of storage. we're currently using 4.0T, 99.9% of which is spark unit test and junit logs. the worst offenders, with more than 100G of disk usage per job, are: 193G./Spark-1.6-Maven-with-YARN 194G./Spark-1.5-Maven-with-YARN 205G./Spark-1.6-Maven-pre-YARN 216G./Spark-1.5-Maven-pre-YARN 387G./Spark-Master-Maven-with-YARN 420G./Spark-Master-Maven-pre-YARN 520G./Spark-1.6-SBT 733G./Spark-1.5-SBT 812G./Spark-Master-SBT i have attached a full report w/all builds listed as well. each of these builds is keeping their build history for 90 days. keep in mind that for each new matrix build, we're looking at another 200-500G for the SBT/pre-YARN/with-YARN jobs. a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk usage. on the hardware config side, we can move from raid10 to raid 5 and get ~3T additional storage. if we ditch raid altogether and put in bigger disks, we can get a total of 16-20T storage on master. another option is to have an NFS mount to a deep storage server. all of these options will require significant downtime. questions: * can we lower the number of days that we keep build information? * there are other options in jenkins that we can set as well: max number of builds to keep, max # days to keep artifacts, max # of builds to keep w/artifacts * can we make the junit and unit test logs smaller (probably not) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
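Editor's note: the first question (shorter retention) is easy to size. Assuming log volume scales roughly linearly with retention days, cutting the 90-day window to the two weeks floated in the comments shrinks the 4.0T of logs by about 85%:

```python
# Back-of-the-napkin retention estimate; linear scaling is an assumption.
current_tb, current_days, new_days = 4.0, 90, 14
estimate = current_tb * new_days / current_days
print(f"{estimate:.2f}T")  # ~0.62T
```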
[jira] [Updated] (SPARK-12427) spark builds filling up jenkins' disk
[ https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-12427: Attachment: jenkins_disk_usage.txt > spark builds filling up jenkins' disk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12427) spark builds filling up jenkins' disk
[ https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-12427: Attachment: graph.png disk usage over the past year, for lols. > spark builds filling up jenkins' disk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk
[ https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064568#comment-15064568 ] Sean Owen commented on SPARK-12427: --- I doubt we really need build history for more than a week or two. Does reducing to 2 weeks help enough to keep out of trouble for a while? If the next major release is 2.0, and it drops support for most old Hadoop variations, at least we have no more separate pre/post YARN builds. > spark builds filling up jenkins' disk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk
[ https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064572#comment-15064572 ] shane knapp commented on SPARK-12427: - [~joshrosen] > spark builds filling up jenkins' disk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12428) Write a script to run all PySpark MLlib examples for testing
holdenk created SPARK-12428: --- Summary: Write a script to run all PySpark MLlib examples for testing Key: SPARK-12428 URL: https://issues.apache.org/jira/browse/SPARK-12428 Project: Spark Issue Type: Sub-task Components: PySpark, Tests Reporter: holdenk See parent for design sketch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
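Editor's note: a hedged sketch of what such a runner might look like: collect the example scripts and submit each one, failing fast on a non-zero exit. The glob pattern and the use of {{spark-submit}} are assumptions for illustration, not the layout decided in the parent ticket.

```python
# Hypothetical runner for the PySpark MLlib examples.
import glob
import subprocess
import sys

def collect_examples(pattern="examples/src/main/python/mllib/*.py"):
    """Gather example scripts; the pattern is an assumed layout."""
    return sorted(glob.glob(pattern))

def run_all(paths, runner="spark-submit", dry_run=False):
    """Run each example, aborting on the first failure."""
    commands = [[runner, p] for p in paths]
    if dry_run:
        return commands  # let tests inspect the commands without Spark
    for cmd in commands:
        if subprocess.call(cmd) != 0:
            sys.exit(f"example failed: {cmd[1]}")
    return commands

print(run_all(["a.py", "b.py"], dry_run=True))
```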
[jira] [Commented] (SPARK-12428) Write a script to run all PySpark MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064610#comment-15064610 ] holdenk commented on SPARK-12428: - I can start working on this a bit over the holidays :) > Write a script to run all PySpark MLlib examples for testing > > > Key: SPARK-12428 > URL: https://issues.apache.org/jira/browse/SPARK-12428 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Reporter: holdenk > > See parent for design sketch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk
[ https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064628#comment-15064628 ] shane knapp commented on SPARK-12427: - if we NEED to store for longer than 2 weeks, we can absolutely rejigger storage. > spark builds filling up jenkins' disk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
Shixiong Zhu created SPARK-12429: Summary: Update documentation to show how to use accumulators and broadcasts with Spark Streaming Key: SPARK-12429 URL: https://issues.apache.org/jira/browse/SPARK-12429 Project: Spark Issue Type: Documentation Components: Documentation, Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu Accumulators and broadcasts do not work reliably with Spark Streaming when the driver restarts after a failure. We need to add some examples to guide users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064664#comment-15064664 ] Matt Pollock commented on SPARK-6817: - Will this only support UDFs that operate on a full DataFrame? A solution to operate on columns would perhaps be more useful. E.g., being able to use R package functions within filter and mutate > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12429: Assignee: Apache Spark (was: Shixiong Zhu) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Accumulators and broadcasts do not work reliably with Spark Streaming when > the driver restarts after a failure. We need to add some examples to guide > users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12429: Assignee: Shixiong Zhu (was: Apache Spark) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Accumulators and broadcasts do not work reliably with Spark Streaming when > the driver restarts after a failure. We need to add some examples to guide > users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
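One pattern that addresses the restart problem described in SPARK-12429 is a lazily-instantiated singleton, so the accumulator is re-created on first use after a driver restart instead of being restored from a stale checkpoint. The sketch below is illustrative only: the `Accumulator` stand-in and the `get_dropped_words_counter` name are assumptions chosen so it runs without Spark.

```python
import threading

class Accumulator:
    """Minimal stand-in for a Spark accumulator (illustrative, not Spark's API)."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

_lock = threading.Lock()
_instance = None

def get_dropped_words_counter():
    """Create the accumulator on first use (double-checked locking).

    After a driver restart the process is fresh, so _instance is None again
    and a new accumulator gets registered, rather than the code trying to
    restore one from checkpointed state.
    """
    global _instance
    if _instance is None:
        with _lock:
            if _instance is None:
                _instance = Accumulator()
    return _instance
```

Streaming code would then call the getter inside each output operation rather than capturing a reference at setup time, so recovered batches still find a live accumulator.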
[jira] [Created] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
Fede Bar created SPARK-12430: Summary: Temporary folders do not get deleted after Task completes causing problems with disk space. Key: SPARK-12430 URL: https://issues.apache.org/jira/browse/SPARK-12430 Project: Spark Issue Type: Bug Components: Block Manager, Shuffle, Spark Submit Affects Versions: 1.5.2, 1.5.1 Environment: Ubuntu server Reporter: Fede Bar Fix For: 1.4.1 We are experiencing an issue with automatic /tmp folder deletion after the framework completes. Completing an M/R job using Spark 1.5.2 (same behavior as Spark 1.5.1) over Mesos will not delete some temporary folders, causing free disk space on the server to be exhausted. Behavior of an M/R job using Spark 1.4.1 over a Mesos cluster: - Launched using spark-submit on one cluster node. - The following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , */tmp/spark-#/blockmgr-#* - When the task is completed, */tmp/spark-#/* gets deleted along with its */tmp/spark-#/blockmgr-#* sub-folder. Behavior of an M/R job using Spark 1.5.2 over a Mesos cluster (same identical job): - Launched using spark-submit on one cluster node. - The following folders are created: */tmp/mesos/mesos/slaves/id** * , */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} - When the task is completed, */tmp/spark-***/ * gets deleted but NOT the shuffle container folder {color:red} /tmp/blockmgr-***{color} Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several GB depending on the job that ran. Over time this causes the disk to fill up, with consequences that we all know. Running a shell script would probably work, but it is difficult to tell folders in use by a running M/R job apart from stale ones. I did notice similar issues opened by other users marked as "resolved", but none seems to exactly match the above behavior. I really hope someone has insights on how to fix it. Thank you very much! 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
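As a stop-gap for the cleanup gap described in SPARK-12430, an operator-side sweep can at least surface candidates for removal. The sketch below is an assumption-laden workaround, not a fix: it lists `blockmgr-*` directories that have not been modified for a cutoff period, and since age alone cannot prove a directory is unused, it only reports paths rather than deleting anything.

```python
import os
import time

def stale_blockmgr_dirs(tmp="/tmp", max_age_s=24 * 3600, now=None):
    """Return blockmgr-* directories under tmp untouched for max_age_s seconds.

    The 24h default cutoff is an arbitrary heuristic; tune it to be longer
    than the longest job you run, and cross-check with lsof/fuser before
    actually removing anything.
    """
    now = time.time() if now is None else now
    stale = []
    for name in os.listdir(tmp):
        path = os.path.join(tmp, name)
        if name.startswith("blockmgr-") and os.path.isdir(path):
            if now - os.path.getmtime(path) > max_age_s:
                stale.append(path)
    return sorted(stale)
```

A cron job could feed the returned list to a review step or a guarded `rm -rf`, but the real fix belongs in Spark's shutdown path.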
[jira] [Assigned] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12409: Assignee: Apache Spark > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > For a simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > the filters pushed down to the JDBC layer are EqualTo(THEID,1), > EqualTo(Name,fred). These are handled OK by the current code. > For a query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))), > so we need to add an And filter in the JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064742#comment-15064742 ] Apache Spark commented on SPARK-12409: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/10386 > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Priority: Minor > > For a simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > the filters pushed down to the JDBC layer are EqualTo(THEID,1), > EqualTo(Name,fred). These are handled OK by the current code. > For a query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))), > so we need to add an And filter in the JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
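The And support described in SPARK-12409 amounts to compiling a nested filter tree into a SQL WHERE fragment. The following is a hypothetical, standalone sketch of that recursion: the class names only mirror org.apache.spark.sql.sources, and this is not Spark's actual JDBC code.

```python
from dataclasses import dataclass

class Filter:
    """Base class for the (simplified) filter tree."""
    pass

@dataclass
class EqualTo(Filter):
    attr: str
    value: object

@dataclass
class And(Filter):
    left: Filter
    right: Filter

@dataclass
class Or(Filter):
    left: Filter
    right: Filter

def quote(v):
    # Strings get SQL quotes; a real implementation must also escape
    # embedded quotes to avoid injection.
    return "'{}'".format(v) if isinstance(v, str) else str(v)

def compile_filter(f):
    """Return a WHERE fragment, or None if any sub-filter is unsupported.

    Failing the whole tree (instead of silently dropping a branch) keeps the
    pushed-down predicate correct: Spark then re-evaluates the filter itself
    rather than trusting an incomplete WHERE clause.
    """
    if isinstance(f, EqualTo):
        return "{} = {}".format(f.attr, quote(f.value))
    if isinstance(f, And):
        l, r = compile_filter(f.left), compile_filter(f.right)
        return "({} AND {})".format(l, r) if l and r else None
    if isinstance(f, Or):
        l, r = compile_filter(f.left), compile_filter(f.right)
        return "({} OR {})".format(l, r) if l and r else None
    return None
```

For the query in the issue description, `compile_filter(Or(EqualTo("THEID", 1), And(EqualTo("NAME", "mary"), EqualTo("THEID", 2))))` yields `(THEID = 1 OR (NAME = 'mary' AND THEID = 2))`.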
[jira] [Assigned] (SPARK-12425) DStream union optimisation
[ https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12425: Assignee: (was: Apache Spark) > DStream union optimisation > -- > > Key: SPARK-12425 > URL: https://issues.apache.org/jira/browse/SPARK-12425 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Guillaume Poulin >Priority: Minor > > Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. > However, using `PartitionerAwareUnionRDD` when possible would yield better > performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
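The decision SPARK-12425 proposes can be sketched as: use a partitioner-aware union only when every input RDD defines the same partitioner, since only then can the result preserve that partitioner and avoid a later shuffle. This standalone sketch uses illustrative names, not Spark's API:

```python
def choose_union(partitioners):
    """Pick a union strategy from each input RDD's partitioner (None if absent).

    PartitionerAwareUnionRDD is only safe when all inputs agree on a single
    partitioner; in every other case fall back to the plain UnionRDD.
    """
    if partitioners \
            and all(p is not None for p in partitioners) \
            and len(set(partitioners)) == 1:
        return "PartitionerAwareUnionRDD"
    return "UnionRDD"
```

For instance, three streams all hash-partitioned the same way would take the partitioner-aware path, while mixing a partitioned RDD with an unpartitioned one falls back to the plain union.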
[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties
[ https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064744#comment-15064744 ] Jo Voordeckers commented on SPARK-11327: This PR is now superseded by this one against master: https://github.com/apache/spark/pull/10370 > spark-dispatcher doesn't pass along some spark properties > - > > Key: SPARK-11327 > URL: https://issues.apache.org/jira/browse/SPARK-11327 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Alan Braithwaite > > I haven't figured out exactly what's going on yet, but there's something in > the spark-dispatcher which is failing to pass along properties to the > spark-driver when using spark-submit in a clustered mesos docker environment. > Most importantly, it's not passing along spark.mesos.executor.docker.image... > cli: > {code} > docker run -t -i --rm --net=host > --entrypoint=/usr/local/spark/bin/spark-submit > docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf > spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master > mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster > --properties-file /usr/local/spark/conf/spark-defaults.conf --class > com.example.spark.streaming.MyApp > http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 > spark-testing my-stream 40 > {code} > submit output: > {code} > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch > an application in mesos://compute1.example.com:31262. 
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server > at http://compute1.example.com:31262/v1/submissions/create: > { > "action" : "CreateSubmissionRequest", > "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ], > "appResource" : "http://jarserver.example.com:8000/sparkapp.jar", > "clientSparkVersion" : "1.5.0", > "environmentVariables" : { > "SPARK_SCALA_VERSION" : "2.10", > "SPARK_CONF_DIR" : "/usr/local/spark/conf", > "SPARK_HOME" : "/usr/local/spark", > "SPARK_ENV_LOADED" : "1" > }, > "mainClass" : "com.example.spark.streaming.MyApp", > "sparkProperties" : { > "spark.serializer" : "org.apache.spark.serializer.KryoSerializer", > "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : > "/usr/local/lib/libmesos.so", > "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs", > "spark.eventLog.enabled" : "true", > "spark.driver.maxResultSize" : "0", > "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER", > "spark.mesos.deploy.zookeeper.url" : > "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181", > "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar", > "spark.driver.supervise" : "false", > "spark.app.name" : "com.example.spark.streaming.MyApp", > "spark.driver.memory" : "8G", > "spark.logConf" : "true", > "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher", > "spark.mesos.executor.docker.image" : > "docker.example.com/spark-prod:2015.10.2", > "spark.submit.deployMode" : "cluster", > "spark.master" : "mesos://compute1.example.com:31262", > "spark.executor.memory" : "8G", > "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs", > "spark.mesos.docker.executor.network" : "HOST", > "spark.mesos.executor.home" : "/usr/local/spark" > } > } > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", 
> "success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created > as driver-20151026220353-0011. Polling submission state... > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20151026220353-0011 in > mesos://compute1.example.com:31262. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server > at > http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "SubmissionStatusResponse", > "driverState" : "QUEUED", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver > driver-20151026220353-0011 is now QUEUED. > 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with > CreateSubmissionResponse: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > {code} >
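The hand-off the SPARK-11327 reporter expects amounts to merging the submission's `sparkProperties` over the dispatcher's cluster defaults when building the driver's launch command, so settings such as `spark.mesos.executor.docker.image` reach the driver. This is a purely illustrative sketch of that merge, not the dispatcher's actual code:

```python
def driver_properties(cluster_defaults, submission_props):
    """Merge properties for the launched driver; submission values win.

    Dropping either source on the floor reproduces the reported bug, where
    properties visible in the CreateSubmissionRequest never take effect.
    """
    merged = dict(cluster_defaults)
    merged.update(submission_props)
    return merged
```

With defaults `{"spark.executor.memory": "4G"}` and a submission carrying `{"spark.executor.memory": "8G", "spark.mesos.executor.docker.image": "..."}`, the driver would see memory `8G` and the docker image property intact.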
[jira] [Assigned] (SPARK-12409) JDBC AND operator push down
[ https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12409: Assignee: (was: Apache Spark) > JDBC AND operator push down > > > Key: SPARK-12409 > URL: https://issues.apache.org/jira/browse/SPARK-12409 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Huaxin Gao >Priority: Minor > > For a simple AND such as > select * from test where THEID = 1 AND NAME = 'fred', > the filters pushed down to the JDBC layer are EqualTo(THEID,1), > EqualTo(Name,fred). These are handled OK by the current code. > For a query such as > SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2, > the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))), > so we need to add an And filter in the JDBC layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12425) DStream union optimisation
[ https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12425: Assignee: Apache Spark > DStream union optimisation > -- > > Key: SPARK-12425 > URL: https://issues.apache.org/jira/browse/SPARK-12425 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Guillaume Poulin >Assignee: Apache Spark >Priority: Minor > > Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. > However, using `PartitionerAwareUnionRDD` when possible would yield better > performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
[ https://issues.apache.org/jira/browse/SPARK-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12365: -- Target Version/s: 2.0.0 (was: 1.6.1, 2.0.0) > Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called > > > Key: SPARK-12365 > URL: https://issues.apache.org/jira/browse/SPARK-12365 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Minor > Fix For: 2.0.0 > > > SPARK-9886 fixed call to Runtime.getRuntime.addShutdownHook() in > ExternalBlockStore.scala > This issue intends to address remaining usage of > Runtime.getRuntime.addShutdownHook() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
[ https://issues.apache.org/jira/browse/SPARK-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12365: -- Fix Version/s: (was: 1.6.1) > Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called > > > Key: SPARK-12365 > URL: https://issues.apache.org/jira/browse/SPARK-12365 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Minor > Fix For: 2.0.0 > > > SPARK-9886 fixed call to Runtime.getRuntime.addShutdownHook() in > ExternalBlockStore.scala > This issue intends to address remaining usage of > Runtime.getRuntime.addShutdownHook() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203280#comment-14203280 ] Nicholas Chammas edited comment on SPARK-3821 at 12/18/15 9:08 PM: --- After much dilly-dallying, I am happy to present: * A brief proposal / design doc ([fixed JIRA attachment | https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html], [md file on GitHub | https://github.com/nchammas/spark-ec2/blob/packer/image-build/proposal.md]) * [Initial implementation | https://github.com/nchammas/spark-ec2/tree/packer/image-build] and [README | https://github.com/nchammas/spark-ec2/blob/packer/image-build/README.md] * New AMIs generated by this implementation: [Base AMIs | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 Pre-Installed | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0] To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47] [two | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593] lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on the {{packer}} branch | https://github.com/nchammas/spark-ec2/tree/packer/image-build]. Your candid feedback and/or improvements are most welcome! 
was (Author: nchammas): After much dilly-dallying, I am happy to present: * A brief proposal / design doc ([fixed JIRA attachment | https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html], [md file on GitHub | https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md]) * [Initial implementation | https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md] * New AMIs generated by this implementation: [Base AMIs | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 Pre-Installed | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0] To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47] [two | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593] lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on the {{packer}} branch | https://github.com/nchammas/spark-ec2/tree/packer/packer]. Your candid feedback and/or improvements are most welcome! > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12431) add local checkpointing to GraphX
Edward Seidl created SPARK-12431: Summary: add local checkpointing to GraphX Key: SPARK-12431 URL: https://issues.apache.org/jira/browse/SPARK-12431 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.5.2 Reporter: Edward Seidl Local checkpointing was added to RDD to speed up iterative Spark jobs, but this capability hasn't been added to GraphX. Adding localCheckpoint to GraphImpl, EdgeRDDImpl, and VertexRDDImpl greatly improved the speed of a k-core algorithm I'm using (at the cost of fault tolerance, of course). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12404) Ensure objects passed to StaticInvoke is Serializable
[ https://issues.apache.org/jira/browse/SPARK-12404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12404. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 10357 [https://github.com/apache/spark/pull/10357] > Ensure objects passed to StaticInvoke is Serializable > - > > Key: SPARK-12404 > URL: https://issues.apache.org/jira/browse/SPARK-12404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Critical > Fix For: 1.6.0 > > > Now `StaticInvoke` receives Any as an object, and while `StaticInvoke` itself can be > serialized, sometimes the object passed in is not serializable. > For example, the following code raises an Exception because > RowEncoder#extractorsFor, invoked indirectly, creates a `StaticInvoke`. > {code} > case class TimestampContainer(timestamp: java.sql.Timestamp) > val rdd = sc.parallelize(1 to 2).map(_ => > TimestampContainer(System.currentTimeMillis)) > val df = rdd.toDF > val ds = df.as[TimestampContainer] > val rdd2 = ds.rdd // invokes extractorsFor indirectly > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org