[jira] [Commented] (SPARK-14365) Repartition by column
[ https://issues.apache.org/jira/browse/SPARK-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273664#comment-15273664 ]

Sun Rui commented on SPARK-14365:
---------------------------------

[~dselivanov] Could you verify whether SPARK-15110 solves your problem?

> Repartition by column
> ---------------------
>
>          Key: SPARK-14365
>          URL: https://issues.apache.org/jira/browse/SPARK-14365
>      Project: Spark
>   Issue Type: Improvement
>   Components: SparkR
>     Reporter: Dmitriy Selivanov
>
> Starting from 1.6 it is possible to set partitioning for data frames. For
> example, in Scala it can be done in the following way:
> {code}
> val partitioned = df.repartition($"k")
> {code}
> It would be nice to have this functionality in SparkR.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
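As background for the request above: `df.repartition($"k")` hash-partitions rows by the value of column `k`, so rows with equal keys land in the same partition. A minimal, hypothetical Java sketch of that idea (Spark's actual hash function differs; `Objects.hashCode` here is purely for illustration):

```java
import java.util.Objects;

// Hypothetical sketch of hash partitioning by a column value, the idea behind
// df.repartition($"k"): equal column values map to the same partition id.
public class HashPartitionByColumn {
    // Not Spark's actual hashing; Object.hashCode is an assumption for illustration.
    static int partitionFor(Object columnValue, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(Objects.hashCode(columnValue), numPartitions);
    }
}
```

The point of the SparkR request is exposing this same column-based placement from R, not a new partitioning scheme.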
[jira] [Closed] (SPARK-14365) Repartition by column
[ https://issues.apache.org/jira/browse/SPARK-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Rui closed SPARK-14365.
---------------------------
    Resolution: Duplicate
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273655#comment-15273655 ]

Sun Rui commented on SPARK-15159:
---------------------------------

[~felixcheung], I guess you are talking about SQLContext, not HiveContext. SQLContext is kept for backward compatibility, so we don't need to change it for now. HiveContext is deprecated, not removed. However, I don't think this is a big change. Two pieces only:
1. Modify SparkRHive.init() to use SparkSession;
2. Investigate whether we need to change the use of TestHiveContext in SparkR unit tests. A rough look suggests no change is needed, but I am not sure.
[~vsparmar] Feel free to take this JIRA.

> Remove usage of HiveContext in SparkR.
> --------------------------------------
>
>              Key: SPARK-15159
>              URL: https://issues.apache.org/jira/browse/SPARK-15159
>          Project: Spark
>       Issue Type: Sub-task
>       Components: SparkR
> Affects Versions: 1.6.1
>         Reporter: Sun Rui
>
> HiveContext is to be deprecated in 2.0. Replace its usage with
> SparkSession.withHiveSupport in SparkR.
[jira] [Commented] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273632#comment-15273632 ]

Apache Spark commented on SPARK-14476:
--------------------------------------

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/12947

> Show table name or path in string of DataSourceScan
> ---------------------------------------------------
>
>         Key: SPARK-14476
>         URL: https://issues.apache.org/jira/browse/SPARK-14476
>     Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>    Reporter: Davies Liu
>    Assignee: Cheng Lian
>    Priority: Critical
>
> Right now the string form of DataSourceScan is only "HadoopFiles xxx", without
> any information about the table name or path. Since we had that in 1.6, this
> is a regression of sorts.
[jira] [Commented] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273630#comment-15273630 ]

Sean Zhong commented on SPARK-14476:
------------------------------------

Regression of SPARK-12012
[jira] [Assigned] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14476:
------------------------------------
    Assignee: Cheng Lian  (was: Apache Spark)
[jira] [Assigned] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14476:
------------------------------------
    Assignee: Apache Spark  (was: Cheng Lian)
[jira] [Commented] (SPARK-15085) Rename current streaming-kafka artifact to include kafka version
[ https://issues.apache.org/jira/browse/SPARK-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273619#comment-15273619 ]

Apache Spark commented on SPARK-15085:
--------------------------------------

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/12946

> Rename current streaming-kafka artifact to include kafka version
> ----------------------------------------------------------------
>
>         Key: SPARK-15085
>         URL: https://issues.apache.org/jira/browse/SPARK-15085
>     Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>    Reporter: Cody Koeninger
>
> Since supporting Kafka 0.10 will likely need a separate artifact, rename the
> existing artifact now so that the minor breaking change is in place for
> Spark 2.0.
[jira] [Assigned] (SPARK-15085) Rename current streaming-kafka artifact to include kafka version
[ https://issues.apache.org/jira/browse/SPARK-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15085:
------------------------------------
    Assignee:  (was: Apache Spark)
[jira] [Assigned] (SPARK-15085) Rename current streaming-kafka artifact to include kafka version
[ https://issues.apache.org/jira/browse/SPARK-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15085:
------------------------------------
    Assignee: Apache Spark
[jira] [Updated] (SPARK-14809) R Examples: Check for new R APIs requiring example code in 2.0
[ https://issues.apache.org/jira/browse/SPARK-14809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14809:
--------------------------------------
    Assignee: Yanbo Liang

> R Examples: Check for new R APIs requiring example code in 2.0
> --------------------------------------------------------------
>
>         Key: SPARK-14809
>         URL: https://issues.apache.org/jira/browse/SPARK-14809
>     Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>    Reporter: Joseph K. Bradley
>    Assignee: Yanbo Liang
>    Priority: Minor
>
> Audit the list of new features added to MLlib's R API, and see which major
> items are missing example code (in the examples folder). We do not need
> examples for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature
>   (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related
>   to") and (b) to this JIRA ("requires").
> Note: This no longer includes Scala/Java/Python since those are covered under
> the user guide.
[jira] [Updated] (SPARK-15155) Optionally ignore default role resources
[ https://issues.apache.org/jira/browse/SPARK-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Heller updated SPARK-15155:
---------------------------------
    Description:
SPARK-6284 added support for Mesos roles, but the framework will still accept resources from both the reserved role specified in {{spark.mesos.role}} and the default role {{*}}.

I'd like to propose the addition of a new boolean property: {{spark.mesos.ignoreDefaultRoleResources}}. When this property is set, Spark will only accept resources from the role given in the {{spark.mesos.role}} property. If {{spark.mesos.role}} has not been set, {{spark.mesos.ignoreDefaultRoleResources}} has no effect.

  was:
SPARK-6284 added support for Mesos roles, but the framework will still accept resources from both the reserved role specified in {{spark.mesos.role}} and the default role {{*}}.

I'd like to propose the addition of a new property {{spark.mesos.acceptedResourceRoles}}, which would be a comma-delimited list of roles that the framework will accept resources from. This is similar to {{spark.mesos.constraints}}, except that constraints look at the attributes of an offer, while this looks at the role of a resource. By default {{spark.mesos.acceptedResourceRoles}} would be set to {{*[,spark.mesos.role]}}, giving exactly the same framework behavior when no value is specified.

> Optionally ignore default role resources
> ----------------------------------------
>
>              Key: SPARK-15155
>              URL: https://issues.apache.org/jira/browse/SPARK-15155
>          Project: Spark
>       Issue Type: Improvement
>       Components: Mesos
> Affects Versions: 1.5.0, 1.6.0
>         Reporter: Chris Heller
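The proposed semantics can be sketched in plain Java. The types below are hypothetical stand-ins (Mesos resources are really protobuf messages, and the real filtering lives in Spark's Mesos scheduler backend); the sketch only illustrates the two rules in the proposal: match on {{spark.mesos.role}}, and make the flag a no-op when no role is configured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of spark.mesos.ignoreDefaultRoleResources filtering.
// Resource is a stand-in for the Mesos protobuf resource type.
public class RoleFilter {
    static class Resource {
        final String name;
        final String role;
        Resource(String name, String role) { this.name = name; this.role = role; }
    }

    static List<Resource> acceptedResources(List<Resource> offered,
                                            String sparkMesosRole,
                                            boolean ignoreDefaultRoleResources) {
        // If spark.mesos.role is unset, the new flag has no effect (per the proposal).
        if (sparkMesosRole == null || !ignoreDefaultRoleResources) {
            return offered;
        }
        List<Resource> accepted = new ArrayList<>();
        for (Resource r : offered) {
            // Keeps only reserved-role resources; default-role "*" offers are dropped.
            if (sparkMesosRole.equals(r.role)) {
                accepted.add(r);
            }
        }
        return accepted;
    }
}
```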
[jira] [Updated] (SPARK-15155) Optionally ignore default role resources
[ https://issues.apache.org/jira/browse/SPARK-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Heller updated SPARK-15155:
---------------------------------
    Summary: Optionally ignore default role resources  (was: Selectively accept Mesos resources by role)
[jira] [Commented] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
[ https://issues.apache.org/jira/browse/SPARK-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273544#comment-15273544 ]

Apache Spark commented on SPARK-15171:
--------------------------------------

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/12945

> Deprecate registerTempTable and add dataset.createTempView
> ----------------------------------------------------------
>
>         Key: SPARK-15171
>         URL: https://issues.apache.org/jira/browse/SPARK-15171
>     Project: Spark
>  Issue Type: Bug
>    Reporter: Sean Zhong
>    Priority: Minor
>
> Our current dataset.registerTempTable does not actually materialize data, so
> it should be treated as creating a temp view. We can deprecate it and add a
> new method, dataset.createTempView(replaceIfExists: Boolean), whose default
> for replaceIfExists is false. registerTempTable will then call
> dataset.createTempView(replaceIfExists = true).
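The proposed semantics can be sketched with a plain registry (hypothetical names; the real implementation lives in Spark's session catalog and stores logical plans, not strings). The key point from the description: a temp view is only a name-to-plan mapping, so nothing is materialized, and {{replaceIfExists}} controls whether an existing name may be overwritten.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of createTempView(replaceIfExists) semantics.
public class TempViewRegistry {
    // The "plan" is stubbed as a String; Spark stores a logical plan here.
    private final Map<String, String> views = new HashMap<>();

    public void createTempView(String name, String plan, boolean replaceIfExists) {
        if (!replaceIfExists && views.containsKey(name)) {
            throw new IllegalStateException("Temp view already exists: " + name);
        }
        views.put(name, plan);
    }

    // The deprecated registerTempTable keeps its old overwrite behavior by delegating.
    public void registerTempTable(String name, String plan) {
        createTempView(name, plan, true);
    }

    public String lookup(String name) { return views.get(name); }
}
```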
[jira] [Assigned] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
[ https://issues.apache.org/jira/browse/SPARK-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15171:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
[ https://issues.apache.org/jira/browse/SPARK-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15171:
------------------------------------
    Assignee:  (was: Apache Spark)
[jira] [Commented] (SPARK-8428) TimSort Comparison method violates its general contract with CLUSTER BY
[ https://issues.apache.org/jira/browse/SPARK-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273534#comment-15273534 ]

Yi Zhou commented on SPARK-8428:
--------------------------------

We found a similar issue with Spark 1.6.1 in our larger data size test; details below. We then tried increasing spark.sql.shuffle.partitions to work around it.

{code}
CREATE TABLE q26_spark_sql_run_query_0_temp (
  cid BIGINT,
  id1 double, id2 double, id3 double, id4 double, id5 double,
  id6 double, id7 double, id8 double, id9 double, id10 double,
  id11 double, id12 double, id13 double, id14 double, id15 double
)

INSERT INTO TABLE q26_spark_sql_run_query_0_temp
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=12 THEN 1 ELSE NULL END) AS id12,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15
FROM store_sales ss
INNER JOIN item i ON (ss.ss_item_sk = i.i_item_sk
  AND i.i_category IN ('Books')
  AND ss.ss_customer_sk IS NOT NULL)
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
ORDER BY cid
{code}

{code}
16/05/05 14:50:03 WARN scheduler.TaskSetManager: Lost task 12.0 in stage 162.0 (TID 15153, node6): java.lang.IllegalArgumentException: Comparison method violates its general contract!
	at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794)
	at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
	at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
	at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
	at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
	at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:295)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:330)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at
{code}
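For context on the exception itself (this is background on the error message, not a claim about the root cause of the Spark bug above): TimSort raises "Comparison method violates its general contract!" when it detects an internally inconsistent comparator. One classic way a comparator becomes inconsistent is integer-subtraction overflow:

```java
import java.util.Comparator;

// Demonstrates (in plain Java, unrelated to Spark's internal sorter) how a
// comparator can violate the contract that TimSort checks: `a - b` overflows
// for extreme values, reversing the apparent ordering.
public class BrokenComparator {
    static final Comparator<Integer> BAD  = (a, b) -> a - b;  // overflows
    static final Comparator<Integer> GOOD = Integer::compare; // never overflows
}
```

With the broken comparator, `compare(Integer.MIN_VALUE, 1)` overflows to a positive value, so the comparator claims `MIN_VALUE > 1`, violating transitivity over the full range.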
[jira] [Commented] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
[ https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273529#comment-15273529 ]

Yin Huai commented on SPARK-15032:
----------------------------------

Can you explain more about "I think the problem is that it terminates the executionHive process"? I am not sure I understand this. Thanks!

> When we create a new JDBC session, we may need to create a new session of
> executionHive
> -------------------------------------------------------------------------
>
>              Key: SPARK-15032
>              URL: https://issues.apache.org/jira/browse/SPARK-15032
>          Project: Spark
>       Issue Type: Bug
>       Components: SQL
>         Reporter: Yin Huai
>         Priority: Critical
>
> Right now, we only use executionHive in the thriftserver. When we create a
> new JDBC session, we probably need to create a new session of executionHive.
> I am not sure what will break if we leave the code as is, but it feels safer
> to create a new session of executionHive.
[jira] [Created] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
Sean Zhong created SPARK-15171:
-------------------------------

    Summary: Deprecate registerTempTable and add dataset.createTempView
        Key: SPARK-15171
        URL: https://issues.apache.org/jira/browse/SPARK-15171
    Project: Spark
 Issue Type: Bug
   Reporter: Sean Zhong
   Priority: Minor
[jira] [Commented] (SPARK-14809) R Examples: Check for new R APIs requiring example code in 2.0
[ https://issues.apache.org/jira/browse/SPARK-14809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273492#comment-15273492 ]

Yanbo Liang commented on SPARK-14809:
-------------------------------------

I'm glad to help with this.
[jira] [Resolved] (SPARK-11395) Support over and window specification in SparkR
[ https://issues.apache.org/jira/browse/SPARK-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-11395.
-------------------------------------------
       Resolution: Fixed
         Assignee: Sun Rui
    Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/10094

> Support over and window specification in SparkR
> -----------------------------------------------
>
>              Key: SPARK-11395
>              URL: https://issues.apache.org/jira/browse/SPARK-11395
>          Project: Spark
>       Issue Type: New Feature
>       Components: SparkR
> Affects Versions: 1.5.1
>         Reporter: Sun Rui
>         Assignee: Sun Rui
>        Fix For: 2.0.0
>
> 1. Implement over() in the Column class.
> 2. Support window spec
>    (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.WindowSpec)
> 3. Support utility functions for defining windows in DataFrames
>    (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Window)
[jira] [Commented] (SPARK-10043) Add window functions into SparkR
[ https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273475#comment-15273475 ]

Shivaram Venkataraman commented on SPARK-10043:
-----------------------------------------------

[~sunrui] Can we resolve this issue now?

> Add window functions into SparkR
> --------------------------------
>
>         Key: SPARK-10043
>         URL: https://issues.apache.org/jira/browse/SPARK-10043
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>    Reporter: Yu Ishikawa
>
> Add the following window functions in SparkR. I think we should also improve
> the {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber
[jira] [Created] (SPARK-15170) Log error message in ExecutorAllocationManager
meiyoula created SPARK-15170:
-----------------------------

    Summary: Log error message in ExecutorAllocationManager
        Key: SPARK-15170
        URL: https://issues.apache.org/jira/browse/SPARK-15170
    Project: Spark
 Issue Type: Bug
 Components: Spark Core
   Reporter: meiyoula

No matter how long an executor has actually been idle, the log just says "it has been idle for $executorIdleTimeoutS seconds". Because executorIdleTimeoutS = conf.getTimeAsSeconds("spark.dynamicAllocation.executorIdleTimeout", "60s"), the same expiry time is logged for every executor.
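A sketch of the kind of fix the report implies (hypothetical names; the real code lives in Spark's ExecutorAllocationManager): compute and log the executor's actual idle duration instead of interpolating the configured timeout.

```java
// Hypothetical sketch: derive the logged idle duration from when the executor
// actually went idle, not from the spark.dynamicAllocation.executorIdleTimeout
// config value that the reported message always prints.
public class IdleLogDemo {
    static String idleMessage(long idleStartMillis, long nowMillis) {
        long idleSeconds = (nowMillis - idleStartMillis) / 1000;
        return "Removing executor because it has been idle for " + idleSeconds + " seconds";
    }
}
```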
[jira] [Commented] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273416#comment-15273416 ] Apache Spark commented on SPARK-15074: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/12944 > Spark shuffle service bottlenecked while fetching large amount of > intermediate data > --- > > Key: SPARK-15074 > URL: https://issues.apache.org/jira/browse/SPARK-15074 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia > > While running a job which produces more than 90TB of intermediate data, we > find that about 10-15% of the reducer execution time is being spent in > shuffle fetch. > Jstack of the shuffle service reveals that most of the time the shuffle > service is reading the index files generated by the mapper. > {code} > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.DataInputStream.readFully(DataInputStream.java:195) > at java.io.DataInputStream.readLong(DataInputStream.java:416) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:277) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:190) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue is that for each shuffle fetch, we reopen and read the same index file. It would be much more efficient if we could avoid opening the same file multiple times by caching the data. We can use an LRU cache to store the index file information. This way we can also limit the number of entries in the cache so that memory usage does not grow indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
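The LRU-cache remedy described in the issue can be sketched as follows. This is an illustrative sketch in Python, not the shuffle service's actual Java implementation; the `IndexCache` name, the `max_entries` parameter, and the `load` callback are hypothetical.

```python
from collections import OrderedDict


class IndexCache:
    """Bounded LRU cache mapping an index-file path to its parsed contents."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, path, load):
        # On a hit, move the entry to the most-recently-used position
        # and skip re-reading the file.
        if path in self._entries:
            self._entries.move_to_end(path)
            return self._entries[path]
        # On a miss, read the index file once and cache the result.
        value = load(path)
        self._entries[path] = value
        # Evict the least-recently-used entry so memory stays bounded.
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return value
```

With such a cache, repeated fetches for the same map output hit the cached offsets instead of reopening the index file, and `max_entries` caps the memory the cache can consume.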
[jira] [Assigned] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15074: Assignee: Apache Spark > Spark shuffle service bottlenecked while fetching large amount of > intermediate data > --- > > Key: SPARK-15074 > URL: https://issues.apache.org/jira/browse/SPARK-15074 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Assignee: Apache Spark > > While running a job which produces more than 90TB of intermediate data, we > find that about 10-15% of the reducer execution time is being spent in > shuffle fetch. > Jstack of the shuffle service reveals that most of the time the shuffle > service is reading the index files generated by the mapper. > {code} > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.DataInputStream.readFully(DataInputStream.java:195) > at java.io.DataInputStream.readLong(DataInputStream.java:416) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:277) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:190) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue is that for each shuffle fetch, we reopen and read the same index file. It would be much more efficient if we could avoid opening the same file multiple times by caching the data. We can use an LRU cache to store the index file information. This way we can also limit the number of entries in the cache so that memory usage does not grow indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Assigned] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15074: Assignee: (was: Apache Spark) > Spark shuffle service bottlenecked while fetching large amount of > intermediate data > --- > > Key: SPARK-15074 > URL: https://issues.apache.org/jira/browse/SPARK-15074 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia > > While running a job which produces more than 90TB of intermediate data, we > find that about 10-15% of the reducer execution time is being spent in > shuffle fetch. > Jstack of the shuffle service reveals that most of the time the shuffle > service is reading the index files generated by the mapper. > {code} > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.DataInputStream.readFully(DataInputStream.java:195) > at java.io.DataInputStream.readLong(DataInputStream.java:416) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:277) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:190) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue is that for each shuffle fetch, we 
reopen and read the same index file. It would be much more efficient if we could avoid opening the same file multiple times by caching the data. We can use an LRU cache to store the index file information. This way we can also limit the number of entries in the cache so that memory usage does not grow indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-14963) YarnShuffleService should use YARN getRecoveryPath() for leveldb location
[ https://issues.apache.org/jira/browse/SPARK-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273412#comment-15273412 ] Saisai Shao commented on SPARK-14963: - OK, I will do it. > YarnShuffleService should use YARN getRecoveryPath() for leveldb location > - > > Key: SPARK-14963 > URL: https://issues.apache.org/jira/browse/SPARK-14963 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 1.6.1 >Reporter: Thomas Graves > > The YarnShuffleService currently just picks a directory in the YARN local dirs to store the leveldb file. YARN added an interface in Hadoop 2.5, getRecoveryPath(), to get the location where it should be storing this. We should change it to use getRecoveryPath(). This does mean we will have to use reflection or similar to check for its existence, though, since the method doesn't exist before Hadoop 2.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
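The existence check described above — call getRecoveryPath() when the running Hadoop version provides it, and fall back to the old local-dirs behavior otherwise — would be done in Java with java.lang.reflect. The sketch below illustrates the same fall-back pattern in Python using getattr; the `recovery_path` function and the two context classes are hypothetical names for illustration only.

```python
def recovery_path(context, fallback):
    # Prefer the newer API when the runtime provides it (analogous to
    # getRecoveryPath() existing only on Hadoop >= 2.5); otherwise fall
    # back to the older behavior.
    method = getattr(context, "getRecoveryPath", None)
    if callable(method):
        return method()
    return fallback


class NewYarnContext:
    """Stands in for a Hadoop >= 2.5 context that has the method."""
    def getRecoveryPath(self):
        return "/yarn/recovery"


class OldYarnContext:
    """Stands in for an older context without the method."""
    pass
```

The same decision — probe for the method once, then dispatch — is what a reflection-based check in the Java service would amount to.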
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273396#comment-15273396 ] Shivaram Venkataraman commented on SPARK-15159: --- Is it being removed or is it being deprecated in 2.0? If it's being removed, then we need to make this a priority. > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273391#comment-15273391 ] Apache Spark commented on SPARK-15168: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12943 > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15168: Assignee: (was: Apache Spark) > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15168: Assignee: Apache Spark > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273379#comment-15273379 ] Felix Cheung commented on SPARK-15159: -- With the updated goal, this seems to be a fairly big change, how do we want to proceed? > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15169) Consider improving HasSolver to allow generalization
holdenk created SPARK-15169: --- Summary: Consider improving HasSolver to allow generalization Key: SPARK-15169 URL: https://issues.apache.org/jira/browse/SPARK-15169 Project: Spark Issue Type: Improvement Components: ML Reporter: holdenk Priority: Trivial The current HasSolver shared param has a fixed default value of "auto" and no validation. Some algorithms (see `MultilayerPerceptronClassifier`) have different default values or validators. This results in either a mostly duplicated param (as in `MultilayerPerceptronClassifier`) or incorrect scaladoc (as in `GeneralizedLinearRegression`). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
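The generalization suggested here — letting each algorithm supply its own default value and validator instead of inheriting the fixed "auto" with no validation — can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`SolverParam`, `allowed`); it is not the actual spark.ml shared-params code.

```python
class SolverParam:
    """A solver param whose default and allowed values vary per algorithm."""

    def __init__(self, default="auto", allowed=None):
        self.default = default
        # None means "accept any value" (the current HasSolver behavior);
        # an algorithm can pass an explicit list to get validation.
        self.allowed = allowed
        self.value = default

    def set(self, value):
        if self.allowed is not None and value not in self.allowed:
            raise ValueError(
                "solver must be one of %s, got %r" % (self.allowed, value))
        self.value = value
        return self


# An algorithm like MultilayerPerceptronClassifier could then declare its
# own default and validator instead of duplicating the whole param:
mlp_solver = SolverParam(default="l-bfgs", allowed=["l-bfgs", "gd"])
```

With this shape, the shared param carries the documentation and plumbing once, while each estimator customizes only the default and the validator.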
[jira] [Updated] (SPARK-13566) Deadlock between MemoryStore and BlockManager
[ https://issues.apache.org/jira/browse/SPARK-13566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13566: -- Assignee: cen yuhai > Deadlock between MemoryStore and BlockManager > - > > Key: SPARK-13566 > URL: https://issues.apache.org/jira/browse/SPARK-13566 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.0 > Environment: Spark 1.6.0 hadoop2.2.0 jdk1.8.0_65 centOs 6.2 >Reporter: cen yuhai >Assignee: cen yuhai > > === > "block-manager-slave-async-thread-pool-1": > at org.apache.spark.storage.MemoryStore.remove(MemoryStore.scala:216) > - waiting to lock <0x0005895b09b0> (a > org.apache.spark.memory.UnifiedMemoryManager) > at > org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1114) > - locked <0x00058ed6aae0> (a org.apache.spark.storage.BlockInfo) > at > org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101) > at > org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101) > at scala.collection.immutable.Set$Set2.foreach(Set.scala:94) > at > org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1101) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:84) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > "Executor task launch worker-10": > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1032) > - waiting to lock <0x00059a0988b8> (a > org.apache.spark.storage.BlockInfo) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009) > at > org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460) > at > org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
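The two stacks above show the classic deadlock shape: one thread holds a block's info lock and waits for the UnifiedMemoryManager lock, while the other (inside evictBlocksToFreeSpace) holds the memory-manager lock and waits for the block lock. A standard remedy is to impose one global lock-acquisition order. The Python sketch below illustrates that generic fix; it is not the patch actually merged for this ticket, and the lock names are illustrative.

```python
import threading

memory_manager_lock = threading.Lock()  # stands in for the memory-manager lock
block_info_lock = threading.Lock()      # stands in for a per-block info lock

results = []


def with_both_locks(fn):
    # Every code path acquires the locks in the same fixed order
    # (memory manager first, then block info), so no thread can hold
    # one lock while waiting on a thread that holds the other.
    with memory_manager_lock:
        with block_info_lock:
            results.append(fn())


t1 = threading.Thread(target=with_both_locks, args=(lambda: "evict",))
t2 = threading.Thread(target=with_both_locks, args=(lambda: "remove",))
t1.start(); t2.start()
t1.join(); t2.join()
```

If the two threads instead acquired the locks in opposite orders — as the removeBroadcast and eviction paths do in the stacks above — each could end up holding one lock while blocked on the other.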
[jira] [Updated] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15168: Description: MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. (was: MultilayerPerceptronClassifier is missing Tol, solver, and weights. Add these params.) > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
holdenk created SPARK-15168: --- Summary: Add missing params to Python's MultilayerPerceptronClassifier Key: SPARK-15168 URL: https://issues.apache.org/jira/browse/SPARK-15168 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: holdenk Priority: Trivial MultilayerPerceptronClassifier is missing Tol, solver, and weights. Add these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-15159: Description: HiveContext is to be deprecated in 2.0. Replace them with SparkSession.withHiveSupport in SparkR (was: HiveContext is to be deprecated in 2.0. However, there are several usages of HiveContext in SparkR unit test cases. Replace them with SparkSession.withHiveSupport.) > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-15159: Summary: Remove usage of HiveContext in SparkR. (was: Remove usage of HiveContext in SparkR unit test cases.) > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. However, there are several usages of > HiveContext in SparkR unit test cases. Replace them with > SparkSession.withHiveSupport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15167) Add public catalog implementation method to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15167: Assignee: Andrew Or (was: Apache Spark) > Add public catalog implementation method to SparkSession > > > Key: SPARK-15167 > URL: https://issues.apache.org/jira/browse/SPARK-15167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Right now there's no way to check whether a given SparkSession has Hive > support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but > that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15167) Add public catalog implementation method to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15167: Assignee: Apache Spark (was: Andrew Or) > Add public catalog implementation method to SparkSession > > > Key: SPARK-15167 > URL: https://issues.apache.org/jira/browse/SPARK-15167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > Right now there's no way to check whether a given SparkSession has Hive > support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but > that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15167) Add public catalog implementation method to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273368#comment-15273368 ] Apache Spark commented on SPARK-15167: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12942 > Add public catalog implementation method to SparkSession > > > Key: SPARK-15167 > URL: https://issues.apache.org/jira/browse/SPARK-15167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Right now there's no way to check whether a given SparkSession has Hive > support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but > that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15166: Assignee: Andrew Or (was: Apache Spark) > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273354#comment-15273354 ] Apache Spark commented on SPARK-15166: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12941 > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15166: Assignee: Apache Spark (was: Andrew Or) > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15165: - Priority: Critical (was: Major) > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta >Priority: Critical > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" is put before "u", like "\\u", in a string > literal in the query, codegen can break. > The following code causes a compilation error. > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of the comment). > Due to this unsafety, arbitrary code can be injected as follows. > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
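A genuinely comment-safe escape has to neutralize unicode escapes for any backslash count and also break up literal comment delimiters, since either route can smuggle "*/" into the generated comment. The function below is a generic sketch of that idea in Python; it is not the fix Spark actually merged for SPARK-15165, and `to_comment_safe` is a hypothetical name.

```python
def to_comment_safe(s):
    # Double every backslash. Afterwards every backslash run has even
    # length, so the backslash immediately before any "u" is preceded by
    # an odd number of backslashes and (per JLS 3.3) can no longer start
    # a \uXXXX unicode escape in the generated Java source.
    out = s.replace("\\", "\\\\")
    # Also break up literal comment delimiters so nothing in the string
    # can terminate (or reopen) the enclosing block comment.
    out = out.replace("*/", "*\\/").replace("/*", "/\\*")
    return out
```

Running the delimiter step after the doubling step matters: doubling never creates a "*/" or "/*" pair, so a single pass over each delimiter suffices.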
[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15165: - Target Version/s: 2.0.0 > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" is put before "u", like "\\u", in a string > literal in the query, codegen can break. > The following code causes a compilation error. > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of the comment). > Due to this unsafety, arbitrary code can be injected as follows. > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15152) Scaladoc and Code style Improvements
[ https://issues.apache.org/jira/browse/SPARK-15152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15152. --- Resolution: Fixed Assignee: Jacek Laskowski Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Scaladoc and Code style Improvements > > > Key: SPARK-15152 > URL: https://issues.apache.org/jira/browse/SPARK-15152 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, Spark Core, SQL, YARN >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 2.0.0 > > > While doing code reviews for the Spark Notes I found many places with typos > and incorrect code style. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15167) Add public catalog implementation method to SparkSession
Andrew Or created SPARK-15167: - Summary: Add public catalog implementation method to SparkSession Key: SPARK-15167 URL: https://issues.apache.org/jira/browse/SPARK-15167 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Right now there's no way to check whether a given SparkSession has Hive support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15166: -- Summary: Move hive-specific conf setting from SparkSession (was: Move hive-specific conf setting to HiveSharedState) > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15166) Move hive-specific conf setting to HiveSharedState
Andrew Or created SPARK-15166: - Summary: Move hive-specific conf setting to HiveSharedState Key: SPARK-15166 URL: https://issues.apache.org/jira/browse/SPARK-15166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-15165: --- Description: The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. But if an even number of "\" appears before "u" in the string literal in the query, as in "\\u", codegen can still break. The following code causes a compilation error: {code} val df = Seq(...).toDF df.select("'\\u002A/'").show {code} The compilation error occurs because "\\u002A/" is translated into "*/" (the end of a comment). Due to this unsafety, arbitrary code can be injected, as follows: {code} val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\u002A/{System.exit(1);}/*'").show {code} was: The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. But if an even number of "\" appears before "u" in the string literal in the query, as in "\\u", codegen can still break. The following code causes a compilation error: {code} val df = Seq(...).toDF df.select("'\\u002A/'").show {code} The compilation error occurs because "\\u002A/" is translated into "*/" (the end of a comment). Due to this unsafety, arbitrary code can be injected, as follows: {code} val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\u002A/{System.exit(1);}/*'").show {code} > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking > codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error:
> {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
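To make the hazard concrete, here is a small standalone Java sketch. The method names are invented for illustration and this is not Spark's actual toCommentSafeString or its eventual fix: buggyEscape mimics the flawed strategy of prefixing "\u" with one extra backslash, which turns a two-backslash run into a three-backslash run and thereby *creates* an active unicode escape; saferEscape doubles every backslash first, so no backslash ahead of a "u" can ever start an escape, and then breaks up literal "*/" sequences.

```java
// Hypothetical demo of the escaping flaw; NOT Spark's implementation.
public class CommentEscapeDemo {

    // Flawed strategy: prefix every "\u" with one extra backslash.
    // "\\u002A/" (two backslashes) becomes "\\\u002A/" (three): the third
    // backslash is preceded by an even number of backslashes, so javac
    // would now decode \u002A as '*' -- even inside a comment.
    static String buggyEscape(String s) {
        return s.replace("\\u", "\\\\u");
    }

    // Safer sketch: double every backslash (all runs become even-length,
    // so the backslash before any 'u' is preceded by an odd count and can
    // never begin a unicode escape), then defuse comment terminators.
    static String saferEscape(String s) {
        return s.replace("\\", "\\\\").replace("*/", "*\\/");
    }
}
```

With the buggy version, a generated comment containing the escaped text still decodes to "*/" and can terminate the comment early, which is exactly the injection shown in the ticket.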
[jira] [Comment Edited] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273241#comment-15273241 ] Sital Kedia edited comment on SPARK-15074 at 5/5/16 10:32 PM: -- Okay, I made a change to cache the index file and that made the shuffle read time twice as fast. I am going to put out a PR for that change soon. Now I see the shuffle service is spending most of the time in the FileChannelImpl.transferTo method (Refer to the stack trace below). I wonder if there is a way to speed it up further? {code} java.lang.Thread.State: RUNNABLE at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:427) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:492) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:607) at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96) at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89) at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254) at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:311) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:729) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1127) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:693) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:681) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:716) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:954) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:244) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:184) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:129) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:100) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at
[jira] [Commented] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273241#comment-15273241 ] Sital Kedia commented on SPARK-15074: - Okay, I made a change to cache the index file and that made the shuffle read time twice as fast. I am going to put out a PR for that change soon. Now I see the shuffle service is spending most of the time in the FileChannelImpl.transferTo method (Refer to the stack trace below). I wonder if there is a way to speed it up further? java.lang.Thread.State: RUNNABLE at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:427) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:492) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:607) at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96) at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89) at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254) at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:311) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:729) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1127) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:693) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:681) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:716) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:954) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:244) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:184) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:129) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:100) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at
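The index-file caching described in the comment can be sketched with a bounded LRU map. All names below are hypothetical and this is not the actual Spark patch: the point is only that each shuffle index file is a tiny array of offsets delimiting the reduce partitions in the data file, so keeping recently used offset arrays in memory avoids re-opening and re-reading the same small file on every fetch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of index-file caching; NOT the actual Spark change.
public class IndexCache {
    private final Map<String, long[]> cache;

    public IndexCache(final int maxEntries) {
        // accessOrder=true makes LinkedHashMap iterate least-recently-used
        // first, so evicting the "eldest" entry gives LRU behavior.
        this.cache = new LinkedHashMap<String, long[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, long[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // offsets[i]..offsets[i+1] delimit reduce partition i in the data file.
    // The loader reads the index file from disk only on a cache miss.
    public synchronized long[] getOffsets(String indexFile, Function<String, long[]> loader) {
        long[] offsets = cache.get(indexFile);   // updates recency on a hit
        if (offsets == null) {
            offsets = loader.apply(indexFile);
            cache.put(indexFile, offsets);       // may evict the LRU entry
        }
        return offsets;
    }
}
```

A cache like this trades a bounded amount of heap for one fewer disk read per fetch; the transferTo path itself is zero-copy and largely bound by disk and network throughput.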
[jira] [Assigned] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15165: Assignee: Apache Spark > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error: > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273238#comment-15273238 ] Apache Spark commented on SPARK-15165: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/12939 > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error: > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15165: Assignee: (was: Apache Spark) > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error: > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
Kousuke Saruta created SPARK-15165: -- Summary: Codegen can break because toCommentSafeString is not actually safe Key: SPARK-15165 URL: https://issues.apache.org/jira/browse/SPARK-15165 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Kousuke Saruta The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. But if an even number of "\" appears before "u" in the string literal in the query, as in "\\u", codegen can still break. The following code causes a compilation error: {code} val df = Seq(...).toDF df.select("'\\u002A/'").show {code} The compilation error occurs because "\\u002A/" is translated into "*/" (the end of a comment). Due to this unsafety, arbitrary code can be injected, as follows: {code} val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\u002A/{System.exit(1);}/*'").show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14977) Fine grained mode in Mesos is not fair
[ https://issues.apache.org/jira/browse/SPARK-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt closed SPARK-14977. --- Resolution: Not A Problem > Fine grained mode in Mesos is not fair > -- > > Key: SPARK-14977 > URL: https://issues.apache.org/jira/browse/SPARK-14977 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 > Environment: Spark commit db75ccb, Debian jessie, Mesos fine grained >Reporter: Luca Bruno > > I've setup a mesos cluster and I'm running spark in fine grained mode. > Spark defaults to 2 executor cores and 2gb of ram. > The total mesos cluster has 8 cores and 8gb of ram. > When I submit two spark jobs simultaneously, spark will always accept full > resources, leading the two frameworks to use 4gb of ram each instead of 2gb. > If I submit another spark job, it will not get offered resources from mesos, > at least using the default HierarchicalDRF allocator module. > Mesos will keep offering 4gb of ram to earlier spark jobs, and spark keeps > accepting full resources for every new task. > Hence new spark jobs have no chance of getting a share. > Is this something to be solved with a custom mesos allocator? Or spark should > be more fair instead? Or maybe provide a configuration option to always > accept with the minimum resources? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14977) Fine grained mode in Mesos is not fair
[ https://issues.apache.org/jira/browse/SPARK-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273202#comment-15273202 ] Michael Gummelt commented on SPARK-14977: - [~lethalman]: Fine-grained mode only releases cores, not memory. It's impossible for us to shrink the memory allocation without OOM-ing the executor, because the JVM doesn't relinquish memory back to the OS. You can use dynamic allocation to terminate entire executors as they become idle. Also, FYI, fine-grained mode will soon be deprecated in favor of dynamic allocation. > Fine grained mode in Mesos is not fair > -- > > Key: SPARK-14977 > URL: https://issues.apache.org/jira/browse/SPARK-14977 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 > Environment: Spark commit db75ccb, Debian jessie, Mesos fine grained >Reporter: Luca Bruno > > I've setup a mesos cluster and I'm running spark in fine grained mode. > Spark defaults to 2 executor cores and 2gb of ram. > The total mesos cluster has 8 cores and 8gb of ram. > When I submit two spark jobs simultaneously, spark will always accept full > resources, leading the two frameworks to use 4gb of ram each instead of 2gb. > If I submit another spark job, it will not get offered resources from mesos, > at least using the default HierarchicalDRF allocator module. > Mesos will keep offering 4gb of ram to earlier spark jobs, and spark keeps > accepting full resources for every new task. > Hence new spark jobs have no chance of getting a share. > Is this something to be solved with a custom mesos allocator? Or spark should > be more fair instead? Or maybe provide a configuration option to always > accept with the minimum resources? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
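The dynamic-allocation alternative mentioned in the comment is driven by configuration rather than code. A hedged sketch of the relevant spark-defaults.conf settings follows (the values are illustrative placeholders, not recommendations; on Mesos, dynamic allocation additionally requires the external shuffle service to be running on each agent):

```properties
# Illustrative values only -- tune for your cluster.
spark.dynamicAllocation.enabled             true
spark.shuffle.service.enabled               true
spark.dynamicAllocation.minExecutors        1
spark.dynamicAllocation.maxExecutors        4
spark.dynamicAllocation.executorIdleTimeout 60s
```

With these settings, idle executors are torn down after the idle timeout, which frees both cores and memory back to Mesos, rather than only cores as in fine-grained mode.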
[jira] [Assigned] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15162: Assignee: Apache Spark > Update PySpark LogisticRegression threshold PyDoc to be as complete as > Scaladoc > --- > > Key: SPARK-15162 > URL: https://issues.apache.org/jira/browse/SPARK-15162 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > The PyDoc for setting and getting the threshold in logistic regression > doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273180#comment-15273180 ] Apache Spark commented on SPARK-15162: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12938 > Update PySpark LogisticRegression threshold PyDoc to be as complete as > Scaladoc > --- > > Key: SPARK-15162 > URL: https://issues.apache.org/jira/browse/SPARK-15162 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Trivial > > The PyDoc for setting and getting the threshold in logistic regression > doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273181#comment-15273181 ] Apache Spark commented on SPARK-15164: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12938 > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15164: Assignee: Apache Spark > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15162: Assignee: (was: Apache Spark) > Update PySpark LogisticRegression threshold PyDoc to be as complete as > Scaladoc > --- > > Key: SPARK-15162 > URL: https://issues.apache.org/jira/browse/SPARK-15162 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Trivial > > The PyDoc for setting and getting the threshold in logistic regression > doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15164: Assignee: (was: Apache Spark) > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14893. --- Resolution: Fixed Assignee: Dilip Biswal Fix Version/s: 2.0.0 > Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed > --- > > Key: SPARK-14893 > URL: https://issues.apache.org/jira/browse/SPARK-14893 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > The test was disabled in https://github.com/apache/spark/pull/12585. To > re-enable it we need to rebuild the jar using the updated source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-14812: --- Assignee: DB Tsai > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15158) Too aggressive logging in SizeBasedRollingPolicy?
[ https://issues.apache.org/jira/browse/SPARK-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15158: -- Assignee: Kai Wang > Too aggressive logging in SizeBasedRollingPolicy? > - > > Key: SPARK-15158 > URL: https://issues.apache.org/jira/browse/SPARK-15158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Kai Wang >Assignee: Kai Wang >Priority: Trivial > Fix For: 2.0.0 > > > The questionable line is this: > https://github.com/apache/spark/blob/3e27940a19e7bab448f1af11d2065ecd1ec66197/core/src/main/scala/org/apache/spark/util/logging/RollingPolicy.scala#L116 > This will output a message *whenever* anything is logged at executor level. > Like the following: > SizeBasedRollingPolicy:59 83 + 140796 > 1048576 > SizeBasedRollingPolicy:59 83 + 140879 > 1048576 > SizeBasedRollingPolicy:59 83 + 140962 > 1048576 > ... > This seems too aggressive. Should this at least be downgraded to debug level? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9926. -- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Parallelize file listing for partitioned Hive table > --- > > Key: SPARK-9926 > URL: https://issues.apache.org/jira/browse/SPARK-9926 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Ryan Blue > Fix For: 2.0.0 > > > In Spark SQL, short queries like {{select * from table limit 10}} run very > slowly against partitioned Hive tables because of file listing. In > particular, if a large number of partitions are scanned on storage like S3, > the queries run extremely slowly. Here are some example benchmarks in my > environment- > * Parquet-backed Hive table > * Partitioned by dateint and hour > * Stored on S3 > ||\# of partitions||\# of files||runtime||query|| > |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit > 10;| > |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| > |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and > dateint<=20150610 limit 10;| > The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive > partition path and group them into a UnionRDD. Then, all the input files are > listed sequentially. In other tools such as Hive and Pig, this can be solved > by setting > [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] > high. But in Spark, since each HadoopRDD lists only one partition path, > setting this property doesn't help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
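The bottleneck described above — each HadoopRDD listing only its own partition path, so listing happens one partition at a time — can be illustrated with a small Spark-free sketch using a thread pool. This is a hypothetical illustration of the parallelization idea, not Spark's actual `TableReader` code; the helper names are invented:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_files(path):
    """List the files directly under one partition directory."""
    return [os.path.join(path, f) for f in os.listdir(path)]

def list_partitions_parallel(partition_dirs, num_threads=8):
    # Sequential listing costs one filesystem round trip per partition;
    # a thread pool overlaps the (I/O-bound) calls, which is what makes
    # the biggest difference on high-latency storage such as S3.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(list_files, partition_dirs))
    return [f for files in results for f in files]

# Tiny demo: local directories standing in for S3 partition paths.
root = tempfile.mkdtemp()
dirs = []
for dateint in (20150601, 20150602):
    d = os.path.join(root, "dateint={}".format(dateint))
    os.makedirs(d)
    open(os.path.join(d, "part-00000.parquet"), "w").close()
    dirs.append(d)

files = list_partitions_parallel(dirs)
```

In Hadoop MapReduce the analogous knob is `mapreduce.input.fileinputformat.list-status.num-threads`; the point of the JIRA is that the per-partition-RDD structure prevented that setting from helping in Spark.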
[jira] [Resolved] (SPARK-15158) Too aggressive logging in SizeBasedRollingPolicy?
[ https://issues.apache.org/jira/browse/SPARK-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15158. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Too aggressive logging in SizeBasedRollingPolicy? > - > > Key: SPARK-15158 > URL: https://issues.apache.org/jira/browse/SPARK-15158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Kai Wang >Priority: Trivial > Fix For: 2.0.0 > > > The questionable line is this: > https://github.com/apache/spark/blob/3e27940a19e7bab448f1af11d2065ecd1ec66197/core/src/main/scala/org/apache/spark/util/logging/RollingPolicy.scala#L116 > This will output a message *whenever* anything is logged at executor level. > Like the following: > SizeBasedRollingPolicy:59 83 + 140796 > 1048576 > SizeBasedRollingPolicy:59 83 + 140879 > 1048576 > SizeBasedRollingPolicy:59 83 + 140962 > 1048576 > ... > This seems too aggressive. Should this at least be downgraded to debug level? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
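The check behind the flooding log line quoted above can be sketched as follows. This is a simplified, hypothetical Python analogue of the Scala `SizeBasedRollingPolicy`, not the actual class, written to show why the per-write message belongs at debug rather than info level:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("SizeBasedRollingPolicy")

class SizeBasedRollingPolicy:
    """Roll the output file once the bytes written would exceed a cap."""

    def __init__(self, max_bytes=1024 * 1024):
        self.max_bytes = max_bytes
        self.bytes_written = 0

    def should_rollover(self, incoming_bytes):
        # Debug, not info: this runs on *every* write, so at a higher
        # level it emits one line per logged message, exactly the
        # "83 + 140796 > 1048576" flood shown in the report.
        logger.debug("%d + %d > %d", self.bytes_written,
                     incoming_bytes, self.max_bytes)
        return self.bytes_written + incoming_bytes > self.max_bytes

    def record(self, n):
        self.bytes_written += n

policy = SizeBasedRollingPolicy(max_bytes=100)
policy.record(83)
roll = policy.should_rollover(140)  # 83 + 140 > 100, so a rollover is due
```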
[jira] [Resolved] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15134. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > This issue addresses the comments in SPARK-15031 and also fix java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15135) Make sure SparkSession thread safe
[ https://issues.apache.org/jira/browse/SPARK-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15135. --- Resolution: Fixed Fix Version/s: 2.0.0 > Make sure SparkSession thread safe > -- > > Key: SPARK-15135 > URL: https://issues.apache.org/jira/browse/SPARK-15135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15072) Remove SparkSession.withHiveSupport
[ https://issues.apache.org/jira/browse/SPARK-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15072. --- Resolution: Fixed > Remove SparkSession.withHiveSupport > --- > > Key: SPARK-15072 > URL: https://issues.apache.org/jira/browse/SPARK-15072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sandeep Singh > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273117#comment-15273117 ] Alex Bozarth commented on SPARK-10653: -- I'm currently running tests on a fix for this and will open a PR after. I have removed blockTransferService and sparkFilesDir and replaced the few references to them. ExecutorMemoryManager was already removed in SPARK-10984. I also took a quick look at the other vals in the constructor and I didn't see any other low hanging fruit to remove. > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null
[ https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273100#comment-15273100 ] Michael Armbrust commented on SPARK-15140: -- The 2.0 behavior seems correct. Ideally .toDS().collect() will always round-trip the data without change. > ensure input object of encoder is not null > -- > > Key: SPARK-15140 > URL: https://issues.apache.org/jira/browse/SPARK-15140 > Project: Spark > Issue Type: Improvement >Reporter: Wenchen Fan > > Currently we assume the input object for the encoder won't be null, but we don't > check it. For example, in 1.6 `Seq("a", null).toDS.collect` will throw NPE, > in 2.0 this will return Array("a", null). > We should define this behaviour more clearly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
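The round-trip property discussed above — `collect()` returning exactly the input, nulls included — can be stated as a small invariant check. This is a plain-Python stand-in for a null-tolerant encoder, purely illustrative; `encode`/`decode` are invented names, not Dataset encoder APIs:

```python
def encode(value):
    # Carry None through explicitly instead of assuming it never occurs
    # (the 1.6 behavior described above threw an NPE on the null element).
    return ("null", None) if value is None else ("str", value)

def decode(tagged):
    tag, payload = tagged
    return None if tag == "null" else payload

# The 2.0 behavior: Seq("a", null) round-trips to ("a", null) unchanged.
data = ["a", None]
round_tripped = [decode(encode(v)) for v in data]
```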
[jira] [Updated] (SPARK-14959) Problem Reading partitioned ORC or Parquet files
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14959: - Priority: Critical (was: Major) > Problem Reading partitioned ORC or Parquet files > - > > Key: SPARK-14959 > URL: https://issues.apache.org/jira/browse/SPARK-14959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4) >Reporter: Sebastian YEPES FERNANDEZ >Priority: Critical > > Hello, > I have noticed that in the past few days there is an issue when trying to read > partitioned files from HDFS. > I am running on Spark master branch #c544356 > The write actually works but the read fails. > {code:title=Issue Reproduction} > case class Data(id: Int, text: String) > val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, > "world"), Data(1, "there")) ) > scala> > ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> java.io.FileNotFoundException: Path is not a file: > /user/spark/test.parquet/id=0 > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at >
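The `Path is not a file: /user/spark/test.parquet/id=0` failure in the trace above comes from handing a partition *directory* to a call that expects a file. The listing the reader needs can be sketched in plain Python (an illustration of recursive, partition-aware listing; not the actual `HDFSFileCatalog` code):

```python
import os
import tempfile

def leaf_files(root):
    # Recurse through partition directories (id=0/, id=1/, ...) and
    # return only real files, so nothing downstream ever tries to open
    # a directory such as test.parquet/id=0 as if it were a data file.
    out = []
    for dirpath, _dirnames, filenames in os.walk(root):
        out.extend(os.path.join(dirpath, f) for f in filenames)
    return sorted(out)

# Recreate the partitioned layout from the reproduction above on disk.
root = tempfile.mkdtemp()
for part in ("id=0", "id=1"):
    d = os.path.join(root, part)
    os.makedirs(d)
    with open(os.path.join(d, "part-00000.parquet"), "w"):
        pass

files = leaf_files(root)
```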
[jira] [Updated] (SPARK-14959) Problem Reading partitioned ORC or Parquet files
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14959: - Target Version/s: 2.0.0 Component/s: (was: Input/Output) SQL > Problem Reading partitioned ORC or Parquet files > - > > Key: SPARK-14959 > URL: https://issues.apache.org/jira/browse/SPARK-14959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4) >Reporter: Sebastian YEPES FERNANDEZ > > Hello, > I have noticed that in the past few days there is an issue when trying to read > partitioned files from HDFS. > I am running on Spark master branch #c544356 > The write actually works but the read fails. > {code:title=Issue Reproduction} > case class Data(id: Int, text: String) > val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, > "world"), Data(1, "there")) ) > scala> > ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> java.io.FileNotFoundException: Path is not a file: > /user/spark/test.parquet/id=0 > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at >
[jira] [Updated] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth updated SPARK-10653: - Summary: Remove unnecessary things from SparkEnv (was: head) > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10653) head
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth updated SPARK-10653: - Summary: head (was: Remove unnecessary things from SparkEnv) > head > > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14813: -- Assignee: holdenk (was: Yanbo Liang) > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15159) Remove usage of HiveContext in SparkR unit test cases.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272666#comment-15272666 ] Vijay Parmar edited comment on SPARK-15159 at 5/5/16 8:31 PM: -- Hi Sun, I am interested in taking up this task but unable to assign it to myself. Can you please point me in the right direction? was (Author: vsparmar): Hi Sun, I am interested in taking up this task but unable to assign it to myself. Can you please point me in the right direction, like a link to the repo or something from where I can start? > Remove usage of HiveContext in SparkR unit test cases. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. However, there are several usages of > HiveContext in SparkR unit test cases. Replace them with > SparkSession.withHiveSupport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15159) Remove usage of HiveContext in SparkR unit test cases.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272666#comment-15272666 ] Vijay Parmar edited comment on SPARK-15159 at 5/5/16 8:31 PM: -- Hi Sun, I am interested in taking up this task. Can you please point me in the right direction? was (Author: vsparmar): Hi Sun, I am interested in taking up this task but unable to assign it to myself. Can you please point me in the right direction? > Remove usage of HiveContext in SparkR unit test cases. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. However, there are several usages of > HiveContext in SparkR unit test cases. Replace them with > SparkSession.withHiveSupport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15163) Mark experimental algorithms experimental in PySpark
[ https://issues.apache.org/jira/browse/SPARK-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15163: Component/s: PySpark > Mark experimental algorithms experimental in PySpark > > > Key: SPARK-15163 > URL: https://issues.apache.org/jira/browse/SPARK-15163 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > While we are going through them anyway, we might as well mark as experimental the > PySpark algorithms that are marked so in Scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
holdenk created SPARK-15164: --- Summary: Mark classification algorithms as experimental where marked so in scala Key: SPARK-15164 URL: https://issues.apache.org/jira/browse/SPARK-15164 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: holdenk Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15163) Mark experimental algorithms experimental in PySpark
holdenk created SPARK-15163: --- Summary: Mark experimental algorithms experimental in PySpark Key: SPARK-15163 URL: https://issues.apache.org/jira/browse/SPARK-15163 Project: Spark Issue Type: Improvement Components: ML Reporter: holdenk Priority: Trivial While we are going through them anyway, we might as well mark the PySpark algorithms as experimental that are marked so in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
holdenk created SPARK-15162: --- Summary: Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc Key: SPARK-15162 URL: https://issues.apache.org/jira/browse/SPARK-15162 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Trivial The PyDoc for setting and getting the threshold in logistic regression doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272956#comment-15272956 ] Apache Spark commented on SPARK-15092: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12937 > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Assignee: holdenk >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel in ML. The attribute exists on the MLlib > DecisionTree model. > There's no way to check or print the model tree structure from ML. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what can be done with the MLlib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > AttributeError
https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
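As the report above shows, MLlib's toDebugString output is a pre-order traversal of the tree. A minimal pure-Python sketch of how such a string can be produced recursively (this is not the MLlib implementation; `Node` and `to_debug_string` are hypothetical names used only for illustration):

```python
# Hypothetical sketch of producing a decision-tree debug string by
# recursive pre-order traversal, mimicking the MLlib output format.

class Node:
    def __init__(self, feature=None, threshold=None, prediction=None,
                 left=None, right=None):
        self.feature = feature        # split feature index (None for a leaf)
        self.threshold = threshold    # split threshold
        self.prediction = prediction  # leaf prediction value
        self.left = left              # subtree where feature <= threshold
        self.right = right            # subtree where feature > threshold

def to_debug_string(node, indent=" "):
    # Leaf: emit the prediction.
    if node.feature is None:
        return "{}Predict: {}\n".format(indent, node.prediction)
    # Internal node: emit the split, then recurse into both branches.
    s = "{}If (feature {} <= {})\n".format(indent, node.feature, node.threshold)
    s += to_debug_string(node.left, indent + " ")
    s += "{}Else (feature {} > {})\n".format(indent, node.feature, node.threshold)
    s += to_debug_string(node.right, indent + " ")
    return s

# The depth-1, 3-node tree from the example output above.
tree = Node(feature=0, threshold=0.0,
            left=Node(prediction=0.0), right=Node(prediction=1.0))
print(to_debug_string(tree), end="")
```

The printed output matches the shape of the MLlib example in the issue: an If/Else pair for the root split, with one Predict line per leaf.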
[jira] [Assigned] (SPARK-15080) Break copyAndReset into copy and reset
[ https://issues.apache.org/jira/browse/SPARK-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15080: Assignee: Apache Spark > Break copyAndReset into copy and reset > -- > > Key: SPARK-15080 > URL: https://issues.apache.org/jira/browse/SPARK-15080 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > > We should break copy and reset into two methods rather than just one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15080) Break copyAndReset into copy and reset
[ https://issues.apache.org/jira/browse/SPARK-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272945#comment-15272945 ] Apache Spark commented on SPARK-15080: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/12936 > Break copyAndReset into copy and reset > -- > > Key: SPARK-15080 > URL: https://issues.apache.org/jira/browse/SPARK-15080 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > Fix For: 2.0.0 > > > We should break copy and reset into two methods rather than just one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15080) Break copyAndReset into copy and reset
[ https://issues.apache.org/jira/browse/SPARK-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15080: Assignee: (was: Apache Spark) > Break copyAndReset into copy and reset > -- > > Key: SPARK-15080 > URL: https://issues.apache.org/jira/browse/SPARK-15080 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > Fix For: 2.0.0 > > > We should break copy and reset into two methods rather than just one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
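SPARK-15080 proposes splitting the combined copyAndReset operation into separate copy and reset methods. A hypothetical Python sketch of the motivation (the class and method names here are illustrative only, not Spark's accumulator API): once the two primitives exist, the combined operation is expressible in terms of them, while each part also remains usable on its own.

```python
# Illustrative sketch: an accumulator whose copyAndReset is decomposed
# into independent copy() and reset() operations.

class Accumulator:
    def __init__(self, value=0):
        self.value = value

    def add(self, v):
        self.value += v

    def copy(self):
        # Return an independent accumulator holding the same value.
        return Accumulator(self.value)

    def reset(self):
        # Zero this accumulator in place.
        self.value = 0

    def copy_and_reset(self):
        # The old combined operation, now built from the two parts.
        snapshot = self.copy()
        self.reset()
        return snapshot

acc = Accumulator()
acc.add(5)
snap = acc.copy_and_reset()
print(snap.value, acc.value)  # 5 0
```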
[jira] [Updated] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14915: -- Fix Version/s: 1.6.2 > Tasks that fail due to CommitDeniedException (a side-effect of speculation) > can cause job to never complete > --- > > Key: SPARK-14915 > URL: https://issues.apache.org/jira/browse/SPARK-14915 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.3, 1.6.2, 2.0.0 >Reporter: Jason Moore >Assignee: Jason Moore >Priority: Critical > Fix For: 1.6.2, 2.0.0 > > > In SPARK-14357, code was corrected towards the originally intended behavior > that a CommitDeniedException should not count towards the failure count for a > job. After having run with this fix for a few weeks, it's become apparent > that this behavior has some unintended consequences - that a speculative task > will continuously receive a CDE from the driver, now causing it to fail and > retry over and over without limit. > I'm thinking we could put a task that receives a CDE from the driver into > TaskState.FINISHED or some other state to indicate that the task shouldn't > be resubmitted by the TaskScheduler. I'd probably need some opinions on > whether there are other consequences of doing something like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15138) Linkify ML PyDoc regression
[ https://issues.apache.org/jira/browse/SPARK-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272884#comment-15272884 ] holdenk commented on SPARK-15138: - cc [~yanboliang] > Linkify ML PyDoc regression > --- > > Key: SPARK-15138 > URL: https://issues.apache.org/jira/browse/SPARK-15138 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14854) Left outer join produces incorrect output when the join condition does not have left table key
[ https://issues.apache.org/jira/browse/SPARK-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272865#comment-15272865 ] kanika dhuria commented on SPARK-14854: --- Why do you think they are the same issue? I was expecting all of the left table's data when the join condition is false. Even when I have a condition like $"num1".===(lit(10)), the result is empty. > Left outer join produces incorrect output when the join condition does not > have left table key > -- > > Key: SPARK-14854 > URL: https://issues.apache.org/jira/browse/SPARK-14854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: kanika dhuria > > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val s = StructType(StructField("num", StringType, true)::Nil) > val s1 = StructType(StructField("num1", StringType, true)::Nil) > val m = > sc.textFile("file:/tmp/master.txt").map(_.split(",")).map(p=>Row(p(0))) > val d = > sc.textFile("file:/tmp/detail.txt").map(_.split(",")).map(p=>Row(p(0))) > val m1 = sqlContext.createDataFrame(m, s1) > val d1 = sqlContext.createDataFrame(d, s) > val j1 = d1.join(m1,$"num1".===(lit(null)),"left_outer"); > j1.take(1) > Returns empty data set. Left table has data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
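For reference, standard left-outer-join semantics keep every left-side row and pad unmatched rows with nulls, even when the join condition is always false — which is the behavior the reporter above expects. A minimal pure-Python sketch of those semantics (illustrative only, not Spark's join implementation):

```python
# Reference semantics of a left outer join: every left row appears in
# the output; rows with no match get None on the right side.
def left_outer_join(left, right, cond):
    out = []
    for l in left:
        matches = [r for r in right if cond(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

left = ["a", "b"]
right = ["x"]
# A join condition that is always false still yields all left rows.
result = left_outer_join(left, right, lambda l, r: False)
print(result)  # [('a', None), ('b', None)]
```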
[jira] [Resolved] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15110. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12887 [https://github.com/apache/spark/pull/12887] > SparkR - Implement repartitionByColumn on DataFrame > --- > > Key: SPARK-15110 > URL: https://issues.apache.org/jira/browse/SPARK-15110 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Narine Kokhlikyan > Fix For: 2.0.0 > > > Implement repartitionByColumn on DataFrame. > This will allow us to run R functions on each partition identified by column > groups with the dapply() method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
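The idea behind repartitioning by column is hash partitioning on the column's value: rows that share a key land in the same partition, which is what lets per-group functions such as dapply() run partition-locally. An illustrative pure-Python sketch of that idea (not Spark's implementation; the function and variable names here are hypothetical):

```python
# Illustrative sketch of hash-partitioning rows by a column's value,
# the idea behind repartitioning a DataFrame by column: all rows with
# the same key end up in the same partition.
def repartition_by_column(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Deterministic key -> partition mapping via hash modulo.
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
parts = repartition_by_column(rows, "k", 4)
# Every row with k == "a" lands in exactly one partition index.
a_parts = {i for i, p in enumerate(parts) for r in p if r["k"] == "a"}
print(len(a_parts))  # 1
```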
[jira] [Updated] (SPARK-14811) ML, Graph 2.0 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14811: -- Assignee: Yanbo Liang > ML, Graph 2.0 QA: API: New Scala APIs, docs > --- > > Key: SPARK-14811 > URL: https://issues.apache.org/jira/browse/SPARK-14811 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14813: -- Assignee: Yanbo Liang > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15161) Consider moving featureImportances into TreeEnsemble models base class
holdenk created SPARK-15161: --- Summary: Consider moving featureImportances into TreeEnsemble models base class Key: SPARK-15161 URL: https://issues.apache.org/jira/browse/SPARK-15161 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Minor Right now each of the subclasses has its own implementation; we could consider moving it to the base class (after 2.0). cc [~mlnick] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org