[jira] [Commented] (SPARK-14365) Repartition by column
[ https://issues.apache.org/jira/browse/SPARK-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273664#comment-15273664 ]

Sun Rui commented on SPARK-14365:
---------------------------------

[~dselivanov] Could you verify whether SPARK-15110 solves your problem?

> Repartition by column
> ---------------------
>
>          Key: SPARK-14365
>          URL: https://issues.apache.org/jira/browse/SPARK-14365
>      Project: Spark
>   Issue Type: Improvement
>   Components: SparkR
>     Reporter: Dmitriy Selivanov
>
> Starting from 1.6 it is possible to set partitioning for data frames. For
> example, in Scala it can be done in the following way:
> {code}
> val partitioned = df.repartition($"k")
> {code}
> It would be nice to have this functionality in SparkR.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
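As background for the request above: `df.repartition($"k")` hash-partitions rows by the value of column `k`, so rows with equal keys land in the same partition. A minimal, hypothetical Java sketch of that idea (Spark's actual hash function differs; `Objects.hashCode` here is purely for illustration):

```java
import java.util.Objects;

// Hypothetical sketch of hash partitioning by a column value, the idea behind
// df.repartition($"k"): equal column values map to the same partition id.
public class HashPartitionByColumn {
    // Not Spark's actual hashing; Object.hashCode is an assumption for illustration.
    static int partitionFor(Object columnValue, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(Objects.hashCode(columnValue), numPartitions);
    }
}
```

The point of the SparkR request is exposing this same column-based placement from R, not a new partitioning scheme.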
[jira] [Closed] (SPARK-14365) Repartition by column
[ https://issues.apache.org/jira/browse/SPARK-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Rui closed SPARK-14365.
---------------------------
    Resolution: Duplicate
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273655#comment-15273655 ]

Sun Rui commented on SPARK-15159:
---------------------------------

[~felixcheung], I guess you are talking about SQLContext, not HiveContext. SQLContext is kept for backward compatibility, so we don't need to change it for now. HiveContext is deprecated, not removed. However, I don't think this is a big change. Two pieces only:
1. Modify SparkRHive.init() to use SparkSession;
2. Investigate whether we need to change the use of TestHiveContext in SparkR unit tests. A rough look suggests no change is needed, but I am not sure.
[~vsparmar] Feel free to take this JIRA.

> Remove usage of HiveContext in SparkR.
> --------------------------------------
>
>              Key: SPARK-15159
>              URL: https://issues.apache.org/jira/browse/SPARK-15159
>          Project: Spark
>       Issue Type: Sub-task
>       Components: SparkR
> Affects Versions: 1.6.1
>         Reporter: Sun Rui
>
> HiveContext is to be deprecated in 2.0. Replace its usage with
> SparkSession.withHiveSupport in SparkR.
[jira] [Commented] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273632#comment-15273632 ]

Apache Spark commented on SPARK-14476:
--------------------------------------

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/12947

> Show table name or path in string of DataSourceScan
> ---------------------------------------------------
>
>         Key: SPARK-14476
>         URL: https://issues.apache.org/jira/browse/SPARK-14476
>     Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>    Reporter: Davies Liu
>    Assignee: Cheng Lian
>    Priority: Critical
>
> Right now the string form of DataSourceScan is only "HadoopFiles xxx", without
> any information about the table name or path. Since we had that in 1.6, this
> is a regression of sorts.
[jira] [Commented] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273630#comment-15273630 ]

Sean Zhong commented on SPARK-14476:
------------------------------------

Regression of SPARK-12012
[jira] [Assigned] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14476:
------------------------------------
    Assignee: Cheng Lian  (was: Apache Spark)
[jira] [Assigned] (SPARK-14476) Show table name or path in string of DataSourceScan
[ https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14476:
------------------------------------
    Assignee: Apache Spark  (was: Cheng Lian)
[jira] [Commented] (SPARK-15085) Rename current streaming-kafka artifact to include kafka version
[ https://issues.apache.org/jira/browse/SPARK-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273619#comment-15273619 ]

Apache Spark commented on SPARK-15085:
--------------------------------------

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/12946

> Rename current streaming-kafka artifact to include kafka version
> ----------------------------------------------------------------
>
>         Key: SPARK-15085
>         URL: https://issues.apache.org/jira/browse/SPARK-15085
>     Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>    Reporter: Cody Koeninger
>
> Since supporting Kafka 0.10 will likely need a separate artifact, rename the
> existing artifact now so that the minor breaking change is in place for
> Spark 2.0.
[jira] [Assigned] (SPARK-15085) Rename current streaming-kafka artifact to include kafka version
[ https://issues.apache.org/jira/browse/SPARK-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15085:
------------------------------------
    Assignee:  (was: Apache Spark)
[jira] [Assigned] (SPARK-15085) Rename current streaming-kafka artifact to include kafka version
[ https://issues.apache.org/jira/browse/SPARK-15085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15085:
------------------------------------
    Assignee: Apache Spark
[jira] [Updated] (SPARK-14809) R Examples: Check for new R APIs requiring example code in 2.0
[ https://issues.apache.org/jira/browse/SPARK-14809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14809:
--------------------------------------
    Assignee: Yanbo Liang

> R Examples: Check for new R APIs requiring example code in 2.0
> --------------------------------------------------------------
>
>         Key: SPARK-14809
>         URL: https://issues.apache.org/jira/browse/SPARK-14809
>     Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>    Reporter: Joseph K. Bradley
>    Assignee: Yanbo Liang
>    Priority: Minor
>
> Audit the list of new features added to MLlib's R API, and see which major
> items are missing example code (in the examples folder). We do not need
> examples for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature
>   (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related
>   to") and (b) to this JIRA ("requires").
> Note: This no longer includes Scala/Java/Python since those are covered under
> the user guide.
[jira] [Updated] (SPARK-15155) Optionally ignore default role resources
[ https://issues.apache.org/jira/browse/SPARK-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Heller updated SPARK-15155:
---------------------------------
    Description:
SPARK-6284 added support for Mesos roles, but the framework will still accept resources from both the reserved role specified in {{spark.mesos.role}} and the default role {{*}}.

I'd like to propose the addition of a new boolean property: {{spark.mesos.ignoreDefaultRoleResources}}. When this property is set, Spark will only accept resources from the role given in the {{spark.mesos.role}} property. If {{spark.mesos.role}} has not been set, {{spark.mesos.ignoreDefaultRoleResources}} has no effect.

  was:
SPARK-6284 added support for Mesos roles, but the framework will still accept resources from both the reserved role specified in {{spark.mesos.role}} and the default role {{*}}.

I'd like to propose the addition of a new property {{spark.mesos.acceptedResourceRoles}}, which would be a comma-delimited list of roles that the framework will accept resources from. This is similar to {{spark.mesos.constraints}}, except that constraints look at the attributes of an offer, while this looks at the role of a resource. By default {{spark.mesos.acceptedResourceRoles}} would be set to {{*[,spark.mesos.role]}}, giving exactly the same framework behavior when no value is specified.

> Optionally ignore default role resources
> ----------------------------------------
>
>              Key: SPARK-15155
>              URL: https://issues.apache.org/jira/browse/SPARK-15155
>          Project: Spark
>       Issue Type: Improvement
>       Components: Mesos
> Affects Versions: 1.5.0, 1.6.0
>         Reporter: Chris Heller
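The proposed semantics can be sketched in plain Java. The types below are hypothetical stand-ins (Mesos resources are really protobuf messages, and the real filtering lives in Spark's Mesos scheduler backend); the sketch only illustrates the two rules in the proposal: match on {{spark.mesos.role}}, and make the flag a no-op when no role is configured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of spark.mesos.ignoreDefaultRoleResources filtering.
// Resource is a stand-in for the Mesos protobuf resource type.
public class RoleFilter {
    static class Resource {
        final String name;
        final String role;
        Resource(String name, String role) { this.name = name; this.role = role; }
    }

    static List<Resource> acceptedResources(List<Resource> offered,
                                            String sparkMesosRole,
                                            boolean ignoreDefaultRoleResources) {
        // If spark.mesos.role is unset, the new flag has no effect (per the proposal).
        if (sparkMesosRole == null || !ignoreDefaultRoleResources) {
            return offered;
        }
        List<Resource> accepted = new ArrayList<>();
        for (Resource r : offered) {
            // Keeps only reserved-role resources; default-role "*" offers are dropped.
            if (sparkMesosRole.equals(r.role)) {
                accepted.add(r);
            }
        }
        return accepted;
    }
}
```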
[jira] [Updated] (SPARK-15155) Optionally ignore default role resources
[ https://issues.apache.org/jira/browse/SPARK-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Heller updated SPARK-15155:
---------------------------------
    Summary: Optionally ignore default role resources  (was: Selectively accept Mesos resources by role)
[jira] [Commented] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
[ https://issues.apache.org/jira/browse/SPARK-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273544#comment-15273544 ]

Apache Spark commented on SPARK-15171:
--------------------------------------

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/12945

> Deprecate registerTempTable and add dataset.createTempView
> ----------------------------------------------------------
>
>         Key: SPARK-15171
>         URL: https://issues.apache.org/jira/browse/SPARK-15171
>     Project: Spark
>  Issue Type: Bug
>    Reporter: Sean Zhong
>    Priority: Minor
>
> Our current dataset.registerTempTable does not actually materialize data, so
> it should be treated as creating a temp view. We can deprecate it and add a
> new method, dataset.createTempView(replaceIfExists: Boolean), whose default
> for replaceIfExists is false. registerTempTable will then call
> dataset.createTempView(replaceIfExists = true).
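The proposed semantics can be sketched with a plain registry (hypothetical names; the real implementation lives in Spark's session catalog and stores logical plans, not strings). The key point from the description: a temp view is only a name-to-plan mapping, so nothing is materialized, and {{replaceIfExists}} controls whether an existing name may be overwritten.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of createTempView(replaceIfExists) semantics.
public class TempViewRegistry {
    // The "plan" is stubbed as a String; Spark stores a logical plan here.
    private final Map<String, String> views = new HashMap<>();

    public void createTempView(String name, String plan, boolean replaceIfExists) {
        if (!replaceIfExists && views.containsKey(name)) {
            throw new IllegalStateException("Temp view already exists: " + name);
        }
        views.put(name, plan);
    }

    // The deprecated registerTempTable keeps its old overwrite behavior by delegating.
    public void registerTempTable(String name, String plan) {
        createTempView(name, plan, true);
    }

    public String lookup(String name) { return views.get(name); }
}
```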
[jira] [Assigned] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
[ https://issues.apache.org/jira/browse/SPARK-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15171:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
[ https://issues.apache.org/jira/browse/SPARK-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15171:
------------------------------------
    Assignee:  (was: Apache Spark)
[jira] [Commented] (SPARK-8428) TimSort Comparison method violates its general contract with CLUSTER BY
[ https://issues.apache.org/jira/browse/SPARK-8428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273534#comment-15273534 ]

Yi Zhou commented on SPARK-8428:
--------------------------------

We found a similar issue with Spark 1.6.1 in our larger data size test; details below. We then tried increasing spark.sql.shuffle.partitions to work around it.

{code}
CREATE TABLE q26_spark_sql_run_query_0_temp (
  cid BIGINT,
  id1 double, id2 double, id3 double, id4 double, id5 double,
  id6 double, id7 double, id8 double, id9 double, id10 double,
  id11 double, id12 double, id13 double, id14 double, id15 double
)

INSERT INTO TABLE q26_spark_sql_run_query_0_temp
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=12 THEN 1 ELSE NULL END) AS id12,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15
FROM store_sales ss
INNER JOIN item i ON (ss.ss_item_sk = i.i_item_sk
  AND i.i_category IN ('Books')
  AND ss.ss_customer_sk IS NOT NULL)
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
ORDER BY cid
{code}

{code}
16/05/05 14:50:03 WARN scheduler.TaskSetManager: Lost task 12.0 in stage 162.0 (TID 15153, node6): java.lang.IllegalArgumentException: Comparison method violates its general contract!
	at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794)
	at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
	at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
	at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
	at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
	at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:295)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:330)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at
{code}
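For context on the exception itself (this is background on the error message, not a claim about the root cause of the Spark bug above): TimSort raises "Comparison method violates its general contract!" when it detects an internally inconsistent comparator. One classic way a comparator becomes inconsistent is integer-subtraction overflow:

```java
import java.util.Comparator;

// Demonstrates (in plain Java, unrelated to Spark's internal sorter) how a
// comparator can violate the contract that TimSort checks: `a - b` overflows
// for extreme values, reversing the apparent ordering.
public class BrokenComparator {
    static final Comparator<Integer> BAD  = (a, b) -> a - b;  // overflows
    static final Comparator<Integer> GOOD = Integer::compare; // never overflows
}
```

With the broken comparator, `compare(Integer.MIN_VALUE, 1)` overflows to a positive value, so the comparator claims `MIN_VALUE > 1`, violating transitivity over the full range.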
[jira] [Commented] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
[ https://issues.apache.org/jira/browse/SPARK-15032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273529#comment-15273529 ]

Yin Huai commented on SPARK-15032:
----------------------------------

Can you explain more about "I think the problem is that it terminates the executionHive process"? I am not sure I understand this. Thanks!

> When we create a new JDBC session, we may need to create a new session of
> executionHive
> -------------------------------------------------------------------------
>
>              Key: SPARK-15032
>              URL: https://issues.apache.org/jira/browse/SPARK-15032
>          Project: Spark
>       Issue Type: Bug
>       Components: SQL
>         Reporter: Yin Huai
>         Priority: Critical
>
> Right now, we only use executionHive in the thriftserver. When we create a
> new JDBC session, we probably need to create a new session of executionHive.
> I am not sure what will break if we leave the code as is, but it feels safer
> to create a new session of executionHive.
[jira] [Created] (SPARK-15171) Deprecate registerTempTable and add dataset.createTempView
Sean Zhong created SPARK-15171:
-------------------------------

    Summary: Deprecate registerTempTable and add dataset.createTempView
        Key: SPARK-15171
        URL: https://issues.apache.org/jira/browse/SPARK-15171
    Project: Spark
 Issue Type: Bug
   Reporter: Sean Zhong
   Priority: Minor
[jira] [Commented] (SPARK-14809) R Examples: Check for new R APIs requiring example code in 2.0
[ https://issues.apache.org/jira/browse/SPARK-14809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273492#comment-15273492 ]

Yanbo Liang commented on SPARK-14809:
-------------------------------------

I'm glad to help with this.
[jira] [Resolved] (SPARK-11395) Support over and window specification in SparkR
[ https://issues.apache.org/jira/browse/SPARK-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-11395.
-------------------------------------------
       Resolution: Fixed
         Assignee: Sun Rui
    Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/10094

> Support over and window specification in SparkR
> -----------------------------------------------
>
>              Key: SPARK-11395
>              URL: https://issues.apache.org/jira/browse/SPARK-11395
>          Project: Spark
>       Issue Type: New Feature
>       Components: SparkR
> Affects Versions: 1.5.1
>         Reporter: Sun Rui
>         Assignee: Sun Rui
>        Fix For: 2.0.0
>
> 1. Implement over() in the Column class.
> 2. Support window spec
>    (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.WindowSpec)
> 3. Support utility functions for defining windows in DataFrames
>    (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Window)
[jira] [Commented] (SPARK-10043) Add window functions into SparkR
[ https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273475#comment-15273475 ]

Shivaram Venkataraman commented on SPARK-10043:
-----------------------------------------------

[~sunrui] Can we resolve this issue now?

> Add window functions into SparkR
> --------------------------------
>
>         Key: SPARK-10043
>         URL: https://issues.apache.org/jira/browse/SPARK-10043
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>    Reporter: Yu Ishikawa
>
> Add the following window functions in SparkR. I think we should also improve
> the {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber
[jira] [Created] (SPARK-15170) Log error message in ExecutorAllocationManager
meiyoula created SPARK-15170:
-----------------------------

    Summary: Log error message in ExecutorAllocationManager
        Key: SPARK-15170
        URL: https://issues.apache.org/jira/browse/SPARK-15170
    Project: Spark
 Issue Type: Bug
 Components: Spark Core
   Reporter: meiyoula

No matter how long an executor has actually been idle, the log just says "it has been idle for $executorIdleTimeoutS seconds". Because executorIdleTimeoutS = conf.getTimeAsSeconds("spark.dynamicAllocation.executorIdleTimeout", "60s"), the same expiry time is logged for every executor.
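A sketch of the kind of fix the report implies (hypothetical names; the real code lives in Spark's ExecutorAllocationManager): compute and log the executor's actual idle duration instead of interpolating the configured timeout.

```java
// Hypothetical sketch: derive the logged idle duration from when the executor
// actually went idle, not from the spark.dynamicAllocation.executorIdleTimeout
// config value that the reported message always prints.
public class IdleLogDemo {
    static String idleMessage(long idleStartMillis, long nowMillis) {
        long idleSeconds = (nowMillis - idleStartMillis) / 1000;
        return "Removing executor because it has been idle for " + idleSeconds + " seconds";
    }
}
```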
[jira] [Commented] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273416#comment-15273416 ] Apache Spark commented on SPARK-15074: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/12944 > Spark shuffle service bottlenecked while fetching large amount of > intermediate data > --- > > Key: SPARK-15074 > URL: https://issues.apache.org/jira/browse/SPARK-15074 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia > > While running a job which produces more than 90TB of intermediate data, we > find that about 10-15% of the reducer execution time is being spent in > shuffle fetch. > Jstack of the shuffle service reveals that most of the time the shuffle > service is reading the index files generated by the mapper. > {code} > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.DataInputStream.readFully(DataInputStream.java:195) > at java.io.DataInputStream.readLong(DataInputStream.java:416) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:277) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:190) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue is that for each shuffle fetch, we reopen and read the same index file. It would be much more efficient if we could avoid opening the same file multiple times by caching the data. We can use an LRU cache to store the index file information. This way we can also limit the number of entries in the cache so that memory usage does not grow indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
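The LRU-cache remedy described in the issue can be sketched as follows. This is an illustrative sketch in Python, not the shuffle service's actual Java implementation; the `IndexCache` name, the `max_entries` parameter, and the `load` callback are hypothetical.

```python
from collections import OrderedDict


class IndexCache:
    """Bounded LRU cache mapping an index-file path to its parsed contents."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, path, load):
        # On a hit, move the entry to the most-recently-used position
        # and skip re-reading the file.
        if path in self._entries:
            self._entries.move_to_end(path)
            return self._entries[path]
        # On a miss, read the index file once and cache the result.
        value = load(path)
        self._entries[path] = value
        # Evict the least-recently-used entry so memory stays bounded.
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return value
```

With such a cache, repeated fetches for the same map output hit the cached offsets instead of reopening the index file, and `max_entries` caps the memory the cache can consume.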
[jira] [Assigned] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15074: Assignee: Apache Spark > Spark shuffle service bottlenecked while fetching large amount of > intermediate data > --- > > Key: SPARK-15074 > URL: https://issues.apache.org/jira/browse/SPARK-15074 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Assignee: Apache Spark > > While running a job which produces more than 90TB of intermediate data, we > find that about 10-15% of the reducer execution time is being spent in > shuffle fetch. > Jstack of the shuffle service reveals that most of the time the shuffle > service is reading the index files generated by the mapper. > {code} > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.DataInputStream.readFully(DataInputStream.java:195) > at java.io.DataInputStream.readLong(DataInputStream.java:416) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:277) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:190) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue is that for each shuffle fetch, we reopen and read the same index file. It would be much more efficient if we could avoid opening the same file multiple times by caching the data. We can use an LRU cache to store the index file information. This way we can also limit the number of entries in the cache so that memory usage does not grow indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Assigned] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15074: Assignee: (was: Apache Spark) > Spark shuffle service bottlenecked while fetching large amount of > intermediate data > --- > > Key: SPARK-15074 > URL: https://issues.apache.org/jira/browse/SPARK-15074 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia > > While running a job which produces more than 90TB of intermediate data, we > find that about 10-15% of the reducer execution time is being spent in > shuffle fetch. > Jstack of the shuffle service reveals that most of the time the shuffle > service is reading the index files generated by the mapper. > {code} > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.DataInputStream.readFully(DataInputStream.java:195) > at java.io.DataInputStream.readLong(DataInputStream.java:416) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:277) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:190) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue is that for each shuffle fetch, we 
reopen and read the same index file. It would be much more efficient if we could avoid opening the same file multiple times by caching the data. We can use an LRU cache to store the index file information. This way we can also limit the number of entries in the cache so that memory usage does not grow indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-14963) YarnShuffleService should use YARN getRecoveryPath() for leveldb location
[ https://issues.apache.org/jira/browse/SPARK-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273412#comment-15273412 ] Saisai Shao commented on SPARK-14963: - OK, I will do it. > YarnShuffleService should use YARN getRecoveryPath() for leveldb location > - > > Key: SPARK-14963 > URL: https://issues.apache.org/jira/browse/SPARK-14963 > Project: Spark > Issue Type: Improvement > Components: Shuffle, YARN >Affects Versions: 1.6.1 >Reporter: Thomas Graves > > The YarnShuffleService currently just picks a directory in the YARN local dirs to store the leveldb file. YARN added an interface in Hadoop 2.5, getRecoveryPath(), to get the location where it should be storing this. We should change it to use getRecoveryPath(). This does mean we will have to use reflection or similar to check for its existence, though, since the method doesn't exist before Hadoop 2.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
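The existence check described above — call getRecoveryPath() when the running Hadoop version provides it, and fall back to the old local-dirs behavior otherwise — would be done in Java with java.lang.reflect. The sketch below illustrates the same fall-back pattern in Python using getattr; the `recovery_path` function and the two context classes are hypothetical names for illustration only.

```python
def recovery_path(context, fallback):
    # Prefer the newer API when the runtime provides it (analogous to
    # getRecoveryPath() existing only on Hadoop >= 2.5); otherwise fall
    # back to the older behavior.
    method = getattr(context, "getRecoveryPath", None)
    if callable(method):
        return method()
    return fallback


class NewYarnContext:
    """Stands in for a Hadoop >= 2.5 context that has the method."""
    def getRecoveryPath(self):
        return "/yarn/recovery"


class OldYarnContext:
    """Stands in for an older context without the method."""
    pass
```

The same decision — probe for the method once, then dispatch — is what a reflection-based check in the Java service would amount to.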
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273396#comment-15273396 ] Shivaram Venkataraman commented on SPARK-15159: --- Is it being removed or is it being deprecated in 2.0? If it's being removed, then we need to make this a priority. > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273391#comment-15273391 ] Apache Spark commented on SPARK-15168: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12943 > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15168: Assignee: (was: Apache Spark) > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15168: Assignee: Apache Spark > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273379#comment-15273379 ] Felix Cheung commented on SPARK-15159: -- With the updated goal, this seems to be a fairly big change, how do we want to proceed? > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15169) Consider improving HasSolver to allow generalization
holdenk created SPARK-15169: --- Summary: Consider improving HasSolver to allow generalization Key: SPARK-15169 URL: https://issues.apache.org/jira/browse/SPARK-15169 Project: Spark Issue Type: Improvement Components: ML Reporter: holdenk Priority: Trivial The current HasSolver shared param has a fixed default value of "auto" and no validation. Some algorithms (see `MultilayerPerceptronClassifier`) have different default values or validators. This results in either a mostly duplicated param (as in `MultilayerPerceptronClassifier`) or incorrect scaladoc (as in `GeneralizedLinearRegression`). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
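The generalization suggested here — letting each algorithm supply its own default value and validator instead of inheriting the fixed "auto" with no validation — can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`SolverParam`, `allowed`); it is not the actual spark.ml shared-params code.

```python
class SolverParam:
    """A solver param whose default and allowed values vary per algorithm."""

    def __init__(self, default="auto", allowed=None):
        self.default = default
        # None means "accept any value" (the current HasSolver behavior);
        # an algorithm can pass an explicit list to get validation.
        self.allowed = allowed
        self.value = default

    def set(self, value):
        if self.allowed is not None and value not in self.allowed:
            raise ValueError(
                "solver must be one of %s, got %r" % (self.allowed, value))
        self.value = value
        return self


# An algorithm like MultilayerPerceptronClassifier could then declare its
# own default and validator instead of duplicating the whole param:
mlp_solver = SolverParam(default="l-bfgs", allowed=["l-bfgs", "gd"])
```

With this shape, the shared param carries the documentation and plumbing once, while each estimator customizes only the default and the validator.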
[jira] [Updated] (SPARK-13566) Deadlock between MemoryStore and BlockManager
[ https://issues.apache.org/jira/browse/SPARK-13566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13566: -- Assignee: cen yuhai > Deadlock between MemoryStore and BlockManager > - > > Key: SPARK-13566 > URL: https://issues.apache.org/jira/browse/SPARK-13566 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.0 > Environment: Spark 1.6.0 hadoop2.2.0 jdk1.8.0_65 centOs 6.2 >Reporter: cen yuhai >Assignee: cen yuhai > > === > "block-manager-slave-async-thread-pool-1": > at org.apache.spark.storage.MemoryStore.remove(MemoryStore.scala:216) > - waiting to lock <0x0005895b09b0> (a > org.apache.spark.memory.UnifiedMemoryManager) > at > org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1114) > - locked <0x00058ed6aae0> (a org.apache.spark.storage.BlockInfo) > at > org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101) > at > org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101) > at scala.collection.immutable.Set$Set2.foreach(Set.scala:94) > at > org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1101) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65) > at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:84) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > "Executor task launch worker-10": > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1032) > - waiting to lock <0x00059a0988b8> (a > org.apache.spark.storage.BlockInfo) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009) > at > org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460) > at > org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
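The two stacks above show the classic deadlock shape: one thread holds a block's info lock and waits for the UnifiedMemoryManager lock, while the other (inside evictBlocksToFreeSpace) holds the memory-manager lock and waits for the block lock. A standard remedy is to impose one global lock-acquisition order. The Python sketch below illustrates that generic fix; it is not the patch actually merged for this ticket, and the lock names are illustrative.

```python
import threading

memory_manager_lock = threading.Lock()  # stands in for the memory-manager lock
block_info_lock = threading.Lock()      # stands in for a per-block info lock

results = []


def with_both_locks(fn):
    # Every code path acquires the locks in the same fixed order
    # (memory manager first, then block info), so no thread can hold
    # one lock while waiting on a thread that holds the other.
    with memory_manager_lock:
        with block_info_lock:
            results.append(fn())


t1 = threading.Thread(target=with_both_locks, args=(lambda: "evict",))
t2 = threading.Thread(target=with_both_locks, args=(lambda: "remove",))
t1.start(); t2.start()
t1.join(); t2.join()
```

If the two threads instead acquired the locks in opposite orders — as the removeBroadcast and eviction paths do in the stacks above — each could end up holding one lock while blocked on the other.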
[jira] [Updated] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15168: Description: MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. (was: MultilayerPerceptronClassifier is missing Tol, solver, and weights. Add these params.) > Add missing params to Python's MultilayerPerceptronClassifier > - > > Key: SPARK-15168 > URL: https://issues.apache.org/jira/browse/SPARK-15168 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > MultilayerPerceptronClassifier is missing step size, solver, and weights. Add > these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier
holdenk created SPARK-15168: --- Summary: Add missing params to Python's MultilayerPerceptronClassifier Key: SPARK-15168 URL: https://issues.apache.org/jira/browse/SPARK-15168 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: holdenk Priority: Trivial MultilayerPerceptronClassifier is missing Tol, solver, and weights. Add these params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-15159: Description: HiveContext is to be deprecated in 2.0. Replace them with SparkSession.withHiveSupport in SparkR (was: HiveContext is to be deprecated in 2.0. However, there are several usages of HiveContext in SparkR unit test cases. Replace them with SparkSession.withHiveSupport.) > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace them with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-15159: Summary: Remove usage of HiveContext in SparkR. (was: Remove usage of HiveContext in SparkR unit test cases.) > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. However, there are several usages of > HiveContext in SparkR unit test cases. Replace them with > SparkSession.withHiveSupport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15167) Add public catalog implementation method to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15167: Assignee: Andrew Or (was: Apache Spark) > Add public catalog implementation method to SparkSession > > > Key: SPARK-15167 > URL: https://issues.apache.org/jira/browse/SPARK-15167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Right now there's no way to check whether a given SparkSession has Hive > support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but > that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15167) Add public catalog implementation method to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15167: Assignee: Apache Spark (was: Andrew Or) > Add public catalog implementation method to SparkSession > > > Key: SPARK-15167 > URL: https://issues.apache.org/jira/browse/SPARK-15167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > Right now there's no way to check whether a given SparkSession has Hive > support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but > that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15167) Add public catalog implementation method to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273368#comment-15273368 ] Apache Spark commented on SPARK-15167: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12942 > Add public catalog implementation method to SparkSession > > > Key: SPARK-15167 > URL: https://issues.apache.org/jira/browse/SPARK-15167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Right now there's no way to check whether a given SparkSession has Hive > support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but > that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15166: Assignee: Andrew Or (was: Apache Spark) > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273354#comment-15273354 ] Apache Spark commented on SPARK-15166: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12941 > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15166: Assignee: Apache Spark (was: Andrew Or) > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15165: - Priority: Critical (was: Major) > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta >Priority: Critical > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" is put before "u", like "\\u", in a string > literal in the query, codegen can break. > The following code causes a compilation error. > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of the comment). > Due to this unsafety, arbitrary code can be injected as follows. > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
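A genuinely comment-safe escape has to neutralize unicode escapes for any backslash count and also break up literal comment delimiters, since either route can smuggle "*/" into the generated comment. The function below is a generic sketch of that idea in Python; it is not the fix Spark actually merged for SPARK-15165, and `to_comment_safe` is a hypothetical name.

```python
def to_comment_safe(s):
    # Double every backslash. Afterwards every backslash run has even
    # length, so the backslash immediately before any "u" is preceded by
    # an odd number of backslashes and (per JLS 3.3) can no longer start
    # a \uXXXX unicode escape in the generated Java source.
    out = s.replace("\\", "\\\\")
    # Also break up literal comment delimiters so nothing in the string
    # can terminate (or reopen) the enclosing block comment.
    out = out.replace("*/", "*\\/").replace("/*", "/\\*")
    return out
```

Running the delimiter step after the doubling step matters: doubling never creates a "*/" or "/*" pair, so a single pass over each delimiter suffices.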
[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15165: - Target Version/s: 2.0.0 > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" is put before "u", like "\\u", in a string > literal in the query, codegen can break. > The following code causes a compilation error. > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of the comment). > Due to this unsafety, arbitrary code can be injected as follows. > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15152) Scaladoc and Code style Improvements
[ https://issues.apache.org/jira/browse/SPARK-15152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15152. --- Resolution: Fixed Assignee: Jacek Laskowski Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Scaladoc and Code style Improvements > > > Key: SPARK-15152 > URL: https://issues.apache.org/jira/browse/SPARK-15152 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, Spark Core, SQL, YARN >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 2.0.0 > > > While doing code reviews for the Spark Notes I found many places with typos > and incorrect code style. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15167) Add public catalog implementation method to SparkSession
Andrew Or created SPARK-15167: - Summary: Add public catalog implementation method to SparkSession Key: SPARK-15167 URL: https://issues.apache.org/jira/browse/SPARK-15167 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Right now there's no way to check whether a given SparkSession has Hive support. You can do `spark.conf.get("spark.sql.catalogImplementation")` but that's supposed to be hidden from the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15166: -- Summary: Move hive-specific conf setting from SparkSession (was: Move hive-specific conf setting to HiveSharedState) > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15166) Move hive-specific conf setting to HiveSharedState
Andrew Or created SPARK-15166: - Summary: Move hive-specific conf setting to HiveSharedState Key: SPARK-15166 URL: https://issues.apache.org/jira/browse/SPARK-15166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-15165: --- Description: The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. But if an even number of "\" appears before "u" in the string literal in the query, as in "\\u", codegen can still break. The following code causes a compilation error: {code} val df = Seq(...).toDF df.select("'\\u002A/'").show {code} The compilation error occurs because "\\u002A/" is translated into "*/" (the end of a comment). Due to this unsafety, arbitrary code can be injected, as follows: {code} val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\u002A/{System.exit(1);}/*'").show {code} was: The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. But if an even number of "\" appears before "u" in the string literal in the query, as in "\\u", codegen can still break. The following code causes a compilation error: {code} val df = Seq(...).toDF df.select("'\\u002A/'").show {code} The compilation error occurs because "\\u002A/" is translated into "*/" (the end of a comment). Due to this unsafety, arbitrary code can be injected, as follows: {code} val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\u002A/{System.exit(1);}/*'").show {code} > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking > codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error:
> {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
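To make the hazard concrete, here is a small standalone Java sketch. The method names are invented for illustration and this is not Spark's actual toCommentSafeString or its eventual fix: buggyEscape mimics the flawed strategy of prefixing "\u" with one extra backslash, which turns a two-backslash run into a three-backslash run and thereby *creates* an active unicode escape; saferEscape doubles every backslash first, so no backslash ahead of a "u" can ever start an escape, and then breaks up literal "*/" sequences.

```java
// Hypothetical demo of the escaping flaw; NOT Spark's implementation.
public class CommentEscapeDemo {

    // Flawed strategy: prefix every "\u" with one extra backslash.
    // "\\u002A/" (two backslashes) becomes "\\\u002A/" (three): the third
    // backslash is preceded by an even number of backslashes, so javac
    // would now decode \u002A as '*' -- even inside a comment.
    static String buggyEscape(String s) {
        return s.replace("\\u", "\\\\u");
    }

    // Safer sketch: double every backslash (all runs become even-length,
    // so the backslash before any 'u' is preceded by an odd count and can
    // never begin a unicode escape), then defuse comment terminators.
    static String saferEscape(String s) {
        return s.replace("\\", "\\\\").replace("*/", "*\\/");
    }
}
```

With the buggy version, a generated comment containing the escaped text still decodes to "*/" and can terminate the comment early, which is exactly the injection shown in the ticket.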
[jira] [Comment Edited] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273241#comment-15273241 ] Sital Kedia edited comment on SPARK-15074 at 5/5/16 10:32 PM: -- Okay, I made a change to cache the index file and that made the shuffle read time twice as fast. I am going to put out a PR for that change soon. Now I see the shuffle service is spending most of the time in the FileChannelImpl.transferTo method (Refer to the stack trace below). I wonder if there is a way to speed it up further? {code} java.lang.Thread.State: RUNNABLE at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:427) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:492) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:607) at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96) at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89) at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254) at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:311) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:729) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1127) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:693) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:681) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:716) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:954) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:244) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:184) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:129) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:100) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at
[jira] [Commented] (SPARK-15074) Spark shuffle service bottlenecked while fetching large amount of intermediate data
[ https://issues.apache.org/jira/browse/SPARK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273241#comment-15273241 ] Sital Kedia commented on SPARK-15074: - Okay, I made a change to cache the index file and that made the shuffle read time twice as fast. I am going to put out a PR for that change soon. Now I see the shuffle service is spending most of the time in the FileChannelImpl.transferTo method (Refer to the stack trace below). I wonder if there is a way to speed it up further? java.lang.Thread.State: RUNNABLE at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:427) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:492) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:607) at org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96) at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:89) at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254) at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:311) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:729) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1127) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:693) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:681) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:716) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:954) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:244) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:184) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:129) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:100) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at
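The index-file caching described in the comment can be sketched with a bounded LRU map. All names below are hypothetical and this is not the actual Spark patch: the point is only that each shuffle index file is a tiny array of offsets delimiting the reduce partitions in the data file, so keeping recently used offset arrays in memory avoids re-opening and re-reading the same small file on every fetch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of index-file caching; NOT the actual Spark change.
public class IndexCache {
    private final Map<String, long[]> cache;

    public IndexCache(final int maxEntries) {
        // accessOrder=true makes LinkedHashMap iterate least-recently-used
        // first, so evicting the "eldest" entry gives LRU behavior.
        this.cache = new LinkedHashMap<String, long[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, long[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // offsets[i]..offsets[i+1] delimit reduce partition i in the data file.
    // The loader reads the index file from disk only on a cache miss.
    public synchronized long[] getOffsets(String indexFile, Function<String, long[]> loader) {
        long[] offsets = cache.get(indexFile);   // updates recency on a hit
        if (offsets == null) {
            offsets = loader.apply(indexFile);
            cache.put(indexFile, offsets);       // may evict the LRU entry
        }
        return offsets;
    }
}
```

A cache like this trades a bounded amount of heap for one fewer disk read per fetch; the transferTo path itself is zero-copy and largely bound by disk and network throughput.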
[jira] [Assigned] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15165: Assignee: Apache Spark > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error: > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273238#comment-15273238 ] Apache Spark commented on SPARK-15165: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/12939 > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error: > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
[ https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15165: Assignee: (was: Apache Spark) > Codegen can break because toCommentSafeString is not actually safe > -- > > Key: SPARK-15165 > URL: https://issues.apache.org/jira/browse/SPARK-15165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta > > The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. > But if an even number of "\" appears before "u" in the string literal in the > query, as in "\\u", codegen can still break. > The following code causes a compilation error: > {code} > val df = Seq(...).toDF > df.select("'\\u002A/'").show > {code} > The compilation error occurs because "\\u002A/" is translated > into "*/" (the end of a comment). > Due to this unsafety, arbitrary code can be injected, as follows: > {code} > val df = Seq(...).toDF > // Inject "System.exit(1)" > df.select("'\\u002A/{System.exit(1);}/*'").show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe
Kousuke Saruta created SPARK-15165: -- Summary: Codegen can break because toCommentSafeString is not actually safe Key: SPARK-15165 URL: https://issues.apache.org/jira/browse/SPARK-15165 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Kousuke Saruta The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking codegen. But if an even number of "\" appears before "u" in the string literal in the query, as in "\\u", codegen can still break. The following code causes a compilation error: {code} val df = Seq(...).toDF df.select("'\\u002A/'").show {code} The compilation error occurs because "\\u002A/" is translated into "*/" (the end of a comment). Due to this unsafety, arbitrary code can be injected, as follows: {code} val df = Seq(...).toDF // Inject "System.exit(1)" df.select("'\\u002A/{System.exit(1);}/*'").show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14977) Fine grained mode in Mesos is not fair
[ https://issues.apache.org/jira/browse/SPARK-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt closed SPARK-14977. --- Resolution: Not A Problem > Fine grained mode in Mesos is not fair > -- > > Key: SPARK-14977 > URL: https://issues.apache.org/jira/browse/SPARK-14977 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 > Environment: Spark commit db75ccb, Debian jessie, Mesos fine grained >Reporter: Luca Bruno > > I've setup a mesos cluster and I'm running spark in fine grained mode. > Spark defaults to 2 executor cores and 2gb of ram. > The total mesos cluster has 8 cores and 8gb of ram. > When I submit two spark jobs simultaneously, spark will always accept full > resources, leading the two frameworks to use 4gb of ram each instead of 2gb. > If I submit another spark job, it will not get offered resources from mesos, > at least using the default HierarchicalDRF allocator module. > Mesos will keep offering 4gb of ram to earlier spark jobs, and spark keeps > accepting full resources for every new task. > Hence new spark jobs have no chance of getting a share. > Is this something to be solved with a custom mesos allocator? Or spark should > be more fair instead? Or maybe provide a configuration option to always > accept with the minimum resources? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14977) Fine grained mode in Mesos is not fair
[ https://issues.apache.org/jira/browse/SPARK-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273202#comment-15273202 ] Michael Gummelt commented on SPARK-14977: - [~lethalman]: Fine-grained mode only releases cores, not memory. It's impossible for us to shrink the memory allocation without OOM-ing the executor, because the JVM doesn't relinquish memory back to the OS. You can use dynamic allocation to terminate entire executors as they become idle. Also, FYI, fine-grained mode will soon be deprecated in favor of dynamic allocation. > Fine grained mode in Mesos is not fair > -- > > Key: SPARK-14977 > URL: https://issues.apache.org/jira/browse/SPARK-14977 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 > Environment: Spark commit db75ccb, Debian jessie, Mesos fine grained >Reporter: Luca Bruno > > I've setup a mesos cluster and I'm running spark in fine grained mode. > Spark defaults to 2 executor cores and 2gb of ram. > The total mesos cluster has 8 cores and 8gb of ram. > When I submit two spark jobs simultaneously, spark will always accept full > resources, leading the two frameworks to use 4gb of ram each instead of 2gb. > If I submit another spark job, it will not get offered resources from mesos, > at least using the default HierarchicalDRF allocator module. > Mesos will keep offering 4gb of ram to earlier spark jobs, and spark keeps > accepting full resources for every new task. > Hence new spark jobs have no chance of getting a share. > Is this something to be solved with a custom mesos allocator? Or spark should > be more fair instead? Or maybe provide a configuration option to always > accept with the minimum resources? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
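The dynamic-allocation alternative mentioned in the comment is driven by configuration rather than code. A hedged sketch of the relevant spark-defaults.conf settings follows (the values are illustrative placeholders, not recommendations; on Mesos, dynamic allocation additionally requires the external shuffle service to be running on each agent):

```properties
# Illustrative values only -- tune for your cluster.
spark.dynamicAllocation.enabled             true
spark.shuffle.service.enabled               true
spark.dynamicAllocation.minExecutors        1
spark.dynamicAllocation.maxExecutors        4
spark.dynamicAllocation.executorIdleTimeout 60s
```

With these settings, idle executors are torn down after the idle timeout, which frees both cores and memory back to Mesos, rather than only cores as in fine-grained mode.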
[jira] [Assigned] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15162: Assignee: Apache Spark > Update PySpark LogisticRegression threshold PyDoc to be as complete as > Scaladoc > --- > > Key: SPARK-15162 > URL: https://issues.apache.org/jira/browse/SPARK-15162 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > > The PyDoc for setting and getting the threshold in logistic regression > doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273180#comment-15273180 ] Apache Spark commented on SPARK-15162: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12938 > Update PySpark LogisticRegression threshold PyDoc to be as complete as > Scaladoc > --- > > Key: SPARK-15162 > URL: https://issues.apache.org/jira/browse/SPARK-15162 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Trivial > > The PyDoc for setting and getting the threshold in logistic regression > doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273181#comment-15273181 ] Apache Spark commented on SPARK-15164: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12938 > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15164: Assignee: Apache Spark > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15162: Assignee: (was: Apache Spark) > Update PySpark LogisticRegression threshold PyDoc to be as complete as > Scaladoc > --- > > Key: SPARK-15162 > URL: https://issues.apache.org/jira/browse/SPARK-15162 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Trivial > > The PyDoc for setting and getting the threshold in logistic regression > doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15164: Assignee: (was: Apache Spark) > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14893. --- Resolution: Fixed Assignee: Dilip Biswal Fix Version/s: 2.0.0 > Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed > --- > > Key: SPARK-14893 > URL: https://issues.apache.org/jira/browse/SPARK-14893 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > The test was disabled in https://github.com/apache/spark/pull/12585. To > re-enable it we need to rebuild the jar using the updated source code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14812) ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-14812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-14812: --- Assignee: DB Tsai > ML, Graph 2.0 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-14812 > URL: https://issues.apache.org/jira/browse/SPARK-14812 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: DB Tsai > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15158) Too aggressive logging in SizeBasedRollingPolicy?
[ https://issues.apache.org/jira/browse/SPARK-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15158: -- Assignee: Kai Wang > Too aggressive logging in SizeBasedRollingPolicy? > - > > Key: SPARK-15158 > URL: https://issues.apache.org/jira/browse/SPARK-15158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Kai Wang >Assignee: Kai Wang >Priority: Trivial > Fix For: 2.0.0 > > > The questionable line is this: > https://github.com/apache/spark/blob/3e27940a19e7bab448f1af11d2065ecd1ec66197/core/src/main/scala/org/apache/spark/util/logging/RollingPolicy.scala#L116 > This will output a message *whenever* anything is logged at executor level. > Like the following: > SizeBasedRollingPolicy:59 83 + 140796 > 1048576 > SizeBasedRollingPolicy:59 83 + 140879 > 1048576 > SizeBasedRollingPolicy:59 83 + 140962 > 1048576 > ... > This seems too aggressive. Should this at least be downgraded to debug level? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9926) Parallelize file listing for partitioned Hive table
[ https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9926. -- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Parallelize file listing for partitioned Hive table > --- > > Key: SPARK-9926 > URL: https://issues.apache.org/jira/browse/SPARK-9926 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Ryan Blue > Fix For: 2.0.0 > > > In Spark SQL, short queries like {{select * from table limit 10}} run very > slowly against partitioned Hive tables because of file listing. In > particular, if a large number of partitions are scanned on storage like S3, > the queries run extremely slowly. Here are some example benchmarks in my > environment- > * Parquet-backed Hive table > * Partitioned by dateint and hour > * Stored on S3 > ||\# of partitions||\# of files||runtime||query|| > |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit > 10;| > |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;| > |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and > dateint<=20150610 limit 10;| > The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive > partition path and group them into a UnionRDD. Then, all the input files are > listed sequentially. In other tools such as Hive and Pig, this can be solved > by setting > [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml] > high. But in Spark, since each HadoopRDD lists only one partition path, > setting this property doesn't help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
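The bottleneck described above — each HadoopRDD listing only its own partition path, so listing happens one partition at a time — can be illustrated with a small Spark-free sketch using a thread pool. This is a hypothetical illustration of the parallelization idea, not Spark's actual `TableReader` code; the helper names are invented:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_files(path):
    """List the files directly under one partition directory."""
    return [os.path.join(path, f) for f in os.listdir(path)]

def list_partitions_parallel(partition_dirs, num_threads=8):
    # Sequential listing costs one filesystem round trip per partition;
    # a thread pool overlaps the (I/O-bound) calls, which is what makes
    # the biggest difference on high-latency storage such as S3.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(list_files, partition_dirs))
    return [f for files in results for f in files]

# Tiny demo: local directories standing in for S3 partition paths.
root = tempfile.mkdtemp()
dirs = []
for dateint in (20150601, 20150602):
    d = os.path.join(root, "dateint={}".format(dateint))
    os.makedirs(d)
    open(os.path.join(d, "part-00000.parquet"), "w").close()
    dirs.append(d)

files = list_partitions_parallel(dirs)
```

In Hadoop MapReduce the analogous knob is `mapreduce.input.fileinputformat.list-status.num-threads`; the point of the JIRA is that the per-partition-RDD structure prevented that setting from helping in Spark.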
[jira] [Resolved] (SPARK-15158) Too aggressive logging in SizeBasedRollingPolicy?
[ https://issues.apache.org/jira/browse/SPARK-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15158. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Too aggressive logging in SizeBasedRollingPolicy? > - > > Key: SPARK-15158 > URL: https://issues.apache.org/jira/browse/SPARK-15158 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Kai Wang >Priority: Trivial > Fix For: 2.0.0 > > > The questionable line is this: > https://github.com/apache/spark/blob/3e27940a19e7bab448f1af11d2065ecd1ec66197/core/src/main/scala/org/apache/spark/util/logging/RollingPolicy.scala#L116 > This will output a message *whenever* anything is logged at executor level. > Like the following: > SizeBasedRollingPolicy:59 83 + 140796 > 1048576 > SizeBasedRollingPolicy:59 83 + 140879 > 1048576 > SizeBasedRollingPolicy:59 83 + 140962 > 1048576 > ... > This seems too aggressive. Should this at least be downgraded to debug level? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
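The check behind the flooding log line quoted above can be sketched as follows. This is a simplified, hypothetical Python analogue of the Scala `SizeBasedRollingPolicy`, not the actual class, written to show why the per-write message belongs at debug rather than info level:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("SizeBasedRollingPolicy")

class SizeBasedRollingPolicy:
    """Roll the output file once the bytes written would exceed a cap."""

    def __init__(self, max_bytes=1024 * 1024):
        self.max_bytes = max_bytes
        self.bytes_written = 0

    def should_rollover(self, incoming_bytes):
        # Debug, not info: this runs on *every* write, so at a higher
        # level it emits one line per logged message, exactly the
        # "83 + 140796 > 1048576" flood shown in the report.
        logger.debug("%d + %d > %d", self.bytes_written,
                     incoming_bytes, self.max_bytes)
        return self.bytes_written + incoming_bytes > self.max_bytes

    def record(self, n):
        self.bytes_written += n

policy = SizeBasedRollingPolicy(max_bytes=100)
policy.record(83)
roll = policy.should_rollover(140)  # 83 + 140 > 100, so a rollover is due
```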
[jira] [Resolved] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15134. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > This issue addresses the comments in SPARK-15031 and also fix java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15135) Make sure SparkSession thread safe
[ https://issues.apache.org/jira/browse/SPARK-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15135. --- Resolution: Fixed Fix Version/s: 2.0.0 > Make sure SparkSession thread safe > -- > > Key: SPARK-15135 > URL: https://issues.apache.org/jira/browse/SPARK-15135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15072) Remove SparkSession.withHiveSupport
[ https://issues.apache.org/jira/browse/SPARK-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15072. --- Resolution: Fixed > Remove SparkSession.withHiveSupport > --- > > Key: SPARK-15072 > URL: https://issues.apache.org/jira/browse/SPARK-15072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sandeep Singh > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273117#comment-15273117 ] Alex Bozarth commented on SPARK-10653: -- I'm currently running tests on a fix for this and will open a PR after. I have removed blockTransferService and sparkFilesDir and replaced the few references to them. ExecutorMemoryManager was already removed in SPARK-10984. I also took a quick look at the other vals in the constructor and I didn't see any other low hanging fruit to remove. > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null
[ https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273100#comment-15273100 ] Michael Armbrust commented on SPARK-15140: -- The 2.0 behavior seems correct. Ideally .toDS().collect() will always round-trip the data without change. > ensure input object of encoder is not null > -- > > Key: SPARK-15140 > URL: https://issues.apache.org/jira/browse/SPARK-15140 > Project: Spark > Issue Type: Improvement >Reporter: Wenchen Fan > > Currently we assume the input object for the encoder won't be null, but we don't > check it. For example, in 1.6 `Seq("a", null).toDS.collect` will throw NPE, > in 2.0 this will return Array("a", null). > We should define this behaviour more clearly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
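The round-trip property discussed above — `collect()` returning exactly the input, nulls included — can be stated as a small invariant check. This is a plain-Python stand-in for a null-tolerant encoder, purely illustrative; `encode`/`decode` are invented names, not Dataset encoder APIs:

```python
def encode(value):
    # Carry None through explicitly instead of assuming it never occurs
    # (the 1.6 behavior described above threw an NPE on the null element).
    return ("null", None) if value is None else ("str", value)

def decode(tagged):
    tag, payload = tagged
    return None if tag == "null" else payload

# The 2.0 behavior: Seq("a", null) round-trips to ("a", null) unchanged.
data = ["a", None]
round_tripped = [decode(encode(v)) for v in data]
```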
[jira] [Updated] (SPARK-14959) Problem Reading partitioned ORC or Parquet files
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14959: - Priority: Critical (was: Major) > Problem Reading partitioned ORC or Parquet files > - > > Key: SPARK-14959 > URL: https://issues.apache.org/jira/browse/SPARK-14959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4) >Reporter: Sebastian YEPES FERNANDEZ >Priority: Critical > > Hello, > I have noticed that in the past few days there is an issue when trying to read > partitioned files from HDFS. > I am running on Spark master branch #c544356 > The write actually works but the read fails. > {code:title=Issue Reproduction} > case class Data(id: Int, text: String) > val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, > "world"), Data(1, "there")) ) > scala> > ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> java.io.FileNotFoundException: Path is not a file: > /user/spark/test.parquet/id=0 > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at >
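The `Path is not a file: /user/spark/test.parquet/id=0` failure in the trace above comes from handing a partition *directory* to a call that expects a file. The listing the reader needs can be sketched in plain Python (an illustration of recursive, partition-aware listing; not the actual `HDFSFileCatalog` code):

```python
import os
import tempfile

def leaf_files(root):
    # Recurse through partition directories (id=0/, id=1/, ...) and
    # return only real files, so nothing downstream ever tries to open
    # a directory such as test.parquet/id=0 as if it were a data file.
    out = []
    for dirpath, _dirnames, filenames in os.walk(root):
        out.extend(os.path.join(dirpath, f) for f in filenames)
    return sorted(out)

# Recreate the partitioned layout from the reproduction above on disk.
root = tempfile.mkdtemp()
for part in ("id=0", "id=1"):
    d = os.path.join(root, part)
    os.makedirs(d)
    with open(os.path.join(d, "part-00000.parquet"), "w"):
        pass

files = leaf_files(root)
```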
[jira] [Updated] (SPARK-14959) Problem Reading partitioned ORC or Parquet files
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-14959: - Target Version/s: 2.0.0 Component/s: (was: Input/Output) SQL > Problem Reading partitioned ORC or Parquet files > - > > Key: SPARK-14959 > URL: https://issues.apache.org/jira/browse/SPARK-14959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4) >Reporter: Sebastian YEPES FERNANDEZ > > Hello, > I have noticed that in the past few days there is an issue when trying to read > partitioned files from HDFS. > I am running on Spark master branch #c544356 > The write actually works but the read fails. > {code:title=Issue Reproduction} > case class Data(id: Int, text: String) > val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, > "world"), Data(1, "there")) ) > scala> > ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet") > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> java.io.FileNotFoundException: Path is not a file: > /user/spark/test.parquet/id=0 > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) > at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227) > at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221) > at > org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372) > at > org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at >
[jira] [Updated] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth updated SPARK-10653: - Summary: Remove unnecessary things from SparkEnv (was: head) > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10653) head
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth updated SPARK-10653: - Summary: head (was: Remove unnecessary things from SparkEnv) > head > > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14813: -- Assignee: holdenk (was: Yanbo Liang) > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15159) Remove usage of HiveContext in SparkR unit test cases.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272666#comment-15272666 ] Vijay Parmar edited comment on SPARK-15159 at 5/5/16 8:31 PM: -- Hi Sun, I am interested in taking up this task but unable to assign it to myself. Can you please point me in the right direction? was (Author: vsparmar): Hi Sun, I am interested in taking up this task but unable to assign it to myself. Can you please point me in the right direction, like a link to the repo or something from where I can start? > Remove usage of HiveContext in SparkR unit test cases. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. However, there are several usages of > HiveContext in SparkR unit test cases. Replace them with > SparkSession.withHiveSupport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15159) Remove usage of HiveContext in SparkR unit test cases.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272666#comment-15272666 ] Vijay Parmar edited comment on SPARK-15159 at 5/5/16 8:31 PM: -- Hi Sun, I am interested in taking up this task. Can you please point me in the right direction? was (Author: vsparmar): Hi Sun, I am interested in taking up this task but unable to assign it to myself. Can you please point me in the right direction? > Remove usage of HiveContext in SparkR unit test cases. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. However, there are several usages of > HiveContext in SparkR unit test cases. Replace them with > SparkSession.withHiveSupport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15163) Mark experimental algorithms experimental in PySpark
[ https://issues.apache.org/jira/browse/SPARK-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15163: Component/s: PySpark > Mark experimental algorithms experimental in PySpark > > > Key: SPARK-15163 > URL: https://issues.apache.org/jira/browse/SPARK-15163 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > While we are going through them anyway, we might as well mark as experimental the > PySpark algorithms that are marked so in Scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
holdenk created SPARK-15164: --- Summary: Mark classification algorithms as experimental where marked so in scala Key: SPARK-15164 URL: https://issues.apache.org/jira/browse/SPARK-15164 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: holdenk Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15163) Mark experimental algorithms experimental in PySpark
holdenk created SPARK-15163: --- Summary: Mark experimental algorithms experimental in PySpark Key: SPARK-15163 URL: https://issues.apache.org/jira/browse/SPARK-15163 Project: Spark Issue Type: Improvement Components: ML Reporter: holdenk Priority: Trivial While we are going through them anyway, we might as well mark the PySpark algorithms as experimental that are marked so in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc
holdenk created SPARK-15162: --- Summary: Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc Key: SPARK-15162 URL: https://issues.apache.org/jira/browse/SPARK-15162 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Trivial The PyDoc for setting and getting the threshold in logistic regression doesn't have the same level of detail as the Scaladoc does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272956#comment-15272956 ] Apache Spark commented on SPARK-15092: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12937 > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Assignee: holdenk >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel in ML. The attribute exists on the MLlib > DecisionTree model. > There's no way to check or print the model tree structure from ML. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what can be done with the MLlib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > AttributeError
https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
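As the report above shows, MLlib's toDebugString output is a pre-order traversal of the tree. A minimal pure-Python sketch of how such a string can be produced recursively (this is not the MLlib implementation; `Node` and `to_debug_string` are hypothetical names used only for illustration):

```python
# Hypothetical sketch of producing a decision-tree debug string by
# recursive pre-order traversal, mimicking the MLlib output format.

class Node:
    def __init__(self, feature=None, threshold=None, prediction=None,
                 left=None, right=None):
        self.feature = feature        # split feature index (None for a leaf)
        self.threshold = threshold    # split threshold
        self.prediction = prediction  # leaf prediction value
        self.left = left              # subtree where feature <= threshold
        self.right = right            # subtree where feature > threshold

def to_debug_string(node, indent=" "):
    # Leaf: emit the prediction.
    if node.feature is None:
        return "{}Predict: {}\n".format(indent, node.prediction)
    # Internal node: emit the split, then recurse into both branches.
    s = "{}If (feature {} <= {})\n".format(indent, node.feature, node.threshold)
    s += to_debug_string(node.left, indent + " ")
    s += "{}Else (feature {} > {})\n".format(indent, node.feature, node.threshold)
    s += to_debug_string(node.right, indent + " ")
    return s

# The depth-1, 3-node tree from the example output above.
tree = Node(feature=0, threshold=0.0,
            left=Node(prediction=0.0), right=Node(prediction=1.0))
print(to_debug_string(tree), end="")
```

The printed output matches the shape of the MLlib example in the issue: an If/Else pair for the root split, with one Predict line per leaf.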
[jira] [Assigned] (SPARK-15080) Break copyAndReset into copy and reset
[ https://issues.apache.org/jira/browse/SPARK-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15080: Assignee: Apache Spark > Break copyAndReset into copy and reset > -- > > Key: SPARK-15080 > URL: https://issues.apache.org/jira/browse/SPARK-15080 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > > We should break copy and reset into two methods rather than just one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15080) Break copyAndReset into copy and reset
[ https://issues.apache.org/jira/browse/SPARK-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272945#comment-15272945 ] Apache Spark commented on SPARK-15080: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/12936 > Break copyAndReset into copy and reset > -- > > Key: SPARK-15080 > URL: https://issues.apache.org/jira/browse/SPARK-15080 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > Fix For: 2.0.0 > > > We should break copy and reset into two methods rather than just one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15080) Break copyAndReset into copy and reset
[ https://issues.apache.org/jira/browse/SPARK-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15080: Assignee: (was: Apache Spark) > Break copyAndReset into copy and reset > -- > > Key: SPARK-15080 > URL: https://issues.apache.org/jira/browse/SPARK-15080 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > Fix For: 2.0.0 > > > We should break copy and reset into two methods rather than just one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
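SPARK-15080 proposes splitting the combined copyAndReset operation into separate copy and reset methods. A hypothetical Python sketch of the motivation (the class and method names here are illustrative only, not Spark's accumulator API): once the two primitives exist, the combined operation is expressible in terms of them, while each part also remains usable on its own.

```python
# Illustrative sketch: an accumulator whose copyAndReset is decomposed
# into independent copy() and reset() operations.

class Accumulator:
    def __init__(self, value=0):
        self.value = value

    def add(self, v):
        self.value += v

    def copy(self):
        # Return an independent accumulator holding the same value.
        return Accumulator(self.value)

    def reset(self):
        # Zero this accumulator in place.
        self.value = 0

    def copy_and_reset(self):
        # The old combined operation, now built from the two parts.
        snapshot = self.copy()
        self.reset()
        return snapshot

acc = Accumulator()
acc.add(5)
snap = acc.copy_and_reset()
print(snap.value, acc.value)  # 5 0
```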
[jira] [Updated] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14915: -- Fix Version/s: 1.6.2 > Tasks that fail due to CommitDeniedException (a side-effect of speculation) > can cause job to never complete > --- > > Key: SPARK-14915 > URL: https://issues.apache.org/jira/browse/SPARK-14915 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.3, 1.6.2, 2.0.0 >Reporter: Jason Moore >Assignee: Jason Moore >Priority: Critical > Fix For: 1.6.2, 2.0.0 > > > In SPARK-14357, code was corrected towards the originally intended behavior > that a CommitDeniedException should not count towards the failure count for a > job. After having run with this fix for a few weeks, it's become apparent > that this behavior has some unintended consequences - that a speculative task > will continuously receive a CDE from the driver, now causing it to fail and > retry over and over without limit. > I'm thinking we could put a task that receives a CDE from the driver into > TaskState.FINISHED or some other state to indicate that the task shouldn't > be resubmitted by the TaskScheduler. I'd probably need some opinions on > whether there are other consequences of doing something like this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15138) Linkify ML PyDoc regression
[ https://issues.apache.org/jira/browse/SPARK-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272884#comment-15272884 ] holdenk commented on SPARK-15138: - cc [~yanboliang] > Linkify ML PyDoc regression > --- > > Key: SPARK-15138 > URL: https://issues.apache.org/jira/browse/SPARK-15138 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14854) Left outer join produces incorrect output when the join condition does not have left table key
[ https://issues.apache.org/jira/browse/SPARK-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272865#comment-15272865 ] kanika dhuria commented on SPARK-14854: --- Why do you think they are the same issue? I was expecting all of the left table's data when the join condition is false. Even when I have a condition like $"num1".===(lit(10)), the result is empty. > Left outer join produces incorrect output when the join condition does not > have left table key > -- > > Key: SPARK-14854 > URL: https://issues.apache.org/jira/browse/SPARK-14854 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: kanika dhuria > > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val s = StructType(StructField("num", StringType, true)::Nil) > val s1 = StructType(StructField("num1", StringType, true)::Nil) > val m = > sc.textFile("file:/tmp/master.txt").map(_.split(",")).map(p=>Row(p(0))) > val d = > sc.textFile("file:/tmp/detail.txt").map(_.split(",")).map(p=>Row(p(0))) > val m1 = sqlContext.createDataFrame(m, s1) > val d1 = sqlContext.createDataFrame(d, s) > val j1 = d1.join(m1,$"num1".===(lit(null)),"left_outer"); > j1.take(1) > Returns empty data set. Left table has data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
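For reference, standard left-outer-join semantics keep every left-side row and pad unmatched rows with nulls, even when the join condition is always false — which is the behavior the reporter above expects. A minimal pure-Python sketch of those semantics (illustrative only, not Spark's join implementation):

```python
# Reference semantics of a left outer join: every left row appears in
# the output; rows with no match get None on the right side.
def left_outer_join(left, right, cond):
    out = []
    for l in left:
        matches = [r for r in right if cond(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

left = ["a", "b"]
right = ["x"]
# A join condition that is always false still yields all left rows.
result = left_outer_join(left, right, lambda l, r: False)
print(result)  # [('a', None), ('b', None)]
```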
[jira] [Resolved] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15110. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12887 [https://github.com/apache/spark/pull/12887] > SparkR - Implement repartitionByColumn on DataFrame > --- > > Key: SPARK-15110 > URL: https://issues.apache.org/jira/browse/SPARK-15110 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Narine Kokhlikyan > Fix For: 2.0.0 > > > Implement repartitionByColumn on DataFrame. > This will allow us to run R functions on each partition identified by column > groups with the dapply() method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
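The idea behind repartitioning by column is hash partitioning on the column's value: rows that share a key land in the same partition, which is what lets per-group functions such as dapply() run partition-locally. An illustrative pure-Python sketch of that idea (not Spark's implementation; the function and variable names here are hypothetical):

```python
# Illustrative sketch of hash-partitioning rows by a column's value,
# the idea behind repartitioning a DataFrame by column: all rows with
# the same key end up in the same partition.
def repartition_by_column(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Deterministic key -> partition mapping via hash modulo.
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
parts = repartition_by_column(rows, "k", 4)
# Every row with k == "a" lands in exactly one partition index.
a_parts = {i for i, p in enumerate(parts) for r in p if r["k"] == "a"}
print(len(a_parts))  # 1
```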
[jira] [Updated] (SPARK-14811) ML, Graph 2.0 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14811: -- Assignee: Yanbo Liang > ML, Graph 2.0 QA: API: New Scala APIs, docs > --- > > Key: SPARK-14811 > URL: https://issues.apache.org/jira/browse/SPARK-14811 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14813: -- Assignee: Yanbo Liang > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15161) Consider moving featureImportances into TreeEnsemble models base class
holdenk created SPARK-15161: --- Summary: Consider moving featureImportances into TreeEnsemble models base class Key: SPARK-15161 URL: https://issues.apache.org/jira/browse/SPARK-15161 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Minor Right now each of the subclasses has its own implementation; we could consider moving it to the base class (after 2.0). cc [~mlnick] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org