[jira] [Commented] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564586#comment-15564586 ] Herman van Hovell commented on SPARK-17868: --- [~jiangxb] can you work on this one? > Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS > > > Key: SPARK-17868 > URL: https://issues.apache.org/jira/browse/SPARK-17868 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > We generate bitmasks for grouping sets during the parsing process, and use > these during analysis. These bitmasks are difficult to work with in practice > and have led to numerous bugs. I suggest that we remove these and use actual > sets instead; however, we would need to generate the offsets for the > grouping_id. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17868: -- Assignee: (was: Herman van Hovell) > Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS > > > Key: SPARK-17868 > URL: https://issues.apache.org/jira/browse/SPARK-17868 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell > > We generate bitmasks for grouping sets during the parsing process, and use > these during analysis. These bitmasks are difficult to work with in practice > and have led to numerous bugs. I suggest that we remove these and use actual > sets instead; however, we would need to generate the offsets for the > grouping_id.
[jira] [Created] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
Herman van Hovell created SPARK-17868: - Summary: Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS Key: SPARK-17868 URL: https://issues.apache.org/jira/browse/SPARK-17868 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell We generate bitmasks for grouping sets during the parsing process, and use these during analysis. These bitmasks are difficult to work with in practice and have led to numerous bugs. I suggest that we remove these and use actual sets instead; however, we would need to generate the offsets for the grouping_id.
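For illustration, the relationship between the two representations discussed in this issue can be sketched as follows. This is a hypothetical sketch, not Spark's actual code: it keeps the grouping sets as explicit Python sets and derives the grouping_id bitmask only at the end, with bit i set to 1 when the i-th grouping column is absent from the set (the convention grouping_id exposes).

```python
# Hypothetical sketch of deriving grouping_id offsets from explicit sets;
# not Spark's implementation.

def grouping_id(all_group_cols, grouping_set):
    """Derive the grouping_id bitmask from an explicit set of columns.

    The first grouping column maps to the most significant bit; a bit is
    1 when that column is NOT part of the grouping set.
    """
    gid = 0
    for col in all_group_cols:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

# CUBE(a, b) expands to all subsets of {a, b}:
cols = ["a", "b"]
cube_sets = [{"a", "b"}, {"a"}, {"b"}, set()]
print([grouping_id(cols, s) for s in cube_sets])  # [0, 1, 2, 3]
```

Keeping the sets around during parsing and analysis, and computing the bitmask only where grouping_id is needed, avoids carrying bit-level bookkeeping through the whole pipeline.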
[jira] [Resolved] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9560. -- Resolution: Won't Fix > Add LDA data generator > -- > > Key: SPARK-9560 > URL: https://issues.apache.org/jira/browse/SPARK-9560 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add data generator for LDA. > Hope it can help with performance improvement.
[jira] [Resolved] (SPARK-15957) RFormula supports forcing to index label
[ https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15957. - Resolution: Fixed Fix Version/s: 2.1.0 > RFormula supports forcing to index label > > > Key: SPARK-15957 > URL: https://issues.apache.org/jira/browse/SPARK-15957 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > RFormula currently indexes the label only when it is of string type. If the label > is of numeric type and we use RFormula to fit a classification model, there > are no label attributes in the label column metadata. The label attributes are > useful when making predictions for classification, so we can force indexing the > label with {{StringIndexer}}, whether it is of numeric or string type, for > classification. Then SparkR wrappers can extract label attributes from the label > column metadata successfully. This feature can help us fix bugs similar > to SPARK-15153. > For regression, we will still keep the label as numeric type. > In this PR, we add a param indexLabel to control whether to force indexing the > label for RFormula.
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564578#comment-15564578 ] Sean Owen commented on SPARK-15343: --- [~jdesmet] I'm not clear what you're advocating _in Spark_. See the discussion above. You're running into a problem with the YARN classpath. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. 
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) >
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564577#comment-15564577 ] Wenchen Fan commented on SPARK-17865: - All global temp views should be gone when the SparkContext is stopped. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564558#comment-15564558 ] Felix Cheung commented on SPARK-17865: -- I haven't kept up on SharedState before this but it looks to be tied to the SparkContext. In R, SparkSession is the entrypoint and the SparkContext underneath comes and goes with the session - is there anything global when the SparkContext is stopped? [~cloud_fan] > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Created] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
Liang-Chi Hsieh created SPARK-17867: --- Summary: Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name Key: SPARK-17867 URL: https://issues.apache.org/jira/browse/SPARK-17867 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh In Dataset.dropDuplicates we find and use the first resolved attribute from the output with the given column name. When there is more than one column with the same name, the other columns are put into the aggregation columns instead of the grouping columns. We should fix this.
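The resolution problem described above can be mimicked in a few lines of plain Python. This is an illustrative sketch of the bug mechanism, not Spark code: resolving a requested name to only the *first* matching attribute means a second column with the same name falls through to the aggregation side.

```python
# Hypothetical sketch of the name-resolution bug described above; not Spark code.

def resolve_grouping_cols(schema, dedup_names):
    """Mimic resolving dropDuplicates columns by first name match only.

    Returns (grouping column indices, aggregated column indices).
    """
    grouping, aggregated = [], []
    for idx, name in enumerate(schema):
        # Only the FIRST occurrence of a requested name becomes a grouping column.
        if name in dedup_names and schema.index(name) == idx:
            grouping.append(idx)
        else:
            # Everything else, including the second column also named "a",
            # is treated as an aggregation column.
            aggregated.append(idx)
    return grouping, aggregated

schema = ["a", "a", "b"]  # two columns share the name "a"
print(resolve_grouping_cols(schema, {"a"}))  # ([0], [1, 2])
```

A fix would resolve *all* attributes whose name matches, so both columns named "a" end up in the grouping columns.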
[jira] [Resolved] (SPARK-17844) DataFrame API should simplify defining frame boundaries without partitioning/ordering
[ https://issues.apache.org/jira/browse/SPARK-17844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17844. --- Resolution: Fixed Fix Version/s: 2.1.0 Resolved per Reynold's PR. > DataFrame API should simplify defining frame boundaries without > partitioning/ordering > - > > Key: SPARK-17844 > URL: https://issues.apache.org/jira/browse/SPARK-17844 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > When I was creating the example code for SPARK-10496, I realized it was > pretty convoluted to define the frame boundaries for window functions when > there is no partition column or ordering column. The reason is that we don't > provide a way to create a WindowSpec directly with the frame boundaries. We > can trivially improve this by adding rowsBetween and rangeBetween to the Window > object.
[jira] [Created] (SPARK-17866) Dataset.dropDuplicates (i.e., distinct) should not change the output of child plan
Liang-Chi Hsieh created SPARK-17866: --- Summary: Dataset.dropDuplicates (i.e., distinct) should not change the output of child plan Key: SPARK-17866 URL: https://issues.apache.org/jira/browse/SPARK-17866 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh We currently create a new Alias with a new exprId in Dataset.dropDuplicates. However, this causes a problem when we want to select the columns as follows: {code} val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS() // ds("_2") will cause analysis exception ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int]) {code}
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564534#comment-15564534 ] Felix Cheung commented on SPARK-17865: -- I can take this. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17719) Unify and tie up options in a single place in JDBC datasource API
[ https://issues.apache.org/jira/browse/SPARK-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17719: Assignee: Hyukjin Kwon > Unify and tie up options in a single place in JDBC datasource API > - > > Key: SPARK-17719 > URL: https://issues.apache.org/jira/browse/SPARK-17719 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > - JDBC options are arbitrarily located as {{Map[String, String]}}, > {{Properties}} and {{JDBCOptions}}. It'd be great if we can unify the usages. > - Also, there are several options not placed in {{JDBCOptions}}, namely > {{batchsize}}, {{fetchsize}} and {{isolationlevel}}. It'd be good if we put > them together. > - {{batchSize}} and {{isolationLevel}} are undocumented. > - We could verify and check the arguments before actually executing. For > example, it seems {{fetchsize}} is checked only after execution, and it seems > to throw an exception wrapped by {{SparkException}}.
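The consolidation proposed in this issue can be sketched in a few lines. This is a hypothetical illustration of the idea, not Spark's JDBCOptions class: all options live in one place, and known numeric options are validated up front rather than failing mid-query.

```python
# Hypothetical sketch of a single options holder with upfront validation;
# class and option names are illustrative, not Spark's API.

class JDBCOptions:
    KNOWN_INT_OPTIONS = {"batchsize", "fetchsize"}

    def __init__(self, url, table, options=None):
        self.url = url
        self.table = table
        # Normalize keys so "fetchSize" and "fetchsize" land in one place.
        self.options = {k.lower(): v for k, v in (options or {}).items()}
        self._validate()

    def _validate(self):
        # Check arguments before execution instead of surfacing a wrapped
        # exception only after the query has started.
        for key in self.KNOWN_INT_OPTIONS & self.options.keys():
            value = int(self.options[key])
            if value < 0:
                raise ValueError(f"{key} must be non-negative, got {value}")

opts = JDBCOptions("jdbc:postgresql://db/test", "t1", {"fetchSize": "100"})
print(opts.options["fetchsize"])  # 100
```

The point is the shape, not the details: one class owns every option, so nothing is scattered across a Map, a Properties instance, and ad-hoc lookups.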
[jira] [Updated] (SPARK-17776) Potentially duplicated names which might have conflicts between JDBC options and properties instance
[ https://issues.apache.org/jira/browse/SPARK-17776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17776: Assignee: Hyukjin Kwon > Potentially duplicated names which might have conflicts between JDBC options > and properties instance > > > Key: SPARK-17776 > URL: https://issues.apache.org/jira/browse/SPARK-17776 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This was discussed here - > https://github.com/apache/spark/pull/15292#discussion_r81128083 > Currently, `read.format("jdbc").option(...)`, > `write.format("jdbc").option(...)` and `write.jdbc(...)` write all the > options into a `Properties` instance, which might cause conflicts. > We should not pass those Spark-only options into JDBC drivers but instead handle > them within Spark.
[jira] [Resolved] (SPARK-17776) Potentially duplicated names which might have conflicts between JDBC options and properties instance
[ https://issues.apache.org/jira/browse/SPARK-17776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17776. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15292 [https://github.com/apache/spark/pull/15292] > Potentially duplicated names which might have conflicts between JDBC options > and properties instance > > > Key: SPARK-17776 > URL: https://issues.apache.org/jira/browse/SPARK-17776 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This was discussed here - > https://github.com/apache/spark/pull/15292#discussion_r81128083 > Currently, `read.format("jdbc").option(...)`, > `write.format("jdbc").option(...)` and `write.jdbc(...)` write all the > options into a `Properties` instance, which might cause conflicts. > We should not pass those Spark-only options into JDBC drivers but instead handle > them within Spark.
[jira] [Resolved] (SPARK-17719) Unify and tie up options in a single place in JDBC datasource API
[ https://issues.apache.org/jira/browse/SPARK-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17719. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15292 [https://github.com/apache/spark/pull/15292] > Unify and tie up options in a single place in JDBC datasource API > - > > Key: SPARK-17719 > URL: https://issues.apache.org/jira/browse/SPARK-17719 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > - JDBC options are arbitrarily located as {{Map[String, String]}}, > {{Properties}} and {{JDBCOptions}}. It'd be great if we can unify the usages. > - Also, there are several options not placed in {{JDBCOptions}}, namely > {{batchsize}}, {{fetchsize}} and {{isolationlevel}}. It'd be good if we put > them together. > - {{batchSize}} and {{isolationLevel}} are undocumented. > - We could verify and check the arguments before actually executing. For > example, it seems {{fetchsize}} is checked only after execution, and it seems > to throw an exception wrapped by {{SparkException}}.
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564495#comment-15564495 ] Reynold Xin commented on SPARK-17865: - [~jiangxb1987] never mind -- there is already the Python API. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) Python API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564492#comment-15564492 ] Reynold Xin commented on SPARK-17865: - Ah OK - can you then make sure the Python API is updated together with your follow-up PR? Yes, let me change this to R. > Python API for global temp view > --- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564494#comment-15564494 ] Reynold Xin commented on SPARK-17865: - cc [~felixcheung] [~yanboliang] do you know anybody who could take on this work? > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17865: Description: We need to add the R API for managing global temp views, mirroring the changes in SPARK-17338. was: We need to add the Python API for managing global temp views, mirroring the changes in SPARK-17338. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17865: Summary: R API for global temp view (was: Python API for global temp view) > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) Python API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564489#comment-15564489 ] Wenchen Fan commented on SPARK-17865: - The Python API is already added, but the R API hasn't been. Should we update this JIRA to cover the R API? > Python API for global temp view > --- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Created] (SPARK-17865) Python API for global temp view
Reynold Xin created SPARK-17865: --- Summary: Python API for global temp view Key: SPARK-17865 URL: https://issues.apache.org/jira/browse/SPARK-17865 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin We need to add the Python API for managing global temp views, mirroring the changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) Python API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564484#comment-15564484 ] Reynold Xin commented on SPARK-17865: - cc [~cloud_fan] [~jiangxb1987] would you be interested in doing this? > Python API for global temp view > --- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17338) Add global temp view support
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17338: Summary: Add global temp view support (was: add global temp view support) > Add global temp view support > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Updated] (SPARK-17338) add global temp view support
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17338: Summary: add global temp view support (was: add global temp view) > add global temp view support > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Updated] (SPARK-17338) Add global temp view support
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17338: Description: A global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database global_temp (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1. > Add global temp view support > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > > A global temporary view is a cross-session temporary view, which means it's > shared among all sessions. Its lifetime is the lifetime of the Spark > application, i.e. it will be automatically dropped when the application > terminates. It's tied to a system-preserved database global_temp (configurable > via SparkConf), and we must use the qualified name to refer to a global temp > view, e.g. SELECT * FROM global_temp.view1.
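The session-local vs. global distinction described above can be modeled in a few lines of plain Python. This is a hypothetical illustration of the semantics, not Spark code: local views live in a per-session registry, while global views live in one registry shared by all sessions and are reached only through the qualified global_temp name.

```python
# Hypothetical model of session-local vs. global temp views; not Spark code.

GLOBAL_TEMP_DB = "global_temp"  # system-preserved database name
global_views = {}               # shared by all sessions, lives for the app

class Session:
    def __init__(self):
        self.local_views = {}   # dropped when this session ends

    def create_temp_view(self, name, df):
        self.local_views[name] = df

    def create_global_temp_view(self, name, df):
        global_views[name] = df

    def table(self, name):
        db, _, view = name.partition(".")
        if db == GLOBAL_TEMP_DB:       # must use the qualified name
            return global_views[view]
        return self.local_views[name]  # unqualified -> session-local only

s1, s2 = Session(), Session()
s1.create_global_temp_view("view1", "data")
print(s2.table("global_temp.view1"))  # visible from a different session
```

A session-local view created on s1 would not be visible from s2, which is exactly the gap global temp views fill.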
[jira] [Commented] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564479#comment-15564479 ] Sean Owen commented on SPARK-17858: --- Yeah, the related JIRA gives an argument that we shouldn't do this. You end up more easily silently ignoring data if it doesn't fail the query. I'm not that sure this is a good idea. > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a SQL query. However, the user may just > want to skip corrupt files and still run the query. > Another painful thing is that the current exception doesn't contain the paths of > the corrupt files, which makes it hard for the user to fix their files. > Note: In Spark 1.6, Spark SQL always skips corrupt files because of > SPARK-17850. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
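The requested behavior amounts to a per-file try/except with two improvements over the status quo: an opt-in switch, and an error message that names the offending path. A hypothetical sketch (option and function names are illustrative, not Spark's API):

```python
# Hypothetical sketch of optional corrupt-file skipping; not Spark code.
import logging

def read_files(paths, parse_file, ignore_corrupt_files=True):
    """Parse each file; skip or fail on corrupt ones depending on the flag."""
    rows, corrupt = [], []
    for path in paths:
        try:
            rows.extend(parse_file(path))
        except Exception:
            if not ignore_corrupt_files:
                # Include the offending path so the user can fix the file.
                raise RuntimeError(f"Corrupt file: {path}")
            corrupt.append(path)
            logging.warning("Skipping corrupt file: %s", path)
    return rows, corrupt

def parse(path):  # toy parser standing in for a real reader
    if "bad" in path:
        raise ValueError("malformed")
    return [path.upper()]

print(read_files(["a.json", "bad.json"], parse))  # (['A.JSON'], ['bad.json'])
```

Sean's concern above maps directly onto the default: if `ignore_corrupt_files` defaults to skipping, data is silently dropped, which is why the safer default is to fail with the path in the error.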
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564472#comment-15564472 ] Reynold Xin commented on SPARK-3577: [~dreamworks007] can you take a look at the problem here? https://github.com/apache/spark/pull/15347 > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter).
[jira] [Commented] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564464#comment-15564464 ] Apache Spark commented on SPARK-17864: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15426 > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17864: Assignee: Apache Spark (was: Reynold Xin) > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17864: Assignee: Reynold Xin (was: Apache Spark) > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17799) InterfaceStability annotation
[ https://issues.apache.org/jira/browse/SPARK-17799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17799: Labels: releasenotes (was: ) > InterfaceStability annotation > - > > Key: SPARK-17799 > URL: https://issues.apache.org/jira/browse/SPARK-17799 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > Based on discussions on the dev list > (http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-separate-API-annotation-into-two-components-InterfaceAudience-amp-InterfaceStability-td17470.html#none), > there is consensus to introduce an InterfaceStability annotation to > eventually replace the current DeveloperApi / Experimental annotations. > This is an umbrella ticket to track its progress.
[jira] [Updated] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17864: Labels: releasenotes (was: ) > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
Reynold Xin created SPARK-17864: --- Summary: Mark data type APIs as stable, rather than DeveloperApi Key: SPARK-17864 URL: https://issues.apache.org/jira/browse/SPARK-17864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin The data type API has not been changed since Spark 1.3.0, and is ready for graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563038#comment-15563038 ] Gaoxiang Liu edited comment on SPARK-3577 at 10/11/16 4:49 AM: --- [~rxin] I find that the spill size metric is already added in https://github.com/apache/spark/commit/bb8098f203e6faddf2e1a04b03d62037e6c7#diff-1bd3dc38f6306e0a822f93d62c32b1d0, and I have confirmed it in the UI (please refer to the attachment of this JIRA - https://issues.apache.org/jira/secure/attachment/12832515/spill_size.jpg). Also, we noticed that it's weird that the spill size is somehow not reported in the reducer, but is reported in the mapper. Back to the previous question: for the spill time, if it's still relevant to add, I plan to work on it if there are no objections. was (Author: dreamworks007): I find that the spill size metric is already added in https://github.com/apache/spark/commit/bb8098f203e6faddf2e1a04b03d62037e6c7#diff-1bd3dc38f6306e0a822f93d62c32b1d0, and I have confirmed it in the UI (please refer to the attachment of this JIRA - https://issues.apache.org/jira/secure/attachment/12832515/spill_size.jpg). Also, we noticed that it's weird that the spill size is somehow not reported in the reducer, but is reported in the mapper. Back to the previous question: for the spill time, if it's still relevant to add, I plan to work on it if there are no objections. > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used.
> We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
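For reference, the metric being proposed can be sketched in a few lines. This is an illustrative Python mock-up of a per-task spill-time accumulator (the class and field names are invented, not Spark's), in the spirit of the existing {{ShuffleWriteMetrics}} write-time counter:

```python
import time

class SpillMetrics:
    """Hypothetical per-task metric: total time spent spilling to disk."""

    def __init__(self):
        self.spill_time_ns = 0  # accumulated across all spills of the task
        self.spill_count = 0

    def record_spill(self, spill_fn):
        # time one spill and fold it into the task-level total
        start = time.perf_counter_ns()
        result = spill_fn()
        self.spill_time_ns += time.perf_counter_ns() - start
        self.spill_count += 1
        return result
```

The UI could then report `spill_time_ns` alongside shuffle write time, rather than losing it.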
[jira] [Commented] (SPARK-17816) Json serialization of accumulators is failing with ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564410#comment-15564410 ] Apache Spark commented on SPARK-17816: -- User 'seyfe' has created a pull request for this issue: https://github.com/apache/spark/pull/15425 > Json serialzation of accumulators are failing with > ConcurrentModificationException > -- > > Key: SPARK-17816 > URL: https://issues.apache.org/jira/browse/SPARK-17816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ergin Seyfe >Assignee: Ergin Seyfe > Fix For: 2.1.0 > > > This is the stack trace: See {{ConcurrentModificationException}}: > {code} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) > at scala.collection.AbstractTraversable.to(Traversable.scala:104) > at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) > at scala.collection.AbstractTraversable.toList(Traversable.scala:104) > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at > 
org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) > at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
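The trace above shows the event-logging thread iterating an accumulator collection while a task thread mutates it. The same failure mode, and the usual remedy of serializing a snapshot rather than the live collection, can be reproduced in pure Python — illustration only, not the actual Spark fix:

```python
def to_json_live(accums):
    # Iterating the live dict while it grows raises RuntimeError,
    # Python's analogue of Java's ConcurrentModificationException.
    out = {}
    for name in accums:
        out[name] = accums[name]
        accums["extra_" + name] = 0  # simulate a concurrent update
    return out

def to_json_snapshot(accums):
    snapshot = dict(accums)  # copy first, then iterate the copy safely
    for name in snapshot:
        accums["extra_" + name] = 0  # concurrent updates no longer break us
    return snapshot
```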
[jira] [Closed] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-4630. -- Resolution: Duplicate Assignee: (was: Kostas Sakellis) > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions and the > number of tasks: larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large the partitions executed by a task should be. > If partitions are too small, the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, task > execution slows down due to GC pressure and spilling to disk. > To increase job performance, users often hand-optimize the number (size) > of partitions that the next stage gets. Factors that come into play are: > - incoming partition sizes from the previous stage > - number of available executors > - available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to do the > partition sizing for the user automatically. This feature can be turned on/off with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take > partition sizes into account upon stage completion. Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number of tasks to create. Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work.
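The heuristic proposed in the ticket can be sketched compactly. A hedged illustration — the 128 MB target and all names are invented for the example; the real design would live in the DAGScheduler and is deferred to the design doc:

```python
def dynamic_num_partitions(map_output_bytes, target_partition_bytes=128 << 20,
                           min_partitions=1):
    """Pick a task count for the next stage from the previous stage's
    per-partition output sizes (bytes) and a target partition size."""
    total = sum(map_output_bytes)
    # ceiling division: enough partitions that the average partition
    # stays at or below the target size
    return max(min_partitions, -(-total // target_partition_bytes))
```

A fuller version would also cap the count by available executor memory, as the factor list above suggests.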
[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564350#comment-15564350 ] Takeshi Yamamuro commented on SPARK-14393: -- Since a coalesce() placed just after monotonicallyIncreasingId() breaks its semantics, it's okay not to support that. For example, a coalesce() before monotonicallyIncreasingId() is fine;
{code}
scala> spark.range(10).repartition(5).coalesce(1).select(monotonicallyIncreasingId()).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
|                            4|
|                            5|
|                            6|
|                            7|
|                            8|
|                            9|
+-----------------------------+
{code}
> monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0), leading to non-monotonically > increasing IDs.
> See examples below:
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                25769803776|
> |                51539607552|
> |                77309411328|
> |               103079215104|
> |               128849018880|
> |               163208757248|
> |               188978561024|
> |               214748364800|
> |               240518168576|
> |               266287972352|
> +---------------------------+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> +---------------------------+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          1|
> |                          0|
> |                          0|
> |                          1|
> |                          2|
> |                          3|
> |                          0|
> |                          1|
> |                          2|
> +---------------------------+
> {code}
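The outputs above follow from how the ID is built: monotonically_increasing_id places the partition index in the upper 31 bits of a 64-bit long and a per-partition record counter in the lower 33 bits, so partition 3 starts at 3 * 2^33 = 25769803776. When a coalesce runs after the expression, the merged partitions each restart their counter at 0, producing the duplicate zeros. A minimal sketch of the documented encoding:

```python
RECORD_BITS = 33  # lower 33 bits: record number within the partition

def monotonic_id(partition_index, record_number):
    """Sketch of the monotonically_increasing_id bit layout."""
    return (partition_index << RECORD_BITS) + record_number
```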
[jira] [Resolved] (SPARK-17816) Json serialization of accumulators is failing with ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17816. -- Resolution: Fixed Assignee: Ergin Seyfe Fix Version/s: 2.1.0 > Json serialzation of accumulators are failing with > ConcurrentModificationException > -- > > Key: SPARK-17816 > URL: https://issues.apache.org/jira/browse/SPARK-17816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ergin Seyfe >Assignee: Ergin Seyfe > Fix For: 2.1.0 > > > This is the stack trace: See {{ConcurrentModificationException}}: > {code} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) > at scala.collection.AbstractTraversable.to(Traversable.scala:104) > at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) > at scala.collection.AbstractTraversable.toList(Traversable.scala:104) > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at scala.Option.map(Option.scala:146) > 
at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) > at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15814) Aggregator can return null result
[ https://issues.apache.org/jira/browse/SPARK-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15814: - Fix Version/s: 2.0.1 2.1.0 > Aggregator can return null result > - > > Key: SPARK-15814 > URL: https://issues.apache.org/jira/browse/SPARK-15814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15577. -- Resolution: Not A Problem > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564316#comment-15564316 ] Hyukjin Kwon commented on SPARK-15577: -- Let me please close this as a not-a-problem. Please revoke my change if anyone feels it is not right. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564302#comment-15564302 ] Jakob Odersky commented on SPARK-15577: --- this cleaning of jiras is really good to see :) Considering that spark 2.0 has already shipped with the type alias, I think it is safe to close this ticket. We can always reopen it if necessary. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558875#comment-15558875 ] Xiao Li edited comment on SPARK-9265 at 10/11/16 3:14 AM: -- This has been resolved since Limit and Sort are executed as a `TakeOrderedAndProject` operator. Closing it now. Thanks! was (Author: smilegator): This has been resolved since our Optimizer pushes `Limit` down below `Sort`. Closing it now. Thanks!
> Dataframe.limit joined with another dataframe can be non-deterministic > -- > > Key: SPARK-9265 > URL: https://issues.apache.org/jira/browse/SPARK-9265 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Priority: Critical > >
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), $"a.suiteName" === $"b.suiteName")
> (1 to 10).foreach { i =>
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows:
> {code}
> +----------------+---------+
> |       suiteName|failCount|
> +----------------+---------+
> |org.apache.spark|       85|
> |org.apache.spark|       26|
> |org.apache.spark|       26|
> |org.apache.spark|       17|
> |org.apache.spark|       17|
> |org.apache.spark|       15|
> |org.apache.spark|       13|
> |org.apache.spark|       13|
> |org.apache.spark|       11|
> |org.apache.spark|        9|
> +----------------+---------+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}
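For intuition, the resolution described above turns orderBy(...).limit(k) into a single deterministic top-k operation instead of an arbitrary row cutoff. A plain-Python sketch in the spirit of `TakeOrderedAndProject` (illustration, not Spark's implementation):

```python
import heapq

def take_ordered(rows, k, key):
    # deterministic top-k by key, independent of input partition order
    return heapq.nsmallest(k, rows, key=key)
```

Because the result depends only on the key ordering, repeated joins against it count the same rows every time, unlike a bare limit.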
[jira] [Resolved] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16896. - Resolution: Fixed Fix Version/s: 2.1.0 > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
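The read.csv behaviour the reporter describes — appending a counter to repeated header names — can be sketched briefly. Illustrative only; the exact suffix scheme Spark adopted for this fix may differ:

```python
def dedup_columns(names):
    """Append a numeric suffix to repeated column names.
    Naive: does not guard against a suffixed name that already
    exists in the input header."""
    seen = {}
    out = []
    for name in names:
        n = seen.get(name, 0)
        out.append(name if n == 0 else "%s%d" % (name, n))
        seen[name] = n + 1
    return out
```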
[jira] [Updated] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16896: Assignee: Hyukjin Kwon > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Assignee: Hyukjin Kwon > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17738: - Component/s: Tests > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17738: - Labels: flaky-test (was: ) > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17738: - Issue Type: Test (was: Bug) > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17738. -- Resolution: Fixed Fix Version/s: 2.0.2 > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17772) Add helper testing methods for instance weighting
[ https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564174#comment-15564174 ] Seth Hendrickson commented on SPARK-17772: -- I'm working on this. > Add helper testing methods for instance weighting > - > > Key: SPARK-17772 > URL: https://issues.apache.org/jira/browse/SPARK-17772 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > More and more ML algos are accepting instance weights. We keep replicating > code to test instance weighting in every test suite, which will get out of > hand rather quickly. We can and should implement some generic instance weight > test helper methods so that we can reduce duplicated code and standardize > these tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564169#comment-15564169 ] Ioana Delaney commented on SPARK-17626: --- [~ron8hu] Thank you for your comments. Our current star schema detection uses simple, basic heuristics to identify the star join with the largest fact table and places it on the driving arm of the join. Even with this simple, intuitive join reordering, the performance results show very good improvement. The next step would be to improve the schema detection logic based on cardinality heuristics and then with more reliable informational RI constraints. When CBO implements the planning rules, the two algorithms can be integrated. Regarding predicate selectivity hint, it can also be used to diagnose query planning/performance. By changing predicate selectivity, we can influence the choice of a plan and thus test various planning options. > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ioana Delaney >Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. 
This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. > \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time 1,467 > Min time 3 > Mean time 145 > Geomean 44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged 48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. Selectivity hint is optional and it was not used in the above TPC-DS > tests. > \\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
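[Editor's note] The heuristic described above — put the star join with the largest fact table on the driving arm, joining the more selective dimensions first — can be sketched in miniature. This is an illustrative toy, not the actual Spark patch; the table names, row counts, and the `reorder_star_join` helper are all assumptions for the example.

```python
# Toy sketch of the star-schema join-reordering heuristic: treat the
# largest table as the fact table and drive the join from it, taking the
# smaller (more selective) dimension tables first.

def reorder_star_join(tables):
    """tables: dict of table name -> estimated row count.
    Returns a join order with the fact table on the driving arm."""
    fact = max(tables, key=tables.get)                        # largest = fact table
    dims = sorted((t for t in tables if t != fact), key=tables.get)
    return [fact] + dims                                      # fact first, then dims by size

order = reorder_star_join({
    "store_sales": 2_800_000_000,   # hypothetical fact table
    "date_dim": 73_000,
    "item": 300_000,
    "customer": 12_000_000,
})
print(order)  # ['store_sales', 'date_dim', 'item', 'customer']
```

A real implementation would use cardinality estimates and, later, RI constraints rather than raw row counts, as the comment above anticipates.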
[jira] [Created] (SPARK-17863) SELECT distinct does not work if there is an order by clause
Yin Huai created SPARK-17863: Summary: SELECT distinct does not work if there is a order by clause Key: SPARK-17863 URL: https://issues.apache.org/jira/browse/SPARK-17863 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical {code} select distinct struct.a, struct.b from ( select named_struct('a', 1, 'b', 2, 'c', 3) as struct union all select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp order by struct.a, struct.b {code} This query generates {code} +---+---+ | a| b| +---+---+ | 1| 2| | 1| 2| +---+---+ {code} The plan is wrong {code} == Parsed Logical Plan == 'Sort ['struct.a ASC, 'struct.b ASC], true +- 'Distinct +- 'Project ['struct.a, 'struct.b] +- 'SubqueryAlias tmp +- 'Union :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Analyzed Logical Plan == a: int, b: int Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Distinct +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805] +- SubqueryAlias tmp +- Union :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Optimized Logical Plan == Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805] +- Union :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- OneRowRelation$ +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- OneRowRelation$ == Physical Plan == *Project [a#21819, b#21820] +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0 +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Exchange hashpartitioning(a#21819, 
b#21820, struct#21805, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Union :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- Scan OneRowRelation[] +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- Scan OneRowRelation[] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
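[Editor's note] The plan dump above shows the failure mode: the whole `struct` column is kept in the Aggregate keys so the ORDER BY can still resolve it, which means "distinct" runs over (a, b, struct) rather than (a, b). A plain-Python illustration of that mechanism (not Spark code; the row layout is an assumption for the example):

```python
# Two source rows agree on (a, b) but differ in the struct's c field.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 4},
]

# Correct semantics: distinct over the projected columns only.
correct = {(r["a"], r["b"]) for r in rows}

# What the buggy plan computes: distinct over (a, b, whole struct),
# so both rows survive and the query returns a duplicate (1, 2) pair.
buggy = {(r["a"], r["b"], tuple(r.values())) for r in rows}

print(len(correct))  # 1
print(len(buggy))    # 2 -- matches the duplicated output shown above
```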
[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is an order by clause
[ https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17863: - Labels: correctness (was: ) > SELECT distinct does not work if there is a order by clause > --- > > Key: SPARK-17863 > URL: https://issues.apache.org/jira/browse/SPARK-17863 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > Labels: correctness > > {code} > select distinct struct.a, struct.b > from ( > select named_struct('a', 1, 'b', 2, 'c', 3) as struct > union all > select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp > order by struct.a, struct.b > {code} > This query generates > {code} > +---+---+ > | a| b| > +---+---+ > | 1| 2| > | 1| 2| > +---+---+ > {code} > The plan is wrong > {code} > == Parsed Logical Plan == > 'Sort ['struct.a ASC, 'struct.b ASC], true > +- 'Distinct >+- 'Project ['struct.a, 'struct.b] > +- 'SubqueryAlias tmp > +- 'Union > :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805] > : +- OneRowRelation$ > +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806] >+- OneRowRelation$ > == Analyzed Logical Plan == > a: int, b: int > Project [a#21819, b#21820] > +- Sort [struct#21805.a ASC, struct#21805.b ASC], true >+- Distinct > +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, > struct#21805] > +- SubqueryAlias tmp > +- Union >:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805] >: +- OneRowRelation$ >+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806] > +- OneRowRelation$ > == Optimized Logical Plan == > Project [a#21819, b#21820] > +- Sort [struct#21805.a ASC, struct#21805.b ASC], true >+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, > struct#21805] > +- Union > :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] > : +- OneRowRelation$ > +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] > +- OneRowRelation$ > == Physical Plan == > *Project [a#21819, 
b#21820] > +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0 >+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200) > +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], > output=[a#21819, b#21820, struct#21805]) > +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200) > +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], > functions=[], output=[a#21819, b#21820, struct#21805]) >+- Union > :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS > struct#21805] > : +- Scan OneRowRelation[] > +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS > struct#21806] > +- Scan OneRowRelation[] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17338) add global temp view
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564138#comment-15564138 ] Apache Spark commented on SPARK-17338: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15424 > add global temp view > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564115#comment-15564115 ] Hyukjin Kwon commented on SPARK-15577: -- WDYT - [~holdenk] > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564115#comment-15564115 ] Hyukjin Kwon edited comment on SPARK-15577 at 10/11/16 1:32 AM: WDYT? - [~holdenk] was (Author: hyukjin.kwon): WDYT - [~holdenk] > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564097#comment-15564097 ] Weichen Xu edited comment on SPARK-17139 at 10/11/16 1:25 AM: -- I'm working on it hard and will create PR this week, thanks! was (Author: weichenxu123): I'm working on it hardly and will create PR this week, thanks! > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564103#comment-15564103 ] Hyukjin Kwon commented on SPARK-15577: -- Cool, then I guess we might be able to take an action on the status of this JIRA? > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564097#comment-15564097 ] Weichen Xu commented on SPARK-17139: I'm working on it hardly and will create PR this week, thanks! > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564071#comment-15564071 ] holdenk commented on SPARK-4630: I also agree this would be really good to revisit, from talking with users at conferences and feedback on High Performance Spark I think this is a very real issue that a lot of users find challenging to deal with. Just my $0.0064 :p :) > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions to the > number of tasks - larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large partitions should be that get executed > by a task. If partitions are too small, then the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, then task > execution slows down due to gc pressure and spilling to disk. > To increase performance of jobs, users often hand optimize the number(size) > of partitions that the next stage gets. Factors that come into play are: > Incoming partition sizes from previous stage > number of available executors > available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to automatically do the > partition sizing for the user. This feature can be turned off/on with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take into > account partition sizes upon stage completion. Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number tasks to create. 
Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
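[Editor's note] The sizing rule the proposal describes — derive the next stage's partition count from the previous stage's output size, bounded by available parallelism — might look roughly like the following. The 128 MB target and the helper name are illustrative assumptions, not Spark configuration.

```python
import math

def suggest_num_partitions(total_shuffle_bytes, num_executors, cores_per_executor,
                           target_bytes=128 * 1024 * 1024):
    """Pick a partition count that keeps partitions near target_bytes
    while still using every available core at least once."""
    by_size = math.ceil(total_shuffle_bytes / target_bytes)
    min_parallelism = num_executors * cores_per_executor
    return max(by_size, min_parallelism)

# 100 GB of shuffle output on 10 executors x 4 cores -> 800 partitions of ~128 MB
print(suggest_num_partitions(100 * 1024**3, 10, 4))  # 800
```

A production version would also have to account for per-executor memory (e.g. the spilling and GC pressure mentioned above), which this sketch deliberately omits.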
[jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query
[ https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16980: Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > Load only catalog table partition metadata required to answer a query > - > > Key: SPARK-16980 > URL: https://issues.apache.org/jira/browse/SPARK-16980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Michael Allman > > Currently, when a user reads from a partitioned Hive table whose metadata are > not cached (and for which Hive table conversion is enabled and supported), > all partition metadata are fetched from the metastore: > https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260 > However, if the user's query includes partition pruning predicates then we > only need the subset of these metadata which satisfy those predicates. > This issue tracks work to modify the current query planning scheme so that > unnecessary partition metadata are not loaded. > I've prototyped two possible approaches. The first extends > {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally > applicable. It requires some new abstractions and refactoring of > {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater > burden on other implementations of {{ExternalCatalog}}. Currently the only > other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my > prototype throws an {{UnsupportedOperationException}} on that implementation. > The second prototype is simpler and only touches code in the {{hive}} > project. Basically, conversion of a partitioned {{MetastoreRelation}} to > {{HadoopFsRelation}} is deferred to physical planning. 
During physical > planning, the partition pruning filters in the query plan are used to > identify the required partition metadata and a {{HadoopFsRelation}} is built > from those. The new query plan is then re-injected into the physical planner > and proceeds as normal for a {{HadoopFsRelation}}. > On the Spark dev mailing list, [~ekhliang] expressed a preference for the > approach I took in my first POC. (See > http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html) > Based on that, I'm going to open a PR with that patch as a starting point > for an architectural/design review. It will not be a complete patch ready for > integration into Spark master. Rather, I would like to get early feedback on > the implementation details so I can shape the PR before committing a large > amount of time on a finished product. I will open another PR for the second > approach for comparison if requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
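[Editor's note] The core idea in SPARK-16980 — don't fetch all partition metadata up front, only the subset satisfying the query's pruning predicates — reduces to something like this sketch. The catalog layout and predicate shape here are hypothetical stand-ins, not the actual Spark or Hive classes.

```python
# Hypothetical catalog: partition spec -> partition location.
all_partitions = {
    ("ds", "2016-10-09"): "/warehouse/t/ds=2016-10-09",
    ("ds", "2016-10-10"): "/warehouse/t/ds=2016-10-10",
    ("ds", "2016-10-11"): "/warehouse/t/ds=2016-10-11",
}

def prune_partitions(partitions, predicate):
    """Keep metadata only for partitions whose spec satisfies the
    partition-pruning predicate, instead of loading everything."""
    return {spec: loc for spec, loc in partitions.items() if predicate(spec)}

# WHERE ds = '2016-10-11' touches one partition, not all three.
pruned = prune_partitions(all_partitions, lambda spec: spec[1] == "2016-10-11")
print(len(pruned))  # 1
```

With millions of partitions, the difference between listing all of them and listing only the pruned subset is exactly the cold-start cost the ticket targets.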
[jira] [Created] (SPARK-17862) Feature flag SPARK-16980
Reynold Xin created SPARK-17862: --- Summary: Feature flag SPARK-16980 Key: SPARK-17862 URL: https://issues.apache.org/jira/browse/SPARK-17862 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Push data source partitions into metastore for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Description: Initially, Spark SQL does not store any partition information in the catalog for data source tables, because initially it was designed to work with arbitrary files. This, however, has a few issues for catalog tables: 1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start. 2. Does not support heterogeneous partition naming schemes. 3. Cannot leverage pushing partition pruning into the metastore. This ticket tracks the work required to push the tracking of partitions into the metastore. This change should be feature flagged. was: Initially, Spark SQL does not store any partition information in the catalog for data source tables, because initially it was designed to work with arbitrary files. This, however, has a few issues for catalog tables: 1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start. 2. Does not support heterogeneous partition naming schemes. 3. Cannot leverage pushing partition pruning into the metastore. This ticket tracks the work required to push the tracking of partitions into the metastore. > Push data source partitions into metastore for catalog tables > - > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. 
Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
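[Editor's note] Item 3 above — pushing partition pruning into the metastore — amounts to translating the query's partition predicates into a filter the metastore can evaluate server-side (in the style of Hive's listPartitionsByFilter call), so only matching partitions come back. The helper and quoting rules below are simplified assumptions, not real client code.

```python
def to_metastore_filter(predicates):
    """predicates: list of (partition_column, value) equality predicates.
    Returns a Hive-style filter string for a listPartitionsByFilter call."""
    clauses = []
    for col, value in predicates:
        if isinstance(value, str):
            clauses.append('{} = "{}"'.format(col, value))   # quote string values
        else:
            clauses.append("{} = {}".format(col, value))     # numeric values unquoted
    return " and ".join(clauses)

print(to_metastore_filter([("ds", "2016-10-11"), ("hr", 11)]))
# ds = "2016-10-11" and hr = 11
```

Only predicates over partition columns can be pushed this way; everything else still has to be evaluated by Spark after the pruned partition list comes back.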
[jira] [Updated] (SPARK-17861) Push data source partitions into metastore for catalog tables and support partition pruning
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Summary: Push data source partitions into metastore for catalog tables and support partition pruning (was: Push data source partitions into metastore for catalog tables) > Push data source partitions into metastore for catalog tables and support > partition pruning > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Store data source partitions in metastore and push partition pruning into metastore
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Summary: Store data source partitions in metastore and push partition pruning into metastore (was: Store data source partitions in metastore and push partition pruning into the metastore) > Store data source partitions in metastore and push partition pruning into > metastore > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Store data source partitions in metastore and push partition pruning into the metastore
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Summary: Store data source partitions in metastore and push partition pruning into the metastore (was: Push data source partitions into metastore for catalog tables and support partition pruning) > Store data source partitions in metastore and push partition pruning into the > metastore > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17861) Push data source partitions into metastore for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564058#comment-15564058 ] Reynold Xin commented on SPARK-17861: - cc [~michael] this is the main work I want to get in for 2.1. > Push data source partitions into metastore for catalog tables > - > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17861) Push data source partitions into metastore for catalog tables
Reynold Xin created SPARK-17861: --- Summary: Push data source partitions into metastore for catalog tables Key: SPARK-17861 URL: https://issues.apache.org/jira/browse/SPARK-17861 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Initially, Spark SQL does not store any partition information in the catalog for data source tables, because initially it was designed to work with arbitrary files. This, however, has a few issues for catalog tables: 1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start. 2. Does not support heterogeneous partition naming schemes. 3. Cannot leverage pushing partition pruning into the metastore. This ticket tracks the work required to push the tracking of partitions into the metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Push data source partitions into metastore for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Priority: Critical (was: Major) > Push data source partitions into metastore for catalog tables > - > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
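The difference between the two listing strategies described in this ticket can be sketched in miniature. This is a toy model in plain Python; the `Metastore` class and its methods are hypothetical stand-ins, not Spark or Hive APIs. It shows why handing the pruning predicate to the catalog beats listing every partition client-side:

```python
# Toy model of the two strategies. The Metastore class and its API are
# hypothetical illustrations, not Spark's actual interfaces.

class Metastore:
    def __init__(self, partitions):
        # partitions: dict mapping a partition spec tuple, e.g.
        # ("dt", "2016-10-01"), to a storage path
        self.partitions = partitions

    def list_all(self):
        # Cold-start cost is proportional to the total partition count.
        return list(self.partitions.items())

    def list_by_filter(self, predicate):
        # Pruning pushed into the metastore: only matching specs come back.
        return [(spec, path) for spec, path in self.partitions.items()
                if predicate(spec)]


store = Metastore({("dt", f"2016-10-{d:02d}"): f"/data/dt=2016-10-{d:02d}"
                   for d in range(1, 31)})

# Client-side pruning: fetch everything, then filter.
all_parts = store.list_all()
client_pruned = [p for p in all_parts if p[0][1] == "2016-10-05"]

# Metastore-side pruning: fetch only what the predicate allows.
server_pruned = store.list_by_filter(lambda spec: spec[1] == "2016-10-05")

assert client_pruned == server_pruned
print(len(all_parts), len(server_pruned))  # partitions listed vs fetched
```

Both strategies return the same partition, but the first pays for listing all thirty; with millions of partitions that listing dominates cold-start time, which is exactly issue 1 in the ticket.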
[jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query
[ https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16980: Assignee: Michael Allman > Load only catalog table partition metadata required to answer a query > - > > Key: SPARK-16980 > URL: https://issues.apache.org/jira/browse/SPARK-16980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Michael Allman > > Currently, when a user reads from a partitioned Hive table whose metadata are > not cached (and for which Hive table conversion is enabled and supported), > all partition metadata are fetched from the metastore: > https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260 > However, if the user's query includes partition pruning predicates then we > only need the subset of these metadata which satisfy those predicates. > This issue tracks work to modify the current query planning scheme so that > unnecessary partition metadata are not loaded. > I've prototyped two possible approaches. The first extends > {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally > applicable. It requires some new abstractions and refactoring of > {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater > burden on other implementations of {{ExternalCatalog}}. Currently the only > other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my > prototype throws an {{UnsupportedOperationException}} on that implementation. > The second prototype is simpler and only touches code in the {{hive}} > project. Basically, conversion of a partitioned {{MetastoreRelation}} to > {{HadoopFsRelation}} is deferred to physical planning. 
During physical > planning, the partition pruning filters in the query plan are used to > identify the required partition metadata and a {{HadoopFsRelation}} is built > from those. The new query plan is then re-injected into the physical planner > and proceeds as normal for a {{HadoopFsRelation}}. > On the Spark dev mailing list, [~ekhliang] expressed a preference for the > approach I took in my first POC. (See > http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html) > Based on that, I'm going to open a PR with that patch as a starting point > for an architectural/design review. It will not be a complete patch ready for > integration into Spark master. Rather, I would like to get early feedback on > the implementation details so I can shape the PR before committing a large > amount of time on a finished product. I will open another PR for the second > approach for comparison if requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
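The second prototype's idea, deferring the metastore round trip until physical planning when the pruning filters are known, can be sketched as follows. All class and function names are illustrative, not Spark's:

```python
# Minimal sketch of deferring partition-metadata loading until pruning
# predicates are available (names are illustrative, not Spark's classes).

class LazyPartitionedRelation:
    def __init__(self, fetch_partitions):
        self._fetch = fetch_partitions   # callable: predicate -> partitions
        self.fetch_calls = []            # record how much metadata we loaded

    def plan(self, pruning_predicate):
        # Only now, at "physical planning" time, do we touch the metastore,
        # and only for partitions the predicate keeps.
        parts = self._fetch(pruning_predicate)
        self.fetch_calls.append(len(parts))
        return parts


PARTITIONS = [{"year": y} for y in range(2000, 2017)]

def fetch(pred):
    # Stand-in for a filtered metastore listing.
    return [p for p in PARTITIONS if pred(p)]

rel = LazyPartitionedRelation(fetch)
planned = rel.plan(lambda p: p["year"] >= 2015)
assert [p["year"] for p in planned] == [2015, 2016]
```

The key property is that constructing the relation costs nothing; metadata is only materialized for the two partitions the filter keeps, out of seventeen.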
[jira] [Commented] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563974#comment-15563974 ] Joseph K. Bradley commented on SPARK-17801: --- Btw, that maxBins setting is way too high. It should be in the hundreds at most. > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > 
"-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563966#comment-15563966 ] Joseph K. Bradley edited comment on SPARK-17801 at 10/11/16 12:10 AM: -- Have you tried this with Spark 1.6.2 or 2.0.0 or 2.0.1? was (Author: josephkb): Have you tried this with Spark 2.0? > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > 
"-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563966#comment-15563966 ] Joseph K. Bradley commented on SPARK-17801: --- Have you tried this with Spark 2.0? > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > 
"-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14610. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 12374 [https://github.com/apache/spark/pull/12374] > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.1.0 > > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14610: -- Assignee: Seth Hendrickson > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
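The corrected behavior can be illustrated with a simplified stand-alone version of the split finder: thresholds go between distinct values, so k distinct values yield at most k - 1 candidate splits, and the degenerate {1,2,3|} split disappears. This mimics the idea only; Spark's actual method weights candidate thresholds by value counts:

```python
# Sketch of the fix described above: candidate split thresholds for a
# continuous feature fall *between* consecutive distinct values, never
# after the largest one (which would send every row to one side).

def find_splits_for_continuous_feature(samples):
    distinct = sorted(set(samples))
    # Midpoints between consecutive distinct values; illustrative only,
    # not Spark's exact (count-weighted) implementation.
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

feature_samples = [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3]
splits = find_splits_for_continuous_feature(feature_samples)
assert len(splits) == 2   # the old test asserted 3, counting {1,2,3|}
print(splits)
```

With the superfluous split removed, the unit test quoted in the ticket would assert `splits.length === 2`.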
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563944#comment-15563944 ] Hossein Falaki commented on SPARK-17781: Yes, but somehow inside {{worker.R}} Date fields in the list are treated as Double. Case in point is the reproducing example in the body of the ticket. To be honest, I am still confused about this too. > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
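Some context for the symptom: R stores a `Date` internally as the number of days since 1970-01-01, so a Date that gets degraded to a plain double on the worker surfaces as a number like 17084. A pure-Python sketch of the recovery (the SparkR worker internals differ; this only illustrates the encoding):

```python
# R's Date is days since the Unix epoch, which is why a Date field that
# loses its class attribute round-trips as a raw double.

from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def double_to_date(days):
    # Recover the calendar date from the days-since-epoch double.
    return EPOCH + timedelta(days=int(days))

# 17084.0 is what Sys.Date() for 2016-10-10 looks like once degraded
# to a bare double.
assert double_to_date(17084.0) == date(2016, 10, 10)
```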
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563919#comment-15563919 ] Seth Hendrickson commented on SPARK-9478: - I'm going to revive this, and hopefully submit a PR soon. > Add class weights to Random Forest > -- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
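For reference, the core change class weights imply for tree training is that impurity (and leaf predictions) are computed from weighted counts instead of raw counts. A minimal, hypothetical sketch with a weighted Gini impurity, not Spark's implementation:

```python
# Weighted Gini impurity: each label contributes its class weight
# instead of a raw count of 1.

from collections import Counter

def weighted_gini(labels, class_weight):
    w = Counter()
    for y in labels:
        w[y] += class_weight.get(y, 1.0)   # default weight 1.0
    total = sum(w.values())
    return 1.0 - sum((c / total) ** 2 for c in w.values())

labels = [0] * 90 + [1] * 10               # imbalanced: 90% negatives

unweighted = weighted_gini(labels, {})     # 1 - (0.9^2 + 0.1^2) = 0.18
upweighted = weighted_gini(labels, {1: 9.0})   # minority weighted to parity

assert abs(unweighted - 0.18) < 1e-9
assert abs(upweighted - 0.5) < 1e-9        # weighted 90 vs 90: max impurity
```

Under the up-weighting, a split that isolates positives now reduces impurity far more, which is how class weights counteract imbalance such as the true-positive-rate scenario mentioned in the ticket.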
[jira] [Assigned] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17860: Assignee: Apache Spark > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with the database name from the qualified table name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563915#comment-15563915 ] Apache Spark commented on SPARK-17860: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/15423 > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with the database name from the qualified table name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17860: Assignee: (was: Apache Spark) > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with the database name from the qualified table name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563910#comment-15563910 ] Jakob Odersky commented on SPARK-15577: --- This was considered and trade-offs were actively discussed, but ultimately the type alias was chosen over subclassing. I think the main argument in favor of aliasing was to avoid incompatibilities in future libraries, i.e. a utility function is written to accept a {{DataFrame}}, but I want to pass in a {{Dataset\[Row\]}}. [This email thread| http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-DataFrame-vs-Dataset-in-Spark-2-0-td16445.html] contains the whole discussion. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563910#comment-15563910 ] Jakob Odersky edited comment on SPARK-15577 at 10/10/16 11:41 PM: -- This was considered and trade-offs were actively discussed, but ultimately the type alias was chosen over subclassing. I think a principal argument in favor of aliasing was to avoid incompatibilities in future libraries, i.e. a utility function is written to accept a {{DataFrame}}, but I want to pass in a {{Dataset\[Row\]}}. [This email thread| http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-DataFrame-vs-Dataset-in-Spark-2-0-td16445.html] contains the whole discussion. was (Author: jodersky): This was considered and trade-offs were actively discussed, but ultimately the type alias was chosen over subclassing. I think the main argument in favor of aliasing was to avoid incompatibilities in future libraries, i.e. a utility function is written to accept a {{DataFrame}}, but I want to pass in a {{Dataset\[Row\]}}. [This email thread| http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-DataFrame-vs-Dataset-in-Spark-2-0-td16445.html] contains the whole discussion. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
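The compatibility argument in the comment above can be made concrete with a small Python analogy (in Scala the alias is `type DataFrame = Dataset[Row]`; the classes below are toy stand-ins): with an alias the two names denote the same type, so values flow in both directions, while a subclass only accepts one direction.

```python
# Alias vs subclass, in miniature.

class Dataset:
    def __init__(self, rows):
        self.rows = rows

# Alias approach (what Spark 2.0 chose): the names are the same type.
DataFrame = Dataset

def old_library_fn(df: "DataFrame"):
    # A pre-existing utility written against DataFrame.
    return len(df.rows)

ds = Dataset([{"a": 1}, {"a": 2}])
assert old_library_fn(ds) == 2      # a Dataset *is* a DataFrame: no conversion

# Subclass approach would break that direction:
class DataFrameSub(Dataset):
    pass

assert isinstance(DataFrameSub([]), Dataset)   # subclass -> parent: fine
assert not isinstance(ds, DataFrameSub)        # parent -> subclass: rejected
```

This is the incompatibility the comment describes: with a subclass, a library function taking `DataFrame` could not be handed a `Dataset[Row]` without a wrapping step.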
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563894#comment-15563894 ] Jo Desmet commented on SPARK-15343: --- Still not acceptable, I mean how can we. This tool has been presented as a tool with near-perfect Hadoop integration, which we are about to drop. It absolutely does not make sense to drop this as 'not-a-problem': it is a clearly described regression for a mainstream deployment. I am all for galvanizing the Hadoop community in upping their libraries, and Spark is a good motivation, but this is all too harsh. There is absolutely no way back once we go that direction. It might mean losing and alienating our existing user-base. For example, big corporations are using vetted Hadoop Stacks. I do work for one such bigger corporation, and I know that that kind of thing means a lot to them. Before we know it, we will end up with a complex maze of what version of hadoop works with what version of Mesos, Hadoop, etc. If Java 9 provides the solution, or providing shaded libraries, then we should wait till that is in place before moving forward. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. 
> Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. 
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) >
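A workaround often reported for this particular failure (hedged: the right fix depends on the Hadoop/YARN build in use, so verify against your cluster) is to keep the YARN client from instantiating the timeline-service client, which is the code path in the stack trace that needs the missing Jersey 1.x classes, or to restore Jersey 1.x client jars on the classpath explicitly:

```
# spark-defaults.conf -- workaround sketch; jar paths are placeholders
spark.hadoop.yarn.timeline-service.enabled  false

# or, alternatively, put the Jersey 1.x client jars back explicitly:
# spark.driver.extraClassPath  /path/to/jersey-client-1.9.jar:/path/to/jersey-core-1.9.jar
```

Disabling the timeline service only sidesteps `TimelineClient.createTimelineClient`; it does not resolve the broader shading/versioning concern the comment raises.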
[jira] [Created] (SPARK-17860) SHOW COLUMN's database conflict check should use case sensitive compare.
Dilip Biswal created SPARK-17860: Summary: SHOW COLUMN's database conflict check should use case sensitive compare. Key: SPARK-17860 URL: https://issues.apache.org/jira/browse/SPARK-17860 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Dilip Biswal Priority: Minor The SHOW COLUMNS command validates the user-supplied database name against the database name from the qualified table name to make sure both are consistent. This comparison should respect case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-17860: - Summary: SHOW COLUMN's database conflict check should respect case sensitivity setting (was: SHOW COLUMN's database conflict check should use case sensitive compare.) > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > The SHOW COLUMNS command validates the user-supplied database > name against the database name from the qualified table name to make > sure both are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
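The fix described above amounts to comparing the two database names with a comparison governed by the session's `spark.sql.caseSensitive` setting, rather than a hard-coded case-sensitive equality. A minimal sketch of that check, using a hypothetical `database_names_conflict` helper (not Spark's actual implementation):

```python
def database_names_conflict(user_db, table_db, case_sensitive):
    """Return True if the user-supplied database name conflicts with the
    database qualifier of the table name, respecting the session's
    case-sensitivity flag. Hypothetical helper for illustration only."""
    if case_sensitive:
        return user_db != table_db
    # With case sensitivity off, compare names case-insensitively.
    return user_db.lower() != table_db.lower()

# With spark.sql.caseSensitive=false (the default), these do not conflict:
assert not database_names_conflict("SALES", "sales", case_sensitive=False)
# With case sensitivity enabled, they do:
assert database_names_conflict("SALES", "sales", case_sensitive=True)
```

The retitled summary ("respect case sensitivity setting") reflects exactly this: the conflict check should delegate to the configured resolver instead of assuming one fixed comparison.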
[jira] [Commented] (SPARK-17857) SHOW TABLES IN schema throws exception if schema doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563877#comment-15563877 ] Todd Nemet commented on SPARK-17857: I didn't even think to check Hive 1.2, since I figured it would have the same behavior as Spark 1.5 and 1.6. I'm really surprised by that. For context, due to [SPARK-9686|https://issues.apache.org/jira/browse/SPARK-9686] we use SHOW and DESCRIBE to walk the schema in Spark instead of connection.getMetaData(), like we do in Hive and Impala. We are catching the exception to work around it, but since it's a change from 1.x I thought I'd file it as a minor bug. But since it's not a difference from Hive, perhaps it's not a bug. > SHOW TABLES IN schema throws exception if schema doesn't exist > -- > > Key: SPARK-17857 > URL: https://issues.apache.org/jira/browse/SPARK-17857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Todd Nemet >Priority: Minor > > SHOW TABLES IN badschema; throws > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException if badschema > doesn't exist. In Spark 1.x it would return an empty result set. 
> On Spark 2.0.1: > {code} > [683|12:45:56] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/ -n hive > Connecting to jdbc:hive2://localhost:10006/ > 16/10/10 12:46:00 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/10 12:46:00 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/10 12:46:00 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/ > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1.spark2 by Apache Hive > 0: jdbc:hive2://localhost:10006/> show schemas; > +---+--+ > | databaseName | > +---+--+ > | default | > | looker_scratch| > | spark_jira| > | spark_looker_scratch | > | spark_looker_test | > +---+--+ > 5 rows selected (0.61 seconds) > 0: jdbc:hive2://localhost:10006/> show tables in spark_looker_test; > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.611 seconds) > 0: jdbc:hive2://localhost:10006/> show tables in badschema; > Error: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: > Database 'badschema' not found; (state=,code=0) > {code} > On Spark 1.6.2: > {code} > [680|12:47:26] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10005/ -n hive > Connecting to jdbc:hive2://localhost:10005/ > 16/10/10 12:47:29 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/10 12:47:29 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/10 12:47:30 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/ > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1.spark2 by Apache Hive > 0: jdbc:hive2://localhost:10005/> show schemas; > 
++--+ > | result | > ++--+ > | default| > | spark_jira | > | spark_looker_test | > | spark_scratch | > ++--+ > 4 rows selected (0.613 seconds) > 0: jdbc:hive2://localhost:10005/> show tables in spark_looker_test; > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.575 seconds) > 0: jdbc:hive2://localhost:10005/> show tables in badschema; > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.458 seconds) > {code} > [Relevant part of Hive QL > docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
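The behavioral difference shown in the two beeline transcripts above can be modeled in a few lines: Spark 1.x effectively returned an empty result set for an unknown schema, while 2.0 raises `NoSuchDatabaseException`. A toy sketch in plain Python (the `show_tables` helper and dict-based catalog are illustrative, not Spark code):

```python
class NoSuchDatabaseException(Exception):
    """Stand-in for org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException."""
    pass

def show_tables(catalog, db, legacy=False):
    """Return table names in `db`. In legacy (1.x-like) mode an unknown
    database yields an empty list; otherwise it raises, as in Spark 2.0."""
    if db not in catalog:
        if legacy:
            return []  # Spark 1.x: empty result set
        raise NoSuchDatabaseException(f"Database '{db}' not found")
    return sorted(catalog[db])

catalog = {"spark_looker_test": {"orders", "users", "order_items", "all_types"}}

assert show_tables(catalog, "badschema", legacy=True) == []  # 1.6.2 behavior
try:
    show_tables(catalog, "badschema")  # 2.0.1 behavior
except NoSuchDatabaseException:
    pass
```

Callers walking the schema via SHOW/DESCRIBE (as described in the comment) must therefore catch the exception on 2.x, exactly the workaround the reporter mentions.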
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563862#comment-15563862 ] Shivaram Venkataraman commented on SPARK-17781: --- I'm not sure I follow. The class of the elements inside should remain `Date` ? {code} > l <- lapply(1:2, function(x) { Sys.Date() }) > class(l[[1]]) [1] "Date" > class(l[[2]]) [1] "Date" {code} > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563839#comment-15563839 ] Mike Dusenberry commented on SPARK-4630: It would be really nice to revisit this issue, perhaps even updated to just focus on {{DataSet}} given the current direction of the project. Basically, I would be really interested in Smart DataSet/DataFrame/RDD repartitioning that automatically decides the proper number of partitions based on characteristics of the data (i.e. width) and the cluster. From my experience, the outcome of a "wrong" number of partitions is frequent OOM errors, 2GB partition limits (although that's been lifted in 2.x, it's still a perf issue), and if you save a DataFrame/DataSet to, say, Parquet format with too few partitions, the individual compressed files may be too large to read later on (say if a partition is 2GB, but there isn't enough executor memory to open that when working with the files later on with a different Spark setup -- think perhaps a batch preprocessing cluster vs. production serving cluster). > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions to the > number of tasks - larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large partitions should be that get executed > by a task. If partitions are too small, then the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, then task > execution slows down due to gc pressure and spilling to disk. > To increase performance of jobs, users often hand optimize the number(size) > of partitions that the next stage gets. 
Factors that come into play are: > Incoming partition sizes from previous stage > number of available executors > available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to automatically do the > partition sizing for the user. This feature can be turned off/on with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take into > account partition sizes upon stage completion. Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number of tasks to create. Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
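The proposal boils down to picking a task count from the measured output size of the previous stage so that each partition lands near a target size. A back-of-the-envelope version of that sizing rule (the helper name, the 128 MB target, and the clamping bounds are illustrative assumptions, not anything from the design doc):

```python
import math

def pick_num_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024,
                        min_partitions=1, max_partitions=10000):
    """Choose a partition count so each partition is near a target size,
    clamped to sane bounds. Illustrative only; the actual DAGScheduler
    change would also weigh executor count and available memory."""
    n = math.ceil(total_bytes / target_partition_bytes)
    return max(min_partitions, min(n, max_partitions))

# 10 GB of shuffle output at a 128 MB target -> 80 partitions
assert pick_num_partitions(10 * 1024**3) == 80
# Tiny or empty stages still get at least one partition
assert pick_num_partitions(0) == 1
```

This is also the shape of the manual tuning users do today with `repartition(n)`: estimate total bytes, divide by a size that fits comfortably in executor memory, and round up.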
[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563839#comment-15563839 ] Mike Dusenberry edited comment on SPARK-4630 at 10/10/16 11:10 PM: --- It would be really nice to revisit this issue, perhaps even updated to just focus on {{DataSet}} given the current direction of the project. Basically, I would be really interested in smart DataSet/DataFrame/RDD repartitioning that automatically decides the proper number of partitions based on characteristics of the data (i.e. width) and the cluster. From my experience, the outcome of a "wrong" number of partitions is frequent OOM errors, 2GB partition limits (although that's been lifted in 2.x, it's still a perf issue), and if you save a DataFrame/DataSet to, say, Parquet format with too few partitions, the individual compressed files may be too large to read later on (say if a partition is 2GB, but there isn't enough executor memory to open that when working with the files later on with a different Spark setup -- think perhaps a batch preprocessing cluster vs. production serving cluster). was (Author: mwdus...@us.ibm.com): It would be really nice to revisit this issue, perhaps even updated to just focus on {{DataSet}} given the current direction of the project. Basically, I would be really interested in Smart DataSet/DataFrame/RDD repartitioning that automatically decides the proper number of partitions based on characteristics of the data (i.e. width) and the cluster. 
From my experience, the outcome of a "wrong" number of partitions is frequent OOM errors, 2GB partition limits (although that's been lifted in 2.x, it's still a perf issue), and if you save a DataFrame/DataSet to, say, Parquet format with too few partitions, the individual compressed files may be too large to read later on (say if a partition is 2GB, but there isn't enough executor memory to open that when working with the files later on with a different Spark setup -- think perhaps a batch preprocessing cluster vs. production serving cluster). > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions to the > number of tasks - larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large partitions should be that get executed > by a task. If partitions are too small, then the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, then task > execution slows down due to gc pressure and spilling to disk. > To increase performance of jobs, users often hand optimize the number(size) > of partitions that the next stage gets. Factors that come into play are: > Incoming partition sizes from previous stage > number of available executors > available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to automatically do the > partition sizing for the user. This feature can be turned off/on with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take into > account partition sizes upon stage completion. 
Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number of tasks to create. Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.
Franck Tago created SPARK-17859: --- Summary: persist should not impede with spark's ability to perform a broadcast join. Key: SPARK-17859 URL: https://issues.apache.org/jira/browse/SPARK-17859 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.0.0 Environment: spark 2.0.0 , Linux RedHat Reporter: Franck Tago I am using Spark 2.0.0 My investigation leads me to conclude that calling persist could prevent broadcast join from happening . Example Case1: No persist call var df1 =spark.range(100).select($"id".as("id1")) df1: org.apache.spark.sql.DataFrame = [id1: bigint] var df2 =spark.range(1000).select($"id".as("id2")) df2: org.apache.spark.sql.DataFrame = [id2: bigint] df1.join(df2 , $"id1" === $"id2" ).explain == Physical Plan == *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight :- *Project [id#114L AS id1#117L] : +- *Range (0, 100, splits=2) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *Project [id#120L AS id2#123L] +- *Range (0, 1000, splits=2) Case 2: persist call df1.persist.join(df2 , $"id1" === $"id2" ).explain 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data. == Physical Plan == *SortMergeJoin [id1#3L], [id2#9L], Inner :- *Sort [id1#3L ASC], false, 0 : +- Exchange hashpartitioning(id1#3L, 10) : +- InMemoryTableScan [id1#3L] :: +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) :: : +- *Project [id#0L AS id1#3L] :: : +- *Range (0, 100, splits=2) +- *Sort [id2#9L ASC], false, 0 +- Exchange hashpartitioning(id2#9L, 10) +- InMemoryTableScan [id2#9L] : +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) : : +- *Project [id#6L AS id2#9L] : : +- *Range (0, 1000, splits=2) Why does the persist call prevent the broadcast join . My opinion is that it should not . I was made aware that the persist call is lazy and that might have something to do with it , but I still contend that it should not . 
Losing broadcast joins is really costly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
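The report above is consistent with a planner that decides the join strategy from a size estimate of the build side: an uncached `Range`-based frame has a small, known size and qualifies for broadcast, while the `InMemoryRelation` introduced by the lazy `persist` reports an estimate above the broadcast threshold (or no usable estimate at all), so the planner falls back to sort-merge join. A toy model of that decision (the helper and sizes are hypothetical, not Spark's planner; the 10 MB constant mirrors the default of `spark.sql.autoBroadcastJoinThreshold`):

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold

def choose_join(build_side_size_estimate):
    """Pick a join strategy from the build side's estimated size in bytes.
    None models an unavailable estimate, e.g. a cached plan that has not
    been materialized yet."""
    if build_side_size_estimate is not None and build_side_size_estimate <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

assert choose_join(800) == "BroadcastHashJoin"  # small Range: broadcast
assert choose_join(None) == "SortMergeJoin"     # no usable estimate: merge join
```

Users who hit this can usually force the cheap plan back with an explicit hint, e.g. `df1.persist().join(broadcast(df2), ...)` in the DataFrame API, though whether `persist` should change the estimate in the first place is exactly what this issue questions.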
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Description: In Spark 2.0, corrupt files will fail a SQL query. However, the user may just want to skip corrupt files and still run the query. Another painful thing is that the current exception doesn't contain the paths of the corrupt files, making it hard for the user to fix their files. Note: In Spark 1.6, Spark SQL always skips corrupt files because of SPARK-17850. was: In Spark 2.0, corrupt files will fail a job. However, the user may just want to skip corrupt files and still run the query. Another painful thing is that the current exception doesn't contain the paths of the corrupt files, making it hard for the user to fix their files. > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a SQL query. However, the user may just > want to skip corrupt files and still run the query. > Another painful thing is that the current exception doesn't contain the paths > of the corrupt files, making it hard for the user to fix their files. > Note: In Spark 1.6, Spark SQL always skips corrupt files because of > SPARK-17850. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
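The requested behavior, skip corrupt files but report which paths were skipped so the user can fix them, can be sketched outside Spark with Python's gzip module. The `read_lines_skipping_corrupt` helper below is hypothetical, not the Spark option itself (the option that eventually landed in Spark is, to my knowledge, `spark.sql.files.ignoreCorruptFiles`):

```python
import gzip
import os
import tempfile

def read_lines_skipping_corrupt(paths):
    """Return (lines, corrupt_paths): lines from readable gzip files, plus
    the paths of files that could not be fully decompressed."""
    lines, corrupt = [], []
    for path in paths:
        try:
            with gzip.open(path, "rt") as f:
                lines.extend(f.read().splitlines())
        except (OSError, EOFError):  # truncated or otherwise corrupt gzip
            corrupt.append(path)
    return lines, corrupt

# Build one valid and one truncated (corrupt) gzip file.
tmp = tempfile.mkdtemp()
good, bad = os.path.join(tmp, "good.gz"), os.path.join(tmp, "bad.gz")
with gzip.open(good, "wt") as f:
    f.write("a\nb\n")
with open(good, "rb") as f:
    blob = f.read()
with open(bad, "wb") as f:
    f.write(blob[: len(blob) // 2])  # cut the stream mid-deflate

lines, corrupt = read_lines_skipping_corrupt([good, bad])
assert lines == ["a", "b"]
assert corrupt == [bad]
```

Collecting the corrupt paths addresses the second complaint in the description: today's exception doesn't tell the user which files to fix.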
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Description: In Spark 2.0, corrupt files will fail a job. However, the user may just want to skip corrupt files and still run the query. Another painful thing is that the current exception doesn't contain the paths of the corrupt files, making it hard for the user to fix their files. was:In Spark 2.0, corrupt files will fail a job. However, the user may not > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a job. However, the user may just want > to skip corrupt files and still run the query. > Another painful thing is that the current exception doesn't contain the paths > of the corrupt files, making it hard for the user to fix their files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Issue Type: Improvement (was: Bug) > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a job. However, the user may not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Description: In Spark 2.0, corrupt files will fail a job. However, the user may not > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Bug >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a job. However, the user may not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
Shixiong Zhu created SPARK-17858: Summary: Provide option for Spark SQL to skip corrupt files Key: SPARK-17858 URL: https://issues.apache.org/jira/browse/SPARK-17858 Project: Spark Issue Type: Bug Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17850) HadoopRDD should not swallow EOFException
[ https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17850: Assignee: (was: Apache Spark) > HadoopRDD should not swallow EOFException > - > > Key: SPARK-17850 > URL: https://issues.apache.org/jira/browse/SPARK-17850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1 >Reporter: Shixiong Zhu > Labels: correctness > > The code in > https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256 > catches EOFException and mark RecordReader finished. However, in some cases, > RecordReader will throw EOFException to indicate the stream is corrupted. See > the following stack trace as an example: > {code} > Caused by: java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted > gzip files). > Note: NewHadoopRDD doesn't have this issue. > This is reported by Bilal Aslam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
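The bug pattern above is a reader loop that treats every EOFException as a normal end of input, which silently truncates corrupt files instead of failing. The safer shape is to reserve a dedicated signal for a clean end of stream and let mid-record corruption propagate. A small sketch in plain Python (the reader API and exception name are hypothetical, not the HadoopRDD code):

```python
class CorruptStreamError(Exception):
    """Models an EOFException thrown mid-record by a RecordReader."""
    pass

def records(read_one):
    """Drive a reader callable: read_one() returns a record, returns None
    at a clean end of input, or raises CorruptStreamError mid-record.
    Unlike the HadoopRDD code in question, corruption is NOT swallowed."""
    while True:
        rec = read_one()
        if rec is None:  # clean end-of-stream sentinel
            return
        yield rec

# Clean stream: all records come through, iteration ends normally.
clean = iter(["r1", "r2", None])
assert list(records(lambda: next(clean))) == ["r1", "r2"]

# Corrupt stream: the error propagates and fails the job, as it should.
corrupt = iter(["r1"])
def read_corrupt():
    try:
        return next(corrupt)
    except StopIteration:
        raise CorruptStreamError("unexpected end of input stream")

try:
    list(records(read_corrupt))
    raise AssertionError("should have raised")
except CorruptStreamError:
    pass
```

A blanket `catch (EOFException)` collapses these two cases into one, which is why corrupted gzip inputs were read as if they ended cleanly.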
[jira] [Assigned] (SPARK-17850) HadoopRDD should not swallow EOFException
[ https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17850: Assignee: Apache Spark > HadoopRDD should not swallow EOFException > - > > Key: SPARK-17850 > URL: https://issues.apache.org/jira/browse/SPARK-17850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Apache Spark > Labels: correctness > > The code in > https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256 > catches EOFException and mark RecordReader finished. However, in some cases, > RecordReader will throw EOFException to indicate the stream is corrupted. See > the following stack trace as an example: > {code} > Caused by: java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted > gzip files). > Note: NewHadoopRDD doesn't have this issue. > This is reported by Bilal Aslam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17850) HadoopRDD should not swallow EOFException
[ https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563774#comment-15563774 ] Apache Spark commented on SPARK-17850: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15422 > HadoopRDD should not swallow EOFException > - > > Key: SPARK-17850 > URL: https://issues.apache.org/jira/browse/SPARK-17850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1 >Reporter: Shixiong Zhu > Labels: correctness > > The code in > https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256 > catches EOFException and mark RecordReader finished. However, in some cases, > RecordReader will throw EOFException to indicate the stream is corrupted. See > the following stack trace as an example: > {code} > Caused by: java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted > gzip files). > Note: NewHadoopRDD doesn't have this issue. > This is reported by Bilal Aslam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org