[jira] [Commented] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564586#comment-15564586 ] Herman van Hovell commented on SPARK-17868: --- [~jiangxb] can you work on this one? > Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS > > > Key: SPARK-17868 > URL: https://issues.apache.org/jira/browse/SPARK-17868 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > We generate bitmasks for grouping sets during the parsing process, and use > these during analysis. These bitmasks are difficult to work with in practice > and have led to numerous bugs. I suggest that we remove these and use actual > sets instead; however, we would need to generate the offsets for the > grouping_id. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17868: -- Assignee: (was: Herman van Hovell) > Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS > > > Key: SPARK-17868 > URL: https://issues.apache.org/jira/browse/SPARK-17868 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell > > We generate bitmasks for grouping sets during the parsing process, and use > these during analysis. These bitmasks are difficult to work with in practice > and have led to numerous bugs. I suggest that we remove these and use actual > sets instead; however, we would need to generate the offsets for the > grouping_id.
[jira] [Created] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
Herman van Hovell created SPARK-17868: - Summary: Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS Key: SPARK-17868 URL: https://issues.apache.org/jira/browse/SPARK-17868 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell We generate bitmasks for grouping sets during the parsing process, and use these during analysis. These bitmasks are difficult to work with in practice and have led to numerous bugs. I suggest that we remove these and use actual sets instead; however, we would need to generate the offsets for the grouping_id.
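For illustration, the relationship between the two representations discussed in this issue can be sketched as follows. This is a hypothetical sketch, not Spark's actual code: it keeps the grouping sets as explicit Python sets and derives the grouping_id bitmask only at the end, with bit i set to 1 when the i-th grouping column is absent from the set (the convention grouping_id exposes).

```python
# Hypothetical sketch of deriving grouping_id offsets from explicit sets;
# not Spark's implementation.

def grouping_id(all_group_cols, grouping_set):
    """Derive the grouping_id bitmask from an explicit set of columns.

    The first grouping column maps to the most significant bit; a bit is
    1 when that column is NOT part of the grouping set.
    """
    gid = 0
    for col in all_group_cols:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

# CUBE(a, b) expands to all subsets of {a, b}:
cols = ["a", "b"]
cube_sets = [{"a", "b"}, {"a"}, {"b"}, set()]
print([grouping_id(cols, s) for s in cube_sets])  # [0, 1, 2, 3]
```

Keeping the sets around during parsing and analysis, and computing the bitmask only where grouping_id is needed, avoids carrying bit-level bookkeeping through the whole pipeline.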
[jira] [Resolved] (SPARK-9560) Add LDA data generator
[ https://issues.apache.org/jira/browse/SPARK-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9560. -- Resolution: Won't Fix > Add LDA data generator > -- > > Key: SPARK-9560 > URL: https://issues.apache.org/jira/browse/SPARK-9560 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang > > Add data generator for LDA. > Hope it can help with performance improvement.
[jira] [Resolved] (SPARK-15957) RFormula supports forcing to index label
[ https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15957. - Resolution: Fixed Fix Version/s: 2.1.0 > RFormula supports forcing to index label > > > Key: SPARK-15957 > URL: https://issues.apache.org/jira/browse/SPARK-15957 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > RFormula currently indexes the label only when it is of string type. If the label > is of numeric type and we use RFormula to fit a classification model, there > are no label attributes in the label column metadata. The label attributes are > useful when making predictions for classification, so we can force indexing the > label with {{StringIndexer}}, whether it is of numeric or string type, for > classification. Then SparkR wrappers can extract label attributes from the label > column metadata successfully. This feature can help us fix bugs similar > to SPARK-15153. > For regression, we will still keep the label as numeric type. > In this PR, we add a param indexLabel to control whether to force indexing the > label for RFormula.
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564578#comment-15564578 ] Sean Owen commented on SPARK-15343: --- [~jdesmet] I'm not clear what you're advocating _in Spark_. See the discussion above. You're running into a problem with the YARN classpath. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. 
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) >
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564577#comment-15564577 ] Wenchen Fan commented on SPARK-17865: - All global temp views should be gone when the SparkContext is stopped. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564558#comment-15564558 ] Felix Cheung commented on SPARK-17865: -- I haven't kept up on SharedState before this but it looks to be tied to the SparkContext. In R, SparkSession is the entrypoint and the SparkContext underneath comes and goes with the session - is there anything global when the SparkContext is stopped? [~cloud_fan] > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Created] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
Liang-Chi Hsieh created SPARK-17867: --- Summary: Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name Key: SPARK-17867 URL: https://issues.apache.org/jira/browse/SPARK-17867 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh In Dataset.dropDuplicates we find and use the first resolved attribute from the output with the given column name. When there is more than one column with the same name, the other columns are put into the aggregation columns instead of the grouping columns. We should fix this.
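The resolution problem described above can be mimicked in a few lines of plain Python. This is an illustrative sketch of the bug mechanism, not Spark code: resolving a requested name to only the *first* matching attribute means a second column with the same name falls through to the aggregation side.

```python
# Hypothetical sketch of the name-resolution bug described above; not Spark code.

def resolve_grouping_cols(schema, dedup_names):
    """Mimic resolving dropDuplicates columns by first name match only.

    Returns (grouping column indices, aggregated column indices).
    """
    grouping, aggregated = [], []
    for idx, name in enumerate(schema):
        # Only the FIRST occurrence of a requested name becomes a grouping column.
        if name in dedup_names and schema.index(name) == idx:
            grouping.append(idx)
        else:
            # Everything else, including the second column also named "a",
            # is treated as an aggregation column.
            aggregated.append(idx)
    return grouping, aggregated

schema = ["a", "a", "b"]  # two columns share the name "a"
print(resolve_grouping_cols(schema, {"a"}))  # ([0], [1, 2])
```

A fix would resolve *all* attributes whose name matches, so both columns named "a" end up in the grouping columns.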
[jira] [Resolved] (SPARK-17844) DataFrame API should simplify defining frame boundaries without partitioning/ordering
[ https://issues.apache.org/jira/browse/SPARK-17844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17844. --- Resolution: Fixed Fix Version/s: 2.1.0 Resolved per Reynold's PR. > DataFrame API should simplify defining frame boundaries without > partitioning/ordering > - > > Key: SPARK-17844 > URL: https://issues.apache.org/jira/browse/SPARK-17844 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > When I was creating the example code for SPARK-10496, I realized it was > pretty convoluted to define the frame boundaries for window functions when > there is no partition column or ordering column. The reason is that we don't > provide a way to create a WindowSpec directly with the frame boundaries. We > can trivially improve this by adding rowsBetween and rangeBetween to the Window > object.
[jira] [Created] (SPARK-17866) Dataset.dropDuplicates (i.e., distinct) should not change the output of child plan
Liang-Chi Hsieh created SPARK-17866: --- Summary: Dataset.dropDuplicates (i.e., distinct) should not change the output of child plan Key: SPARK-17866 URL: https://issues.apache.org/jira/browse/SPARK-17866 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh We currently create a new Alias with a new exprId in Dataset.dropDuplicates. However, this causes a problem when we want to select the columns as follows: {code} val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS() // ds("_2") will cause analysis exception ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int]) {code}
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564534#comment-15564534 ] Felix Cheung commented on SPARK-17865: -- I can take this. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17719) Unify and tie up options in a single place in JDBC datasource API
[ https://issues.apache.org/jira/browse/SPARK-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17719: Assignee: Hyukjin Kwon > Unify and tie up options in a single place in JDBC datasource API > - > > Key: SPARK-17719 > URL: https://issues.apache.org/jira/browse/SPARK-17719 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > - JDBC options are arbitrarily located as {{Map[String, String]}}, > {{Properties}} and {{JDBCOptions}}. It'd be great if we can unify the usages. > - Also, there are several options not placed in {{JDBCOptions}}, namely > {{batchsize}}, {{fetchsize}} and {{isolationlevel}}. It'd be good if we put > them together. > - {{batchSize}} and {{isolationLevel}} are undocumented. > - We could verify and check the arguments before actually executing. For > example, it seems {{fetchsize}} is checked only after execution, and it seems > to throw an exception wrapped by {{SparkException}}.
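The consolidation proposed in this issue can be sketched in a few lines. This is a hypothetical illustration of the idea, not Spark's JDBCOptions class: all options live in one place, and known numeric options are validated up front rather than failing mid-query.

```python
# Hypothetical sketch of a single options holder with upfront validation;
# class and option names are illustrative, not Spark's API.

class JDBCOptions:
    KNOWN_INT_OPTIONS = {"batchsize", "fetchsize"}

    def __init__(self, url, table, options=None):
        self.url = url
        self.table = table
        # Normalize keys so "fetchSize" and "fetchsize" land in one place.
        self.options = {k.lower(): v for k, v in (options or {}).items()}
        self._validate()

    def _validate(self):
        # Check arguments before execution instead of surfacing a wrapped
        # exception only after the query has started.
        for key in self.KNOWN_INT_OPTIONS & self.options.keys():
            value = int(self.options[key])
            if value < 0:
                raise ValueError(f"{key} must be non-negative, got {value}")

opts = JDBCOptions("jdbc:postgresql://db/test", "t1", {"fetchSize": "100"})
print(opts.options["fetchsize"])  # 100
```

The point is the shape, not the details: one class owns every option, so nothing is scattered across a Map, a Properties instance, and ad-hoc lookups.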
[jira] [Updated] (SPARK-17776) Potentially duplicated names which might have conflicts between JDBC options and properties instance
[ https://issues.apache.org/jira/browse/SPARK-17776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17776: Assignee: Hyukjin Kwon > Potentially duplicated names which might have conflicts between JDBC options > and properties instance > > > Key: SPARK-17776 > URL: https://issues.apache.org/jira/browse/SPARK-17776 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This was discussed here - > https://github.com/apache/spark/pull/15292#discussion_r81128083 > Currently, `read.format("jdbc").option(...)`, > `write.format("jdbc").option(...)` and `write.jdbc(...)` write all the > options into a `Properties` instance, which might cause conflicts. > We should not pass those Spark-only options into JDBC drivers but instead handle > them within Spark.
[jira] [Resolved] (SPARK-17776) Potentially duplicated names which might have conflicts between JDBC options and properties instance
[ https://issues.apache.org/jira/browse/SPARK-17776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17776. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15292 [https://github.com/apache/spark/pull/15292] > Potentially duplicated names which might have conflicts between JDBC options > and properties instance > > > Key: SPARK-17776 > URL: https://issues.apache.org/jira/browse/SPARK-17776 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > This was discussed here - > https://github.com/apache/spark/pull/15292#discussion_r81128083 > Currently, `read.format("jdbc").option(...)`, > `write.format("jdbc").option(...)` and `write.jdbc(...)` write all the > options into a `Properties` instance, which might cause conflicts. > We should not pass those Spark-only options into JDBC drivers but instead handle > them within Spark.
[jira] [Resolved] (SPARK-17719) Unify and tie up options in a single place in JDBC datasource API
[ https://issues.apache.org/jira/browse/SPARK-17719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17719. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15292 [https://github.com/apache/spark/pull/15292] > Unify and tie up options in a single place in JDBC datasource API > - > > Key: SPARK-17719 > URL: https://issues.apache.org/jira/browse/SPARK-17719 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.1.0 > > > - JDBC options are arbitrarily located as {{Map[String, String]}}, > {{Properties}} and {{JDBCOptions}}. It'd be great if we can unify the usages. > - Also, there are several options not placed in {{JDBCOptions}}, namely > {{batchsize}}, {{fetchsize}} and {{isolationlevel}}. It'd be good if we put > them together. > - {{batchSize}} and {{isolationLevel}} are undocumented. > - We could verify and check the arguments before actually executing. For > example, it seems {{fetchsize}} is checked only after execution, and it seems > to throw an exception wrapped by {{SparkException}}.
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564495#comment-15564495 ] Reynold Xin commented on SPARK-17865: - [~jiangxb1987] never mind -- there is already the Python API. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) Python API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564492#comment-15564492 ] Reynold Xin commented on SPARK-17865: - Ah OK - can you then make sure the Python API is updated together with your follow-up PR? Yes, let me change this to R. > Python API for global temp view > --- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564494#comment-15564494 ] Reynold Xin commented on SPARK-17865: - cc [~felixcheung] [~yanboliang] do you know anybody who could take on this work? > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17865: Description: We need to add the R API for managing global temp views, mirroring the changes in SPARK-17338. was: We need to add the Python API for managing global temp views, mirroring the changes in SPARK-17338. > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the R API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17865) R API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17865: Summary: R API for global temp view (was: Python API for global temp view) > R API for global temp view > -- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) Python API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564489#comment-15564489 ] Wenchen Fan commented on SPARK-17865: - The Python API is already added, but the R API hasn't been. Should we update this JIRA to cover the R API? > Python API for global temp view > --- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Created] (SPARK-17865) Python API for global temp view
Reynold Xin created SPARK-17865: --- Summary: Python API for global temp view Key: SPARK-17865 URL: https://issues.apache.org/jira/browse/SPARK-17865 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin We need to add the Python API for managing global temp views, mirroring the changes in SPARK-17338.
[jira] [Commented] (SPARK-17865) Python API for global temp view
[ https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564484#comment-15564484 ] Reynold Xin commented on SPARK-17865: - cc [~cloud_fan] [~jiangxb1987] would you be interested in doing this? > Python API for global temp view > --- > > Key: SPARK-17865 > URL: https://issues.apache.org/jira/browse/SPARK-17865 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > We need to add the Python API for managing global temp views, mirroring the > changes in SPARK-17338.
[jira] [Updated] (SPARK-17338) Add global temp view support
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17338: Summary: Add global temp view support (was: add global temp view support) > Add global temp view support > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Updated] (SPARK-17338) add global temp view support
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17338: Summary: add global temp view support (was: add global temp view) > add global temp view support > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Updated] (SPARK-17338) Add global temp view support
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17338: Description: A global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database global_temp (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1. > Add global temp view support > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > > A global temporary view is a cross-session temporary view, which means it's > shared among all sessions. Its lifetime is the lifetime of the Spark > application, i.e. it will be automatically dropped when the application > terminates. It's tied to a system-preserved database global_temp (configurable > via SparkConf), and we must use the qualified name to refer to a global temp > view, e.g. SELECT * FROM global_temp.view1.
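The session-local vs. global distinction described above can be modeled in a few lines of plain Python. This is a hypothetical illustration of the semantics, not Spark code: local views live in a per-session registry, while global views live in one registry shared by all sessions and are reached only through the qualified global_temp name.

```python
# Hypothetical model of session-local vs. global temp views; not Spark code.

GLOBAL_TEMP_DB = "global_temp"  # system-preserved database name
global_views = {}               # shared by all sessions, lives for the app

class Session:
    def __init__(self):
        self.local_views = {}   # dropped when this session ends

    def create_temp_view(self, name, df):
        self.local_views[name] = df

    def create_global_temp_view(self, name, df):
        global_views[name] = df

    def table(self, name):
        db, _, view = name.partition(".")
        if db == GLOBAL_TEMP_DB:       # must use the qualified name
            return global_views[view]
        return self.local_views[name]  # unqualified -> session-local only

s1, s2 = Session(), Session()
s1.create_global_temp_view("view1", "data")
print(s2.table("global_temp.view1"))  # visible from a different session
```

A session-local view created on s1 would not be visible from s2, which is exactly the gap global temp views fill.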
[jira] [Commented] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564479#comment-15564479 ] Sean Owen commented on SPARK-17858: --- Yeah, the related JIRA gives an argument that we shouldn't do this. You end up more easily silently ignoring data if it doesn't fail the query. I'm not that sure this is a good idea. > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a SQL query. However, the user may just > want to skip corrupt files and still run the query. > Another painful thing is that the current exception doesn't contain the paths of > the corrupt files, which makes it hard for the user to fix their files. > Note: In Spark 1.6, Spark SQL always skips corrupt files because of > SPARK-17850. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
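The requested behavior amounts to a per-file try/except with two improvements over the status quo: an opt-in switch, and an error message that names the offending path. A hypothetical sketch (option and function names are illustrative, not Spark's API):

```python
# Hypothetical sketch of optional corrupt-file skipping; not Spark code.
import logging

def read_files(paths, parse_file, ignore_corrupt_files=True):
    """Parse each file; skip or fail on corrupt ones depending on the flag."""
    rows, corrupt = [], []
    for path in paths:
        try:
            rows.extend(parse_file(path))
        except Exception:
            if not ignore_corrupt_files:
                # Include the offending path so the user can fix the file.
                raise RuntimeError(f"Corrupt file: {path}")
            corrupt.append(path)
            logging.warning("Skipping corrupt file: %s", path)
    return rows, corrupt

def parse(path):  # toy parser standing in for a real reader
    if "bad" in path:
        raise ValueError("malformed")
    return [path.upper()]

print(read_files(["a.json", "bad.json"], parse))  # (['A.JSON'], ['bad.json'])
```

Sean's concern above maps directly onto the default: if `ignore_corrupt_files` defaults to skipping, data is silently dropped, which is why the safer default is to fail with the path in the error.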
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564472#comment-15564472 ] Reynold Xin commented on SPARK-3577: [~dreamworks007] can you take a look at the problem here? https://github.com/apache/spark/pull/15347 > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter).
[jira] [Commented] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564464#comment-15564464 ] Apache Spark commented on SPARK-17864: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15426 > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17864: Assignee: Apache Spark (was: Reynold Xin) > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17864: Assignee: Reynold Xin (was: Apache Spark) > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17799) InterfaceStability annotation
[ https://issues.apache.org/jira/browse/SPARK-17799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17799: Labels: releasenotes (was: ) > InterfaceStability annotation > - > > Key: SPARK-17799 > URL: https://issues.apache.org/jira/browse/SPARK-17799 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > Based on discussions on the dev list > (http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-separate-API-annotation-into-two-components-InterfaceAudience-amp-InterfaceStability-td17470.html#none), > there is consensus to introduce an InterfaceStability annotation to > eventually replace the current DeveloperApi / Experimental annotations. > This is an umbrella ticket to track its progress.
[jira] [Updated] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17864: Labels: releasenotes (was: ) > Mark data type APIs as stable, rather than DeveloperApi > --- > > Key: SPARK-17864 > URL: https://issues.apache.org/jira/browse/SPARK-17864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > The data type API has not been changed since Spark 1.3.0, and is ready for > graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi
Reynold Xin created SPARK-17864: --- Summary: Mark data type APIs as stable, rather than DeveloperApi Key: SPARK-17864 URL: https://issues.apache.org/jira/browse/SPARK-17864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin The data type API has not been changed since Spark 1.3.0, and is ready for graduation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563038#comment-15563038 ] Gaoxiang Liu edited comment on SPARK-3577 at 10/11/16 4:49 AM: --- [~rxin] I find that the spill size metric is already added in https://github.com/apache/spark/commit/bb8098f203e6faddf2e1a04b03d62037e6c7#diff-1bd3dc38f6306e0a822f93d62c32b1d0, and I have confirmed it in the UI (please refer to the attachment of this JIRA - https://issues.apache.org/jira/secure/attachment/12832515/spill_size.jpg). Also, we noticed that it's weird that the spill size is somehow not reported in the reducer, but is reported in the mapper. Back to the previous question: for the spill time, if it's still relevant to add, I plan to work on it if there are no objections. was (Author: dreamworks007): I find that the spill size metric is already added in https://github.com/apache/spark/commit/bb8098f203e6faddf2e1a04b03d62037e6c7#diff-1bd3dc38f6306e0a822f93d62c32b1d0, and I have confirmed it in the UI (please refer to the attachment of this JIRA - https://issues.apache.org/jira/secure/attachment/12832515/spill_size.jpg). Also, we noticed that it's weird that the spill size is somehow not reported in the reducer, but is reported in the mapper. Back to the previous question: for the spill time, if it's still relevant to add, I plan to work on it if there are no objections. > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > Attachments: spill_size.jpg > > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used.
> We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
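For reference, the metric being proposed can be sketched in a few lines. This is an illustrative Python mock-up of a per-task spill-time accumulator (the class and field names are invented, not Spark's), in the spirit of the existing {{ShuffleWriteMetrics}} write-time counter:

```python
import time

class SpillMetrics:
    """Hypothetical per-task metric: total time spent spilling to disk."""

    def __init__(self):
        self.spill_time_ns = 0  # accumulated across all spills of the task
        self.spill_count = 0

    def record_spill(self, spill_fn):
        # time one spill and fold it into the task-level total
        start = time.perf_counter_ns()
        result = spill_fn()
        self.spill_time_ns += time.perf_counter_ns() - start
        self.spill_count += 1
        return result
```

The UI could then report `spill_time_ns` alongside shuffle write time, rather than losing it.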
[jira] [Commented] (SPARK-17816) Json serialization of accumulators is failing with ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564410#comment-15564410 ] Apache Spark commented on SPARK-17816: -- User 'seyfe' has created a pull request for this issue: https://github.com/apache/spark/pull/15425 > Json serialzation of accumulators are failing with > ConcurrentModificationException > -- > > Key: SPARK-17816 > URL: https://issues.apache.org/jira/browse/SPARK-17816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ergin Seyfe >Assignee: Ergin Seyfe > Fix For: 2.1.0 > > > This is the stack trace: See {{ConcurrentModificationException}}: > {code} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) > at scala.collection.AbstractTraversable.to(Traversable.scala:104) > at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) > at scala.collection.AbstractTraversable.toList(Traversable.scala:104) > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at > 
org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) > at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
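The trace above shows the event-logging thread iterating an accumulator collection while a task thread mutates it. The same failure mode, and the usual remedy of serializing a snapshot rather than the live collection, can be reproduced in pure Python — illustration only, not the actual Spark fix:

```python
def to_json_live(accums):
    # Iterating the live dict while it grows raises RuntimeError,
    # Python's analogue of Java's ConcurrentModificationException.
    out = {}
    for name in accums:
        out[name] = accums[name]
        accums["extra_" + name] = 0  # simulate a concurrent update
    return out

def to_json_snapshot(accums):
    snapshot = dict(accums)  # copy first, then iterate the copy safely
    for name in snapshot:
        accums["extra_" + name] = 0  # concurrent updates no longer break us
    return snapshot
```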
[jira] [Closed] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-4630. -- Resolution: Duplicate Assignee: (was: Kostas Sakellis) > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions and the > number of tasks: larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large the partitions executed by a task should be. > If partitions are too small, the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, task > execution slows down due to GC pressure and spilling to disk. > To increase job performance, users often hand-optimize the number (size) > of partitions that the next stage gets. Factors that come into play are: > - incoming partition sizes from the previous stage > - number of available executors > - available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to do the > partition sizing for the user automatically. This feature can be turned on/off with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take > partition sizes into account upon stage completion. Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number of tasks to create. Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work.
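The heuristic proposed in the ticket can be sketched compactly. A hedged illustration — the 128 MB target and all names are invented for the example; the real design would live in the DAGScheduler and is deferred to the design doc:

```python
def dynamic_num_partitions(map_output_bytes, target_partition_bytes=128 << 20,
                           min_partitions=1):
    """Pick a task count for the next stage from the previous stage's
    per-partition output sizes (bytes) and a target partition size."""
    total = sum(map_output_bytes)
    # ceiling division: enough partitions that the average partition
    # stays at or below the target size
    return max(min_partitions, -(-total // target_partition_bytes))
```

A fuller version would also cap the count by available executor memory, as the factor list above suggests.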
[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564350#comment-15564350 ] Takeshi Yamamuro commented on SPARK-14393: -- Since a coalesce() placed just after monotonicallyIncreasingId() breaks its semantics, it's okay not to support that. For example, a coalesce() before monotonicallyIncreasingId() is fine;
{code}
scala> spark.range(10).repartition(5).coalesce(1).select(monotonicallyIncreasingId()).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-----------------------------+
|monotonically_increasing_id()|
+-----------------------------+
|                            0|
|                            1|
|                            2|
|                            3|
|                            4|
|                            5|
|                            6|
|                            7|
|                            8|
|                            9|
+-----------------------------+
{code}
> monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jason Piper > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0), leading to non-monotonically > increasing IDs.
> See examples below:
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                25769803776|
> |                51539607552|
> |                77309411328|
> |               103079215104|
> |               128849018880|
> |               163208757248|
> |               188978561024|
> |               214748364800|
> |               240518168576|
> |               266287972352|
> +---------------------------+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> +---------------------------+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          1|
> |                          0|
> |                          0|
> |                          1|
> |                          2|
> |                          3|
> |                          0|
> |                          1|
> |                          2|
> +---------------------------+
> {code}
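The outputs above follow from how the ID is built: monotonically_increasing_id places the partition index in the upper 31 bits of a 64-bit long and a per-partition record counter in the lower 33 bits, so partition 3 starts at 3 * 2^33 = 25769803776. When a coalesce runs after the expression, the merged partitions each restart their counter at 0, producing the duplicate zeros. A minimal sketch of the documented encoding:

```python
RECORD_BITS = 33  # lower 33 bits: record number within the partition

def monotonic_id(partition_index, record_number):
    """Sketch of the monotonically_increasing_id bit layout."""
    return (partition_index << RECORD_BITS) + record_number
```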
[jira] [Resolved] (SPARK-17816) Json serialization of accumulators is failing with ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17816. -- Resolution: Fixed Assignee: Ergin Seyfe Fix Version/s: 2.1.0 > Json serialzation of accumulators are failing with > ConcurrentModificationException > -- > > Key: SPARK-17816 > URL: https://issues.apache.org/jira/browse/SPARK-17816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Ergin Seyfe >Assignee: Ergin Seyfe > Fix For: 2.1.0 > > > This is the stack trace: See {{ConcurrentModificationException}}: > {code} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) > at scala.collection.AbstractTraversable.to(Traversable.scala:104) > at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) > at scala.collection.AbstractTraversable.toList(Traversable.scala:104) > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at scala.Option.map(Option.scala:146) > 
at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) > at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15814) Aggregator can return null result
[ https://issues.apache.org/jira/browse/SPARK-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15814: - Fix Version/s: 2.0.1 2.1.0 > Aggregator can return null result > - > > Key: SPARK-15814 > URL: https://issues.apache.org/jira/browse/SPARK-15814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15577. -- Resolution: Not A Problem > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564316#comment-15564316 ] Hyukjin Kwon commented on SPARK-15577: -- Let me please close this as a not-a-problem. Please revoke my change if anyone feels it is not right. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564302#comment-15564302 ] Jakob Odersky commented on SPARK-15577: --- this cleaning of jiras is really good to see :) Considering that spark 2.0 has already shipped with the type alias, I think it is safe to close this ticket. We can always reopen it if necessary. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15558875#comment-15558875 ] Xiao Li edited comment on SPARK-9265 at 10/11/16 3:14 AM: -- This has been resolved since Limit and Sort are executed as a `TakeOrderedAndProject` operator. Closing it now. Thanks! was (Author: smilegator): This has been resolved since our Optimizer pushes `Limit` down below `Sort`. Closing it now. Thanks!
> Dataframe.limit joined with another dataframe can be non-deterministic > -- > > Key: SPARK-9265 > URL: https://issues.apache.org/jira/browse/SPARK-9265 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Priority: Critical > >
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), $"a.suiteName" === $"b.suiteName")
> (1 to 10).foreach { i =>
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows:
> {code}
> +----------------+---------+
> |       suiteName|failCount|
> +----------------+---------+
> |org.apache.spark|       85|
> |org.apache.spark|       26|
> |org.apache.spark|       26|
> |org.apache.spark|       17|
> |org.apache.spark|       17|
> |org.apache.spark|       15|
> |org.apache.spark|       13|
> |org.apache.spark|       13|
> |org.apache.spark|       11|
> |org.apache.spark|        9|
> +----------------+---------+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}
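For intuition, the resolution described above turns orderBy(...).limit(k) into a single deterministic top-k operation instead of an arbitrary row cutoff. A plain-Python sketch in the spirit of `TakeOrderedAndProject` (illustration, not Spark's implementation):

```python
import heapq

def take_ordered(rows, k, key):
    # deterministic top-k by key, independent of input partition order
    return heapq.nsmallest(k, rows, key=key)
```

Because the result depends only on the key ordering, repeated joins against it count the same rows every time, unlike a bare limit.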
[jira] [Resolved] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16896. - Resolution: Fixed Fix Version/s: 2.1.0 > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
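The read.csv behaviour the reporter describes — appending a counter to repeated header names — can be sketched briefly. Illustrative only; the exact suffix scheme Spark adopted for this fix may differ:

```python
def dedup_columns(names):
    """Append a numeric suffix to repeated column names.
    Naive: does not guard against a suffixed name that already
    exists in the input header."""
    seen = {}
    out = []
    for name in names:
        n = seen.get(name, 0)
        out.append(name if n == 0 else "%s%d" % (name, n))
        seen[name] = n + 1
    return out
```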
[jira] [Updated] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16896: Assignee: Hyukjin Kwon > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Assignee: Hyukjin Kwon > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17738: - Component/s: Tests > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17738: - Labels: flaky-test (was: ) > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17738: - Issue Type: Test (was: Bug) > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17738. -- Resolution: Fixed Fix Version/s: 2.0.2 > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Davies Liu >Assignee: Davies Liu > Labels: flaky-test > Fix For: 2.0.2, 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17772) Add helper testing methods for instance weighting
[ https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564174#comment-15564174 ] Seth Hendrickson commented on SPARK-17772: -- I'm working on this. > Add helper testing methods for instance weighting > - > > Key: SPARK-17772 > URL: https://issues.apache.org/jira/browse/SPARK-17772 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > More and more ML algos are accepting instance weights. We keep replicating > code to test instance weighting in every test suite, which will get out of > hand rather quickly. We can and should implement some generic instance weight > test helper methods so that we can reduce duplicated code and standardize > these tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564169#comment-15564169 ] Ioana Delaney commented on SPARK-17626: --- [~ron8hu] Thank you for your comments. Our current star schema detection uses simple, basic heuristics to identify the star join with the largest fact table and places it on the driving arm of the join. Even with this simple, intuitive join reordering, the performance results show very good improvement. The next step would be to improve the schema detection logic based on cardinality heuristics and then with more reliable informational RI constraints. When CBO implements the planning rules, the two algorithms can be integrated. Regarding predicate selectivity hint, it can also be used to diagnose query planning/performance. By changing predicate selectivity, we can influence the choice of a plan and thus test various planning options. > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ioana Delaney >Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. 
This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. > \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time 1,467 > Min time 3 > Mean time 145 > Geomean 44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged 48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. Selectivity hint is optional and it was not used in the above TPC-DS > tests. > \\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
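[Editor's note] The heuristic described above — put the star join with the largest fact table on the driving arm, joining the more selective dimensions first — can be sketched in miniature. This is an illustrative toy, not the actual Spark patch; the table names, row counts, and the `reorder_star_join` helper are all assumptions for the example.

```python
# Toy sketch of the star-schema join-reordering heuristic: treat the
# largest table as the fact table and drive the join from it, taking the
# smaller (more selective) dimension tables first.

def reorder_star_join(tables):
    """tables: dict of table name -> estimated row count.
    Returns a join order with the fact table on the driving arm."""
    fact = max(tables, key=tables.get)                        # largest = fact table
    dims = sorted((t for t in tables if t != fact), key=tables.get)
    return [fact] + dims                                      # fact first, then dims by size

order = reorder_star_join({
    "store_sales": 2_800_000_000,   # hypothetical fact table
    "date_dim": 73_000,
    "item": 300_000,
    "customer": 12_000_000,
})
print(order)  # ['store_sales', 'date_dim', 'item', 'customer']
```

A real implementation would use cardinality estimates and, later, RI constraints rather than raw row counts, as the comment above anticipates.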
[jira] [Created] (SPARK-17863) SELECT distinct does not work if there is an order by clause
Yin Huai created SPARK-17863: Summary: SELECT distinct does not work if there is a order by clause Key: SPARK-17863 URL: https://issues.apache.org/jira/browse/SPARK-17863 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical {code} select distinct struct.a, struct.b from ( select named_struct('a', 1, 'b', 2, 'c', 3) as struct union all select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp order by struct.a, struct.b {code} This query generates {code} +---+---+ | a| b| +---+---+ | 1| 2| | 1| 2| +---+---+ {code} The plan is wrong {code} == Parsed Logical Plan == 'Sort ['struct.a ASC, 'struct.b ASC], true +- 'Distinct +- 'Project ['struct.a, 'struct.b] +- 'SubqueryAlias tmp +- 'Union :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Analyzed Logical Plan == a: int, b: int Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Distinct +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805] +- SubqueryAlias tmp +- Union :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Optimized Logical Plan == Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805] +- Union :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- OneRowRelation$ +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- OneRowRelation$ == Physical Plan == *Project [a#21819, b#21820] +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0 +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Exchange hashpartitioning(a#21819, 
b#21820, struct#21805, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Union :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- Scan OneRowRelation[] +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- Scan OneRowRelation[] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
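[Editor's note] The plan dump above shows the failure mode: the whole `struct` column is kept in the Aggregate keys so the ORDER BY can still resolve it, which means "distinct" runs over (a, b, struct) rather than (a, b). A plain-Python illustration of that mechanism (not Spark code; the row layout is an assumption for the example):

```python
# Two source rows agree on (a, b) but differ in the struct's c field.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 4},
]

# Correct semantics: distinct over the projected columns only.
correct = {(r["a"], r["b"]) for r in rows}

# What the buggy plan computes: distinct over (a, b, whole struct),
# so both rows survive and the query returns a duplicate (1, 2) pair.
buggy = {(r["a"], r["b"], tuple(r.values())) for r in rows}

print(len(correct))  # 1
print(len(buggy))    # 2 -- matches the duplicated output shown above
```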
[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is an order by clause
[ https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17863: - Labels: correctness (was: ) > SELECT distinct does not work if there is a order by clause > --- > > Key: SPARK-17863 > URL: https://issues.apache.org/jira/browse/SPARK-17863 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > Labels: correctness > > {code} > select distinct struct.a, struct.b > from ( > select named_struct('a', 1, 'b', 2, 'c', 3) as struct > union all > select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp > order by struct.a, struct.b > {code} > This query generates > {code} > +---+---+ > | a| b| > +---+---+ > | 1| 2| > | 1| 2| > +---+---+ > {code} > The plan is wrong > {code} > == Parsed Logical Plan == > 'Sort ['struct.a ASC, 'struct.b ASC], true > +- 'Distinct >+- 'Project ['struct.a, 'struct.b] > +- 'SubqueryAlias tmp > +- 'Union > :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805] > : +- OneRowRelation$ > +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806] >+- OneRowRelation$ > == Analyzed Logical Plan == > a: int, b: int > Project [a#21819, b#21820] > +- Sort [struct#21805.a ASC, struct#21805.b ASC], true >+- Distinct > +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, > struct#21805] > +- SubqueryAlias tmp > +- Union >:- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805] >: +- OneRowRelation$ >+- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806] > +- OneRowRelation$ > == Optimized Logical Plan == > Project [a#21819, b#21820] > +- Sort [struct#21805.a ASC, struct#21805.b ASC], true >+- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, > struct#21805] > +- Union > :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] > : +- OneRowRelation$ > +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] > +- OneRowRelation$ > == Physical Plan == > *Project [a#21819, 
b#21820] > +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0 >+- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200) > +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], > output=[a#21819, b#21820, struct#21805]) > +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200) > +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], > functions=[], output=[a#21819, b#21820, struct#21805]) >+- Union > :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS > struct#21805] > : +- Scan OneRowRelation[] > +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS > struct#21806] > +- Scan OneRowRelation[] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17338) add global temp view
[ https://issues.apache.org/jira/browse/SPARK-17338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564138#comment-15564138 ] Apache Spark commented on SPARK-17338: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15424 > add global temp view > > > Key: SPARK-17338 > URL: https://issues.apache.org/jira/browse/SPARK-17338 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564115#comment-15564115 ] Hyukjin Kwon commented on SPARK-15577: -- WDYT - [~holdenk] > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564115#comment-15564115 ] Hyukjin Kwon edited comment on SPARK-15577 at 10/11/16 1:32 AM: WDYT? - [~holdenk] was (Author: hyukjin.kwon): WDYT - [~holdenk] > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564097#comment-15564097 ] Weichen Xu edited comment on SPARK-17139 at 10/11/16 1:25 AM: -- I'm working on it hard and will create PR this week, thanks! was (Author: weichenxu123): I'm working on it hardly and will create PR this week, thanks! > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564103#comment-15564103 ] Hyukjin Kwon commented on SPARK-15577: -- Cool, then I guess we might be able to take an action on the status of this JIRA? > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564097#comment-15564097 ] Weichen Xu commented on SPARK-17139: I'm working on it hardly and will create PR this week, thanks! > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564071#comment-15564071 ] holdenk commented on SPARK-4630: I also agree this would be really good to revisit, from talking with users at conferences and feedback on High Performance Spark I think this is a very real issue that a lot of users find challenging to deal with. Just my $0.0064 :p :) > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions to the > number of tasks - larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large partitions should be that get executed > by a task. If partitions are too small, then the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, then task > execution slows down due to gc pressure and spilling to disk. > To increase performance of jobs, users often hand optimize the number(size) > of partitions that the next stage gets. Factors that come into play are: > Incoming partition sizes from previous stage > number of available executors > available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to automatically do the > partition sizing for the user. This feature can be turned off/on with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take into > account partition sizes upon stage completion. Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number tasks to create. 
Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
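[Editor's note] The sizing rule the proposal describes — derive the next stage's partition count from the previous stage's output size, bounded by available parallelism — might look roughly like the following. The 128 MB target and the helper name are illustrative assumptions, not Spark configuration.

```python
import math

def suggest_num_partitions(total_shuffle_bytes, num_executors, cores_per_executor,
                           target_bytes=128 * 1024 * 1024):
    """Pick a partition count that keeps partitions near target_bytes
    while still using every available core at least once."""
    by_size = math.ceil(total_shuffle_bytes / target_bytes)
    min_parallelism = num_executors * cores_per_executor
    return max(by_size, min_parallelism)

# 100 GB of shuffle output on 10 executors x 4 cores -> 800 partitions of ~128 MB
print(suggest_num_partitions(100 * 1024**3, 10, 4))  # 800
```

A production version would also have to account for per-executor memory (e.g. the spilling and GC pressure mentioned above), which this sketch deliberately omits.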
[jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query
[ https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16980: Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > Load only catalog table partition metadata required to answer a query > - > > Key: SPARK-16980 > URL: https://issues.apache.org/jira/browse/SPARK-16980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Michael Allman > > Currently, when a user reads from a partitioned Hive table whose metadata are > not cached (and for which Hive table conversion is enabled and supported), > all partition metadata are fetched from the metastore: > https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260 > However, if the user's query includes partition pruning predicates then we > only need the subset of these metadata which satisfy those predicates. > This issue tracks work to modify the current query planning scheme so that > unnecessary partition metadata are not loaded. > I've prototyped two possible approaches. The first extends > {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally > applicable. It requires some new abstractions and refactoring of > {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater > burden on other implementations of {{ExternalCatalog}}. Currently the only > other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my > prototype throws an {{UnsupportedOperationException}} on that implementation. > The second prototype is simpler and only touches code in the {{hive}} > project. Basically, conversion of a partitioned {{MetastoreRelation}} to > {{HadoopFsRelation}} is deferred to physical planning. 
During physical > planning, the partition pruning filters in the query plan are used to > identify the required partition metadata and a {{HadoopFsRelation}} is built > from those. The new query plan is then re-injected into the physical planner > and proceeds as normal for a {{HadoopFsRelation}}. > On the Spark dev mailing list, [~ekhliang] expressed a preference for the > approach I took in my first POC. (See > http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html) > Based on that, I'm going to open a PR with that patch as a starting point > for an architectural/design review. It will not be a complete patch ready for > integration into Spark master. Rather, I would like to get early feedback on > the implementation details so I can shape the PR before committing a large > amount of time on a finished product. I will open another PR for the second > approach for comparison if requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
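[Editor's note] The core idea in SPARK-16980 — don't fetch all partition metadata up front, only the subset satisfying the query's pruning predicates — reduces to something like this sketch. The catalog layout and predicate shape here are hypothetical stand-ins, not the actual Spark or Hive classes.

```python
# Hypothetical catalog: partition spec -> partition location.
all_partitions = {
    ("ds", "2016-10-09"): "/warehouse/t/ds=2016-10-09",
    ("ds", "2016-10-10"): "/warehouse/t/ds=2016-10-10",
    ("ds", "2016-10-11"): "/warehouse/t/ds=2016-10-11",
}

def prune_partitions(partitions, predicate):
    """Keep metadata only for partitions whose spec satisfies the
    partition-pruning predicate, instead of loading everything."""
    return {spec: loc for spec, loc in partitions.items() if predicate(spec)}

# WHERE ds = '2016-10-11' touches one partition, not all three.
pruned = prune_partitions(all_partitions, lambda spec: spec[1] == "2016-10-11")
print(len(pruned))  # 1
```

With millions of partitions, the difference between listing all of them and listing only the pruned subset is exactly the cold-start cost the ticket targets.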
[jira] [Created] (SPARK-17862) Feature flag SPARK-16980
Reynold Xin created SPARK-17862: --- Summary: Feature flag SPARK-16980 Key: SPARK-17862 URL: https://issues.apache.org/jira/browse/SPARK-17862 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Push data source partitions into metastore for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Description: Initially, Spark SQL does not store any partition information in the catalog for data source tables, because initially it was designed to work with arbitrary files. This, however, has a few issues for catalog tables: 1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start. 2. Does not support heterogeneous partition naming schemes. 3. Cannot leverage pushing partition pruning into the metastore. This ticket tracks the work required to push the tracking of partitions into the metastore. This change should be feature flagged. was: Initially, Spark SQL does not store any partition information in the catalog for data source tables, because initially it was designed to work with arbitrary files. This, however, has a few issues for catalog tables: 1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start. 2. Does not support heterogeneous partition naming schemes. 3. Cannot leverage pushing partition pruning into the metastore. This ticket tracks the work required to push the tracking of partitions into the metastore. > Push data source partitions into metastore for catalog tables > - > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. 
Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
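[Editor's note] Item 3 above — pushing partition pruning into the metastore — amounts to translating the query's partition predicates into a filter the metastore can evaluate server-side (in the style of Hive's listPartitionsByFilter call), so only matching partitions come back. The helper and quoting rules below are simplified assumptions, not real client code.

```python
def to_metastore_filter(predicates):
    """predicates: list of (partition_column, value) equality predicates.
    Returns a Hive-style filter string for a listPartitionsByFilter call."""
    clauses = []
    for col, value in predicates:
        if isinstance(value, str):
            clauses.append('{} = "{}"'.format(col, value))   # quote string values
        else:
            clauses.append("{} = {}".format(col, value))     # numeric values unquoted
    return " and ".join(clauses)

print(to_metastore_filter([("ds", "2016-10-11"), ("hr", 11)]))
# ds = "2016-10-11" and hr = 11
```

Only predicates over partition columns can be pushed this way; everything else still has to be evaluated by Spark after the pruned partition list comes back.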
[jira] [Updated] (SPARK-17861) Push data source partitions into metastore for catalog tables and support partition pruning
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Summary: Push data source partitions into metastore for catalog tables and support partition pruning (was: Push data source partitions into metastore for catalog tables) > Push data source partitions into metastore for catalog tables and support > partition pruning > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Store data source partitions in metastore and push partition pruning into metastore
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Summary: Store data source partitions in metastore and push partition pruning into metastore (was: Store data source partitions in metastore and push partition pruning into the metastore) > Store data source partitions in metastore and push partition pruning into > metastore > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Store data source partitions in metastore and push partition pruning into the metastore
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Summary: Store data source partitions in metastore and push partition pruning into the metastore (was: Push data source partitions into metastore for catalog tables and support partition pruning) > Store data source partitions in metastore and push partition pruning into the > metastore > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17861) Push data source partitions into metastore for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564058#comment-15564058 ] Reynold Xin commented on SPARK-17861: - cc [~michael] this is the main work I want to get in for 2.1. > Push data source partitions into metastore for catalog tables > - > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17861) Push data source partitions into metastore for catalog tables
Reynold Xin created SPARK-17861: --- Summary: Push data source partitions into metastore for catalog tables Key: SPARK-17861 URL: https://issues.apache.org/jira/browse/SPARK-17861 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Initially, Spark SQL does not store any partition information in the catalog for data source tables, because initially it was designed to work with arbitrary files. This, however, has a few issues for catalog tables: 1. Listing partitions for a large table (with millions of partitions) can be very slow during cold start. 2. Does not support heterogeneous partition naming schemes. 3. Cannot leverage pushing partition pruning into the metastore. This ticket tracks the work required to push the tracking of partitions into the metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17861) Push data source partitions into metastore for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17861: Priority: Critical (was: Major) > Push data source partitions into metastore for catalog tables > - > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
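The difference between the two listing strategies described in this ticket can be sketched in miniature. This is a toy model in plain Python; the `Metastore` class and its methods are hypothetical stand-ins, not Spark or Hive APIs. It shows why handing the pruning predicate to the catalog beats listing every partition client-side:

```python
# Toy model of the two strategies. The Metastore class and its API are
# hypothetical illustrations, not Spark's actual interfaces.

class Metastore:
    def __init__(self, partitions):
        # partitions: dict mapping a partition spec tuple, e.g.
        # ("dt", "2016-10-01"), to a storage path
        self.partitions = partitions

    def list_all(self):
        # Cold-start cost is proportional to the total partition count.
        return list(self.partitions.items())

    def list_by_filter(self, predicate):
        # Pruning pushed into the metastore: only matching specs come back.
        return [(spec, path) for spec, path in self.partitions.items()
                if predicate(spec)]


store = Metastore({("dt", f"2016-10-{d:02d}"): f"/data/dt=2016-10-{d:02d}"
                   for d in range(1, 31)})

# Client-side pruning: fetch everything, then filter.
all_parts = store.list_all()
client_pruned = [p for p in all_parts if p[0][1] == "2016-10-05"]

# Metastore-side pruning: fetch only what the predicate allows.
server_pruned = store.list_by_filter(lambda spec: spec[1] == "2016-10-05")

assert client_pruned == server_pruned
print(len(all_parts), len(server_pruned))  # partitions listed vs fetched
```

Both strategies return the same partition, but the first pays for listing all thirty; with millions of partitions that listing dominates cold-start time, which is exactly issue 1 in the ticket.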
[jira] [Updated] (SPARK-16980) Load only catalog table partition metadata required to answer a query
[ https://issues.apache.org/jira/browse/SPARK-16980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16980: Assignee: Michael Allman > Load only catalog table partition metadata required to answer a query > - > > Key: SPARK-16980 > URL: https://issues.apache.org/jira/browse/SPARK-16980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Michael Allman > > Currently, when a user reads from a partitioned Hive table whose metadata are > not cached (and for which Hive table conversion is enabled and supported), > all partition metadata are fetched from the metastore: > https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260 > However, if the user's query includes partition pruning predicates then we > only need the subset of these metadata which satisfy those predicates. > This issue tracks work to modify the current query planning scheme so that > unnecessary partition metadata are not loaded. > I've prototyped two possible approaches. The first extends > {{o.a.s.s.c.catalog.ExternalCatalog}} and as such is more generally > applicable. It requires some new abstractions and refactoring of > {{HadoopFsRelation}} and {{FileCatalog}}, among others. It places a greater > burden on other implementations of {{ExternalCatalog}}. Currently the only > other implementation of {{ExternalCatalog}} is {{InMemoryCatalog}}, and my > prototype throws an {{UnsupportedOperationException}} on that implementation. > The second prototype is simpler and only touches code in the {{hive}} > project. Basically, conversion of a partitioned {{MetastoreRelation}} to > {{HadoopFsRelation}} is deferred to physical planning. 
During physical > planning, the partition pruning filters in the query plan are used to > identify the required partition metadata and a {{HadoopFsRelation}} is built > from those. The new query plan is then re-injected into the physical planner > and proceeds as normal for a {{HadoopFsRelation}}. > On the Spark dev mailing list, [~ekhliang] expressed a preference for the > approach I took in my first POC. (See > http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html) > Based on that, I'm going to open a PR with that patch as a starting point > for an architectural/design review. It will not be a complete patch ready for > integration into Spark master. Rather, I would like to get early feedback on > the implementation details so I can shape the PR before committing a large > amount of time on a finished product. I will open another PR for the second > approach for comparison if requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
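The second prototype's idea, deferring the metastore round trip until physical planning when the pruning filters are known, can be sketched as follows. All class and function names are illustrative, not Spark's:

```python
# Minimal sketch of deferring partition-metadata loading until pruning
# predicates are available (names are illustrative, not Spark's classes).

class LazyPartitionedRelation:
    def __init__(self, fetch_partitions):
        self._fetch = fetch_partitions   # callable: predicate -> partitions
        self.fetch_calls = []            # record how much metadata we loaded

    def plan(self, pruning_predicate):
        # Only now, at "physical planning" time, do we touch the metastore,
        # and only for partitions the predicate keeps.
        parts = self._fetch(pruning_predicate)
        self.fetch_calls.append(len(parts))
        return parts


PARTITIONS = [{"year": y} for y in range(2000, 2017)]

def fetch(pred):
    # Stand-in for a filtered metastore listing.
    return [p for p in PARTITIONS if pred(p)]

rel = LazyPartitionedRelation(fetch)
planned = rel.plan(lambda p: p["year"] >= 2015)
assert [p["year"] for p in planned] == [2015, 2016]
```

The key property is that constructing the relation costs nothing; metadata is only materialized for the two partitions the filter keeps, out of seventeen.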
[jira] [Commented] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563974#comment-15563974 ] Joseph K. Bradley commented on SPARK-17801: --- Btw, that maxBins setting is way too high. It should be in the hundreds at most. > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > 
"-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563966#comment-15563966 ] Joseph K. Bradley edited comment on SPARK-17801 at 10/11/16 12:10 AM: -- Have you tried this with Spark 1.6.2 or 2.0.0 or 2.0.1? was (Author: josephkb): Have you tried this with Spark 2.0? > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > 
"-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563966#comment-15563966 ] Joseph K. Bradley commented on SPARK-17801: --- Have you tried this with Spark 2.0? > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > 
"-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14610. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 12374 [https://github.com/apache/spark/pull/12374] > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.1.0 > > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature
[ https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14610: -- Assignee: Seth Hendrickson > Remove superfluous split from random forest findSplitsForContinousFeature > - > > Key: SPARK-14610 > URL: https://issues.apache.org/jira/browse/SPARK-14610 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > Currently, the method findSplitsForContinuousFeature in random forest > produces an unnecessary split. For example, if a continuous feature has > unique values: (1, 2, 3), then the possible splits generated by this method > are: > * {1|2,3} > * {1,2|3} > * {1,2,3|} > The following unit test is quite clearly incorrect: > {code:title=rf.scala|borderStyle=solid} > val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble) > val splits = > RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0) > assert(splits.length === 3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
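The corrected behavior can be illustrated with a simplified stand-alone version of the split finder: thresholds go between distinct values, so k distinct values yield at most k - 1 candidate splits, and the degenerate {1,2,3|} split disappears. This mimics the idea only; Spark's actual method weights candidate thresholds by value counts:

```python
# Sketch of the fix described above: candidate split thresholds for a
# continuous feature fall *between* consecutive distinct values, never
# after the largest one (which would send every row to one side).

def find_splits_for_continuous_feature(samples):
    distinct = sorted(set(samples))
    # Midpoints between consecutive distinct values; illustrative only,
    # not Spark's exact (count-weighted) implementation.
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

feature_samples = [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3]
splits = find_splits_for_continuous_feature(feature_samples)
assert len(splits) == 2   # the old test asserted 3, counting {1,2,3|}
print(splits)
```

With the superfluous split removed, the unit test quoted in the ticket would assert `splits.length === 2`.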
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563944#comment-15563944 ] Hossein Falaki commented on SPARK-17781: Yes, but somehow inside {{worker.R}} Date fields in the list are treated as Double. Case in point is the reproducing example in the body of the ticket. To be honest, I am still confused about this too. > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
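Some context for the symptom: R stores a `Date` internally as the number of days since 1970-01-01, so a Date that gets degraded to a plain double on the worker surfaces as a number like 17084. A pure-Python sketch of the recovery (the SparkR worker internals differ; this only illustrates the encoding):

```python
# R's Date is days since the Unix epoch, which is why a Date field that
# loses its class attribute round-trips as a raw double.

from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def double_to_date(days):
    # Recover the calendar date from the days-since-epoch double.
    return EPOCH + timedelta(days=int(days))

# 17084.0 is what Sys.Date() for 2016-10-10 looks like once degraded
# to a bare double.
assert double_to_date(17084.0) == date(2016, 10, 10)
```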
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563919#comment-15563919 ] Seth Hendrickson commented on SPARK-9478: - I'm going to revive this, and hopefully submit a PR soon. > Add class weights to Random Forest > -- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
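For reference, the core change class weights imply for tree training is that impurity (and leaf predictions) are computed from weighted counts instead of raw counts. A minimal, hypothetical sketch with a weighted Gini impurity, not Spark's implementation:

```python
# Weighted Gini impurity: each label contributes its class weight
# instead of a raw count of 1.

from collections import Counter

def weighted_gini(labels, class_weight):
    w = Counter()
    for y in labels:
        w[y] += class_weight.get(y, 1.0)   # default weight 1.0
    total = sum(w.values())
    return 1.0 - sum((c / total) ** 2 for c in w.values())

labels = [0] * 90 + [1] * 10               # imbalanced: 90% negatives

unweighted = weighted_gini(labels, {})     # 1 - (0.9^2 + 0.1^2) = 0.18
upweighted = weighted_gini(labels, {1: 9.0})   # minority weighted to parity

assert abs(unweighted - 0.18) < 1e-9
assert abs(upweighted - 0.5) < 1e-9        # weighted 90 vs 90: max impurity
```

Under the up-weighting, a split that isolates positives now reduces impurity far more, which is how class weights counteract imbalance such as the true-positive-rate scenario mentioned in the ticket.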
[jira] [Assigned] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17860: Assignee: Apache Spark > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with the database name from the qualified table name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563915#comment-15563915 ] Apache Spark commented on SPARK-17860: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/15423 > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with the database name from the qualified table name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17860: Assignee: (was: Apache Spark) > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > SHOW COLUMNS command validates the user supplied database > name with the database name from the qualified table name to make > sure both of them are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563910#comment-15563910 ] Jakob Odersky commented on SPARK-15577: --- This was considered and trade-offs were actively discussed, but ultimately the type alias was chosen over subclassing. I think the main argument in favor of aliasing was to avoid incompatibilities in future libraries, i.e. a utility function is written to accept a {{DataFrame}}, but I want to pass in a {{Dataset\[Row\]}}. [This email thread| http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-DataFrame-vs-Dataset-in-Spark-2-0-td16445.html] contains the whole discussion. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15577) Java can't import DataFrame type alias
[ https://issues.apache.org/jira/browse/SPARK-15577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563910#comment-15563910 ] Jakob Odersky edited comment on SPARK-15577 at 10/10/16 11:41 PM: -- This was considered and trade-offs were actively discussed, but ultimately the type alias was chosen over subclassing. I think a principal argument in favor of aliasing was to avoid incompatibilities in future libraries, i.e. a utility function is written to accept a {{DataFrame}}, but I want to pass in a {{Dataset\[Row\]}}. [This email thread| http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-DataFrame-vs-Dataset-in-Spark-2-0-td16445.html] contains the whole discussion. was (Author: jodersky): This was considered and trade-offs were actively discussed, but ultimately the type alias was chosen over subclassing. I think the main argument in favor of aliasing was to avoid incompatibilities in future libraries, i.e. a utility function is written to accept a {{DataFrame}}, but I want to pass in a {{Dataset\[Row\]}}. [This email thread| http://apache-spark-developers-list.1001551.n3.nabble.com/discuss-DataFrame-vs-Dataset-in-Spark-2-0-td16445.html] contains the whole discussion. > Java can't import DataFrame type alias > -- > > Key: SPARK-15577 > URL: https://issues.apache.org/jira/browse/SPARK-15577 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > After SPARK-13244, all Java code needs to be updated to use Dataset > instead of DataFrame as we used a type alias. Should we consider adding a > DataFrame to the Java API which just extends Dataset for compatibility? > cc [~liancheng] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
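The compatibility argument in the comment above can be made concrete with a small Python analogy (in Scala the alias is `type DataFrame = Dataset[Row]`; the classes below are toy stand-ins): with an alias the two names denote the same type, so values flow in both directions, while a subclass only accepts one direction.

```python
# Alias vs subclass, in miniature.

class Dataset:
    def __init__(self, rows):
        self.rows = rows

# Alias approach (what Spark 2.0 chose): the names are the same type.
DataFrame = Dataset

def old_library_fn(df: "DataFrame"):
    # A pre-existing utility written against DataFrame.
    return len(df.rows)

ds = Dataset([{"a": 1}, {"a": 2}])
assert old_library_fn(ds) == 2      # a Dataset *is* a DataFrame: no conversion

# Subclass approach would break that direction:
class DataFrameSub(Dataset):
    pass

assert isinstance(DataFrameSub([]), Dataset)   # subclass -> parent: fine
assert not isinstance(ds, DataFrameSub)        # parent -> subclass: rejected
```

This is the incompatibility the comment describes: with a subclass, a library function taking `DataFrame` could not be handed a `Dataset[Row]` without a wrapping step.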
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563894#comment-15563894 ] Jo Desmet commented on SPARK-15343: --- Still not acceptable, I mean how can we. This tool has been presented as a tool with near-perfect Hadoop integration, which we are about to drop. It absolutely does not make sense to drop this as 'not-a-problem': it is a clearly described regression for a mainstream deployment. I am all for galvanizing the Hadoop community in upping their libraries, and Spark is a good motivation, but this is all too harsh. There is absolutely no way back once we go that direction. It might mean losing and alienating our existing user-base. For example, big corporations are using vetted Hadoop Stacks. I do work for one such bigger corporation, and I know that that kind of thing means a lot to them. Before we know it, we will end up with a complex maze of what version of hadoop works with what version of Mesos, Hadoop, etc. If Java 9 provides the solution, or providing shaded libraries, then we should wait till that is in place before moving forward. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. 
> Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. 
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) >
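A workaround often reported for this particular failure (hedged: the right fix depends on the Hadoop/YARN build in use, so verify against your cluster) is to keep the YARN client from instantiating the timeline-service client, which is the code path in the stack trace that needs the missing Jersey 1.x classes, or to restore Jersey 1.x client jars on the classpath explicitly:

```
# spark-defaults.conf -- workaround sketch; jar paths are placeholders
spark.hadoop.yarn.timeline-service.enabled  false

# or, alternatively, put the Jersey 1.x client jars back explicitly:
# spark.driver.extraClassPath  /path/to/jersey-client-1.9.jar:/path/to/jersey-core-1.9.jar
```

Disabling the timeline service only sidesteps `TimelineClient.createTimelineClient`; it does not resolve the broader shading/versioning concern the comment raises.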
[jira] [Created] (SPARK-17860) SHOW COLUMN's database conflict check should use case sensitive compare.
Dilip Biswal created SPARK-17860: Summary: SHOW COLUMN's database conflict check should use case sensitive compare. Key: SPARK-17860 URL: https://issues.apache.org/jira/browse/SPARK-17860 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Dilip Biswal Priority: Minor The SHOW COLUMNS command validates the user-supplied database name against the database name from the qualified table name to make sure both are consistent. This comparison should respect case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting
[ https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-17860: - Summary: SHOW COLUMN's database conflict check should respect case sensitivity setting (was: SHOW COLUMN's database conflict check should use case sensitive compare.) > SHOW COLUMN's database conflict check should respect case sensitivity setting > - > > Key: SPARK-17860 > URL: https://issues.apache.org/jira/browse/SPARK-17860 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > The SHOW COLUMNS command validates the user-supplied database > name against the database name from the qualified table name to make > sure both are consistent. This comparison should respect > case sensitivity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
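The fix described above amounts to comparing the two database names with a comparison governed by the session's `spark.sql.caseSensitive` setting, rather than a hard-coded case-sensitive equality. A minimal sketch of that check, using a hypothetical `database_names_conflict` helper (not Spark's actual implementation):

```python
def database_names_conflict(user_db, table_db, case_sensitive):
    """Return True if the user-supplied database name conflicts with the
    database qualifier of the table name, respecting the session's
    case-sensitivity flag. Hypothetical helper for illustration only."""
    if case_sensitive:
        return user_db != table_db
    # With case sensitivity off, compare names case-insensitively.
    return user_db.lower() != table_db.lower()

# With spark.sql.caseSensitive=false (the default), these do not conflict:
assert not database_names_conflict("SALES", "sales", case_sensitive=False)
# With case sensitivity enabled, they do:
assert database_names_conflict("SALES", "sales", case_sensitive=True)
```

The retitled summary ("respect case sensitivity setting") reflects exactly this: the conflict check should delegate to the configured resolver instead of assuming one fixed comparison.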
[jira] [Commented] (SPARK-17857) SHOW TABLES IN schema throws exception if schema doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563877#comment-15563877 ] Todd Nemet commented on SPARK-17857: I didn't even think to check Hive 1.2, since I figured it would have the same behavior as Spark 1.5 and 1.6. I'm really surprised by that. For context, due to [SPARK-9686|https://issues.apache.org/jira/browse/SPARK-9686] we use SHOW and DESCRIBE to walk the schema in Spark instead of connection.getMetaData(), like we do in Hive and Impala. We are catching the exception to work around it, but since it's a change from 1.x I thought I'd file it as a minor bug. But since it's not a difference from Hive, perhaps it's not a bug. > SHOW TABLES IN schema throws exception if schema doesn't exist > -- > > Key: SPARK-17857 > URL: https://issues.apache.org/jira/browse/SPARK-17857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Todd Nemet >Priority: Minor > > SHOW TABLES IN badschema; throws > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException if badschema > doesn't exist. In Spark 1.x it would return an empty result set. 
> On Spark 2.0.1: > {code} > [683|12:45:56] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/ -n hive > Connecting to jdbc:hive2://localhost:10006/ > 16/10/10 12:46:00 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/10 12:46:00 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/10 12:46:00 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/ > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1.spark2 by Apache Hive > 0: jdbc:hive2://localhost:10006/> show schemas; > +---+--+ > | databaseName | > +---+--+ > | default | > | looker_scratch| > | spark_jira| > | spark_looker_scratch | > | spark_looker_test | > +---+--+ > 5 rows selected (0.61 seconds) > 0: jdbc:hive2://localhost:10006/> show tables in spark_looker_test; > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.611 seconds) > 0: jdbc:hive2://localhost:10006/> show tables in badschema; > Error: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: > Database 'badschema' not found; (state=,code=0) > {code} > On Spark 1.6.2: > {code} > [680|12:47:26] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10005/ -n hive > Connecting to jdbc:hive2://localhost:10005/ > 16/10/10 12:47:29 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/10 12:47:29 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/10 12:47:30 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/ > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1.spark2 by Apache Hive > 0: jdbc:hive2://localhost:10005/> show schemas; > 
++--+ > | result | > ++--+ > | default| > | spark_jira | > | spark_looker_test | > | spark_scratch | > ++--+ > 4 rows selected (0.613 seconds) > 0: jdbc:hive2://localhost:10005/> show tables in spark_looker_test; > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.575 seconds) > 0: jdbc:hive2://localhost:10005/> show tables in badschema; > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.458 seconds) > {code} > [Relevant part of Hive QL > docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
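The behavioral difference shown in the two beeline transcripts above can be modeled in a few lines: Spark 1.x effectively returned an empty result set for an unknown schema, while 2.0 raises `NoSuchDatabaseException`. A toy sketch in plain Python (the `show_tables` helper and dict-based catalog are illustrative, not Spark code):

```python
class NoSuchDatabaseException(Exception):
    """Stand-in for org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException."""
    pass

def show_tables(catalog, db, legacy=False):
    """Return table names in `db`. In legacy (1.x-like) mode an unknown
    database yields an empty list; otherwise it raises, as in Spark 2.0."""
    if db not in catalog:
        if legacy:
            return []  # Spark 1.x: empty result set
        raise NoSuchDatabaseException(f"Database '{db}' not found")
    return sorted(catalog[db])

catalog = {"spark_looker_test": {"orders", "users", "order_items", "all_types"}}

assert show_tables(catalog, "badschema", legacy=True) == []  # 1.6.2 behavior
try:
    show_tables(catalog, "badschema")  # 2.0.1 behavior
except NoSuchDatabaseException:
    pass
```

Callers walking the schema via SHOW/DESCRIBE (as described in the comment) must therefore catch the exception on 2.x, exactly the workaround the reporter mentions.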
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563862#comment-15563862 ] Shivaram Venkataraman commented on SPARK-17781: --- I'm not sure I follow. The class of the elements inside should remain `Date` ? {code} > l <- lapply(1:2, function(x) { Sys.Date() }) > class(l[[1]]) [1] "Date" > class(l[[2]]) [1] "Date" {code} > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563839#comment-15563839 ] Mike Dusenberry commented on SPARK-4630: It would be really nice to revisit this issue, perhaps even updated to just focus on {{DataSet}} given the current direction of the project. Basically, I would be really interested in Smart DataSet/DataFrame/RDD repartitioning that automatically decides the proper number of partitions based on characteristics of the data (i.e. width) and the cluster. From my experience, the outcome of a "wrong" number of partitions is frequent OOM errors, 2GB partition limits (although that's been lifted in 2.x, it's still a perf issue), and if you save a DataFrame/DataSet to, say, Parquet format with too few partitions, the individual compressed files may be too large to read later on (say if a partition is 2GB, but there isn't enough executor memory to open that when working with the files later on with a different Spark setup -- think perhaps a batch preprocessing cluster vs. production serving cluster). > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions to the > number of tasks - larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large partitions should be that get executed > by a task. If partitions are too small, then the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, then task > execution slows down due to gc pressure and spilling to disk. > To increase performance of jobs, users often hand optimize the number(size) > of partitions that the next stage gets. 
Factors that come into play are: > Incoming partition sizes from previous stage > number of available executors > available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to automatically do the > partition sizing for the user. This feature can be turned off/on with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take into > account partition sizes upon stage completion. Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number of tasks to create. Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
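The proposal boils down to picking a task count from the measured output size of the previous stage so that each partition lands near a target size. A back-of-the-envelope version of that sizing rule (the helper name, the 128 MB target, and the clamping bounds are illustrative assumptions, not anything from the design doc):

```python
import math

def pick_num_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024,
                        min_partitions=1, max_partitions=10000):
    """Choose a partition count so each partition is near a target size,
    clamped to sane bounds. Illustrative only; the actual DAGScheduler
    change would also weigh executor count and available memory."""
    n = math.ceil(total_bytes / target_partition_bytes)
    return max(min_partitions, min(n, max_partitions))

# 10 GB of shuffle output at a 128 MB target -> 80 partitions
assert pick_num_partitions(10 * 1024**3) == 80
# Tiny or empty stages still get at least one partition
assert pick_num_partitions(0) == 1
```

This is also the shape of the manual tuning users do today with `repartition(n)`: estimate total bytes, divide by a size that fits comfortably in executor memory, and round up.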
[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563839#comment-15563839 ] Mike Dusenberry edited comment on SPARK-4630 at 10/10/16 11:10 PM: --- It would be really nice to revisit this issue, perhaps even updated to just focus on {{DataSet}} given the current direction of the project. Basically, I would be really interested in smart DataSet/DataFrame/RDD repartitioning that automatically decides the proper number of partitions based on characteristics of the data (i.e. width) and the cluster. From my experience, the outcome of a "wrong" number of partitions is frequent OOM errors, 2GB partition limits (although that's been lifted in 2.x, it's still a perf issue), and if you save a DataFrame/DataSet to, say, Parquet format with too few partitions, the individual compressed files may be too large to read later on (say if a partition is 2GB, but there isn't enough executor memory to open that when working with the files later on with a different Spark setup -- think perhaps a batch preprocessing cluster vs. production serving cluster). was (Author: mwdus...@us.ibm.com): It would be really nice to revisit this issue, perhaps even updated to just focus on {{DataSet}} given the current direction of the project. Basically, I would be really interested in Smart DataSet/DataFrame/RDD repartitioning that automatically decides the proper number of partitions based on characteristics of the data (i.e. width) and the cluster. 
From my experience, the outcome of a "wrong" number of partitions is frequent OOM errors, 2GB partition limits (although that's been lifted in 2.x, it's still a perf issue), and if you save a DataFrame/DataSet to, say, Parquet format with too few partitions, the individual compressed files may be too large to read later on (say if a partition is 2GB, but there isn't enough executor memory to open that when working with the files later on with a different Spark setup -- think perhaps a batch preprocessing cluster vs. production serving cluster). > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > > Partition sizes play a big part in how fast stages execute during a Spark > job. There is a direct relationship between the size of partitions to the > number of tasks - larger partitions, fewer tasks. For better performance, > Spark has a sweet spot for how large partitions should be that get executed > by a task. If partitions are too small, then the user pays a disproportionate > cost in scheduling overhead. If the partitions are too large, then task > execution slows down due to gc pressure and spilling to disk. > To increase performance of jobs, users often hand optimize the number(size) > of partitions that the next stage gets. Factors that come into play are: > Incoming partition sizes from previous stage > number of available executors > available memory per executor (taking into account > spark.shuffle.memoryFraction) > Spark has access to this data and so should be able to automatically do the > partition sizing for the user. This feature can be turned off/on with a > configuration option. > To make this happen, we propose modifying the DAGScheduler to take into > account partition sizes upon stage completion. 
Before scheduling the next > stage, the scheduler can examine the sizes of the partitions and determine > the appropriate number of tasks to create. Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17859) persist should not impede with spark's ability to perform a broadcast join.
Franck Tago created SPARK-17859: --- Summary: persist should not impede with spark's ability to perform a broadcast join. Key: SPARK-17859 URL: https://issues.apache.org/jira/browse/SPARK-17859 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.0.0 Environment: spark 2.0.0 , Linux RedHat Reporter: Franck Tago I am using Spark 2.0.0 My investigation leads me to conclude that calling persist could prevent broadcast join from happening . Example Case1: No persist call var df1 =spark.range(100).select($"id".as("id1")) df1: org.apache.spark.sql.DataFrame = [id1: bigint] var df2 =spark.range(1000).select($"id".as("id2")) df2: org.apache.spark.sql.DataFrame = [id2: bigint] df1.join(df2 , $"id1" === $"id2" ).explain == Physical Plan == *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight :- *Project [id#114L AS id1#117L] : +- *Range (0, 100, splits=2) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) +- *Project [id#120L AS id2#123L] +- *Range (0, 1000, splits=2) Case 2: persist call df1.persist.join(df2 , $"id1" === $"id2" ).explain 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data. == Physical Plan == *SortMergeJoin [id1#3L], [id2#9L], Inner :- *Sort [id1#3L ASC], false, 0 : +- Exchange hashpartitioning(id1#3L, 10) : +- InMemoryTableScan [id1#3L] :: +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) :: : +- *Project [id#0L AS id1#3L] :: : +- *Range (0, 100, splits=2) +- *Sort [id2#9L ASC], false, 0 +- Exchange hashpartitioning(id2#9L, 10) +- InMemoryTableScan [id2#9L] : +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas) : : +- *Project [id#6L AS id2#9L] : : +- *Range (0, 1000, splits=2) Why does the persist call prevent the broadcast join . My opinion is that it should not . I was made aware that the persist call is lazy and that might have something to do with it , but I still contend that it should not . 
Losing broadcast joins is really costly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
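The report above is consistent with a planner that decides the join strategy from a size estimate of the build side: an uncached `Range`-based frame has a small, known size and qualifies for broadcast, while the `InMemoryRelation` introduced by the lazy `persist` reports an estimate above the broadcast threshold (or no usable estimate at all), so the planner falls back to sort-merge join. A toy model of that decision (the helper and sizes are hypothetical, not Spark's planner; the 10 MB constant mirrors the default of `spark.sql.autoBroadcastJoinThreshold`):

```python
BROADCAST_THRESHOLD = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold

def choose_join(build_side_size_estimate):
    """Pick a join strategy from the build side's estimated size in bytes.
    None models an unavailable estimate, e.g. a cached plan that has not
    been materialized yet."""
    if build_side_size_estimate is not None and build_side_size_estimate <= BROADCAST_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

assert choose_join(800) == "BroadcastHashJoin"  # small Range: broadcast
assert choose_join(None) == "SortMergeJoin"     # no usable estimate: merge join
```

Users who hit this can usually force the cheap plan back with an explicit hint, e.g. `df1.persist().join(broadcast(df2), ...)` in the DataFrame API, though whether `persist` should change the estimate in the first place is exactly what this issue questions.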
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Description: In Spark 2.0, corrupt files will fail a SQL query. However, the user may just want to skip corrupt files and still run the query. Another painful thing is that the current exception doesn't contain the paths of the corrupt files, making it hard for the user to fix their files. Note: In Spark 1.6, Spark SQL always skips corrupt files because of SPARK-17850. was: In Spark 2.0, corrupt files will fail a job. However, the user may just want to skip corrupt files and still run the query. Another painful thing is that the current exception doesn't contain the paths of the corrupt files, making it hard for the user to fix their files. > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a SQL query. However, the user may just > want to skip corrupt files and still run the query. > Another painful thing is that the current exception doesn't contain the paths > of the corrupt files, making it hard for the user to fix their files. > Note: In Spark 1.6, Spark SQL always skips corrupt files because of > SPARK-17850. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
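The requested behavior, skip corrupt files but report which paths were skipped so the user can fix them, can be sketched outside Spark with Python's gzip module. The `read_lines_skipping_corrupt` helper below is hypothetical, not the Spark option itself (the option that eventually landed in Spark is, to my knowledge, `spark.sql.files.ignoreCorruptFiles`):

```python
import gzip
import os
import tempfile

def read_lines_skipping_corrupt(paths):
    """Return (lines, corrupt_paths): lines from readable gzip files, plus
    the paths of files that could not be fully decompressed."""
    lines, corrupt = [], []
    for path in paths:
        try:
            with gzip.open(path, "rt") as f:
                lines.extend(f.read().splitlines())
        except (OSError, EOFError):  # truncated or otherwise corrupt gzip
            corrupt.append(path)
    return lines, corrupt

# Build one valid and one truncated (corrupt) gzip file.
tmp = tempfile.mkdtemp()
good, bad = os.path.join(tmp, "good.gz"), os.path.join(tmp, "bad.gz")
with gzip.open(good, "wt") as f:
    f.write("a\nb\n")
with open(good, "rb") as f:
    blob = f.read()
with open(bad, "wb") as f:
    f.write(blob[: len(blob) // 2])  # cut the stream mid-deflate

lines, corrupt = read_lines_skipping_corrupt([good, bad])
assert lines == ["a", "b"]
assert corrupt == [bad]
```

Collecting the corrupt paths addresses the second complaint in the description: today's exception doesn't tell the user which files to fix.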
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Description: In Spark 2.0, corrupt files will fail a job. However, the user may just want to skip corrupt files and still run the query. Another painful thing is that the current exception doesn't contain the paths of the corrupt files, making it hard for the user to fix their files. was:In Spark 2.0, corrupt files will fail a job. However, the user may not > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a job. However, the user may just want > to skip corrupt files and still run the query. > Another painful thing is that the current exception doesn't contain the paths > of the corrupt files, making it hard for the user to fix their files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Issue Type: Improvement (was: Bug) > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a job. However, the user may not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
[ https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17858: - Description: In Spark 2.0, corrupt files will fail a job. However, the user may not > Provide option for Spark SQL to skip corrupt files > -- > > Key: SPARK-17858 > URL: https://issues.apache.org/jira/browse/SPARK-17858 > Project: Spark > Issue Type: Bug >Reporter: Shixiong Zhu > > In Spark 2.0, corrupt files will fail a job. However, the user may not -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17858) Provide option for Spark SQL to skip corrupt files
Shixiong Zhu created SPARK-17858: Summary: Provide option for Spark SQL to skip corrupt files Key: SPARK-17858 URL: https://issues.apache.org/jira/browse/SPARK-17858 Project: Spark Issue Type: Bug Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17850) HadoopRDD should not swallow EOFException
[ https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17850: Assignee: (was: Apache Spark) > HadoopRDD should not swallow EOFException > - > > Key: SPARK-17850 > URL: https://issues.apache.org/jira/browse/SPARK-17850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1 >Reporter: Shixiong Zhu > Labels: correctness > > The code in > https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256 > catches EOFException and mark RecordReader finished. However, in some cases, > RecordReader will throw EOFException to indicate the stream is corrupted. See > the following stack trace as an example: > {code} > Caused by: java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted > gzip files). > Note: NewHadoopRDD doesn't have this issue. > This is reported by Bilal Aslam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
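The bug pattern above is a reader loop that treats every EOFException as a normal end of input, which silently truncates corrupt files instead of failing. The safer shape is to reserve a dedicated signal for a clean end of stream and let mid-record corruption propagate. A small sketch in plain Python (the reader API and exception name are hypothetical, not the HadoopRDD code):

```python
class CorruptStreamError(Exception):
    """Models an EOFException thrown mid-record by a RecordReader."""
    pass

def records(read_one):
    """Drive a reader callable: read_one() returns a record, returns None
    at a clean end of input, or raises CorruptStreamError mid-record.
    Unlike the HadoopRDD code in question, corruption is NOT swallowed."""
    while True:
        rec = read_one()
        if rec is None:  # clean end-of-stream sentinel
            return
        yield rec

# Clean stream: all records come through, iteration ends normally.
clean = iter(["r1", "r2", None])
assert list(records(lambda: next(clean))) == ["r1", "r2"]

# Corrupt stream: the error propagates and fails the job, as it should.
corrupt = iter(["r1"])
def read_corrupt():
    try:
        return next(corrupt)
    except StopIteration:
        raise CorruptStreamError("unexpected end of input stream")

try:
    list(records(read_corrupt))
    raise AssertionError("should have raised")
except CorruptStreamError:
    pass
```

A blanket `catch (EOFException)` collapses these two cases into one, which is why corrupted gzip inputs were read as if they ended cleanly.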
[jira] [Assigned] (SPARK-17850) HadoopRDD should not swallow EOFException
[ https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17850: Assignee: Apache Spark > HadoopRDD should not swallow EOFException > - > > Key: SPARK-17850 > URL: https://issues.apache.org/jira/browse/SPARK-17850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Apache Spark > Labels: correctness > > The code in > https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256 > catches EOFException and mark RecordReader finished. However, in some cases, > RecordReader will throw EOFException to indicate the stream is corrupted. See > the following stack trace as an example: > {code} > Caused by: java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted > gzip files). > Note: NewHadoopRDD doesn't have this issue. > This is reported by Bilal Aslam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17850) HadoopRDD should not swallow EOFException
[ https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15563774#comment-15563774 ] Apache Spark commented on SPARK-17850: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15422 > HadoopRDD should not swallow EOFException > - > > Key: SPARK-17850 > URL: https://issues.apache.org/jira/browse/SPARK-17850 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1 >Reporter: Shixiong Zhu > Labels: correctness > > The code in > https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256 > catches EOFException and mark RecordReader finished. However, in some cases, > RecordReader will throw EOFException to indicate the stream is corrupted. See > the following stack trace as an example: > {code} > Caused by: java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted > gzip files). > Note: NewHadoopRDD doesn't have this issue. > This is reported by Bilal Aslam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org