[jira] [Commented] (SPARK-8021) DataFrameReader/Writer in Python does not match Scala
[ https://issues.apache.org/jira/browse/SPARK-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568672#comment-14568672 ]

Apache Spark commented on SPARK-8021:

User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6578

DataFrameReader/Writer in Python does not match Scala

Key: SPARK-8021
URL: https://issues.apache.org/jira/browse/SPARK-8021
Project: Spark
Issue Type: Sub-task
Affects Versions: 1.4.0
Reporter: Michael Armbrust
Assignee: Davies Liu
Priority: Blocker

When doing {{sqlContext.read.format("json").load(...)}} I get {{AttributeError: 'DataFrameReader' object has no attribute 'format'}}. These APIs should match up so that examples we give in documentation and slides can be used in any language.
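For reference, a minimal sketch (not Spark's actual implementation) of the chainable reader API the issue asks for; the class shape, the delegation to sqlContext.load() with a source= keyword, and the example path are illustrative assumptions only:

{code}
# Hypothetical sketch of a chainable DataFrameReader for PySpark.
class DataFrameReader(object):
    def __init__(self, sqlContext):
        self._sqlContext = sqlContext
        self._format = None

    def format(self, source):
        # Remember the data source name and return self so calls chain.
        self._format = source
        return self

    def load(self, path):
        # Hand off to the context's generic load (sketch only).
        return self._sqlContext.load(path, source=self._format)

# Intended usage, matching the Scala API:
#   df = sqlContext.read.format("json").load("people.json")
{code}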
[jira] [Created] (SPARK-8032) Make version checking in mllib/__init__.py more robust for NumPy 1.10
Manoj Kumar created SPARK-8032:

Summary: Make version checking in mllib/__init__.py more robust for NumPy 1.10
Key: SPARK-8032
URL: https://issues.apache.org/jira/browse/SPARK-8032
Project: Spark
Issue Type: Bug
Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Manoj Kumar

The current check compares version strings, verifying that `1.x` is less than `1.4`. This fails when x has more than one digit: for NumPy 1.10, x >= 4 numerically, yet the string `1.10` still compares less than `1.4`.
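A digit-aware comparison avoids the problem. The sketch below is illustrative (the helper name and the hard-coded minimum are assumptions, not the actual patch):

{code}
# Compare numeric version components instead of raw strings:
# '1.10' < '1.4' lexicographically, but (1, 10) >= (1, 4) as tuples.
def at_least(version, required=(1, 4)):
    parts = tuple(int(p) for p in version.split('.')[:2])
    return parts >= required

assert at_least('1.4')
assert at_least('1.10')       # the NumPy 1.10 case that breaks string checks
assert not at_least('1.3.1')
{code}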
[jira] [Assigned] (SPARK-8021) DataFrameReader/Writer in Python does not match Scala
[ https://issues.apache.org/jira/browse/SPARK-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8021:
Assignee: Apache Spark (was: Davies Liu)
[jira] [Assigned] (SPARK-8021) DataFrameReader/Writer in Python does not match Scala
[ https://issues.apache.org/jira/browse/SPARK-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8021:
Assignee: Davies Liu (was: Apache Spark)
[jira] [Created] (SPARK-8034) spark-sql security authorization bug
nilone created SPARK-8034:

Summary: spark-sql security authorization bug
Key: SPARK-8034
URL: https://issues.apache.org/jira/browse/SPARK-8034
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.2.1, 1.3.0, 1.3.1
Reporter: nilone

I tried to use beeline against the Thrift JDBC server for an authorization test, with these parameters added to hive-site.xml:

hive.security.authorization.enabled: true
hive.security.authorization.createtable.owner.grants: select,alter,drop

1. SELECT privileges cannot be controlled: anyone can select any table created by other users (applies to Spark 1.1, 1.2, and 1.3).
2. When tables are created from different beeline clients under different user names, the server writes the wrong owner name into the Hive metastore table 'TBLS': it always records the first user that performed a CREATE TABLE. DROP and ALTER privileges between users cannot be controlled either. (This bug affects versions after Spark 1.2; Spark 1.1 is fine.)
[jira] [Closed] (SPARK-8035) 2
[ https://issues.apache.org/jira/browse/SPARK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nilone closed SPARK-8035.
Resolution: Invalid

2

Key: SPARK-8035
URL: https://issues.apache.org/jira/browse/SPARK-8035
Project: Spark
Issue Type: Bug
Reporter: nilone
[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568720#comment-14568720 ]

Animesh Baranawal commented on SPARK-7980:

Regarding the Python support for range, I am unable to check the functioning in pyspark. Even for the pre-defined range function in context.py, when I type the following in ./bin/pyspark:

{code}
sqlContext.range(1, 7, 2).collect()
{code}

I get the error:

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: range() takes exactly 2 arguments (4 given)
{code}

Support SQLContext.range(end)

Key: SPARK-7980
URL: https://issues.apache.org/jira/browse/SPARK-7980
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

SQLContext.range should also allow only specifying the end position, similar to Python's own range.
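For illustration, here is how an end-only overload could normalize its arguments, mirroring Python's built-in range(); the helper name and the assumption about the existing signature are illustrative, not the final patch:

{code}
# Sketch: a single positional argument is treated as the exclusive end,
# like Python's built-in range(5) meaning range(0, 5).
def normalize_range_args(start, end=None, step=1):
    if end is None:
        start, end = 0, start
    return start, end, step

assert normalize_range_args(5) == (0, 5, 1)
assert normalize_range_args(1, 7, 2) == (1, 7, 2)
{code}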
[jira] [Comment Edited] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567187#comment-14567187 ]

Rick Moritz edited comment on SPARK-6816 at 6/2/15 8:55 AM:

One current drawback of SparkR's configuration options is the inability to set driver VM options. These are crucial when attempting to run SparkR on a Hortonworks HDP, as both the driver and the application master need to be aware of the hdp.version variable in order to resolve the classpath. While it is possible to pass this variable to the executors, there is no way to pass it to the driver, except for the following exploit/workaround: the SPARK_MEM variable can be abused to pass the required parameters to the driver's VM via string concatenation. Setting the variable to, e.g., "512m -Dhdp.version=NNN" appends the -D option to the -X option that is currently read from this environment variable. A far more obvious and less hacky approach would be a secondary variable in System.env that gets parsed for JVM options, or a separate environment list for the driver, extending what is currently available for executors. I'm adding this as a comment to this issue, since I believe it is sufficiently closely related not to warrant a separate issue.

was (Author: rpcmoritz): (identical to the text above)

Add SparkConf API to configure SparkR

Key: SPARK-6816
URL: https://issues.apache.org/jira/browse/SPARK-6816
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier.
[jira] [Assigned] (SPARK-8038) PySpark SQL when function is broken on Column
[ https://issues.apache.org/jira/browse/SPARK-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8038:
Assignee: Apache Spark

PySpark SQL when function is broken on Column

Key: SPARK-8038
URL: https://issues.apache.org/jira/browse/SPARK-8038
Project: Spark
Issue Type: Bug
Affects Versions: 1.4.0
Environment: Spark 1.4.0 RC3
Reporter: Olivier Girardot
Assignee: Apache Spark
Priority: Blocker

{code}
In [1]: df = sqlCtx.createDataFrame([(1, 1), (2, 2), (1, 2), (1, 2)], ["key", "value"])

In [2]: from pyspark.sql import functions as F

In [8]: df.select(df.key, F.when(df.key > 1, 0).when(df.key == 0, 2).otherwise(1)).show()
+---+---------------------------------+
|key|CASE WHEN (key = 0) THEN 2 ELSE 1|
+---+---------------------------------+
|  1|                                1|
|  2|                                1|
|  1|                                1|
|  1|                                1|
+---+---------------------------------+
{code}

Whereas in Scala I get the expected expression and behaviour:

{code}
scala> val df = sqlContext.createDataFrame(List((1, 1), (2, 2), (1, 2), (1, 2))).toDF("key", "value")

scala> import org.apache.spark.sql.functions._

scala> df.select(df("key"), when(df("key") > 1, 0).when(df("key") === 2, 2).otherwise(1)).show()
+---+-------------------------------------------------------+
|key|CASE WHEN (key > 1) THEN 0 WHEN (key = 2) THEN 2 ELSE 1|
+---+-------------------------------------------------------+
|  1|                                                      1|
|  2|                                                      0|
|  1|                                                      1|
|  1|                                                      1|
+---+-------------------------------------------------------+
{code}

This is coming from the column.py file, with the Column class definition of **when**, and the fix is coming.
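To see why chaining must preserve earlier branches, here is a toy CASE WHEN builder (a sketch, not Spark's column.py code); the output above suggests PySpark's Column.when was starting a fresh expression instead of extending the branch list:

{code}
# Toy chainable CASE WHEN builder: when() appends to the branch list and
# returns self, so no earlier branch is lost.
class CaseWhen(object):
    def __init__(self):
        self.branches = []   # (condition, value) pairs, in order
        self.default = None

    def when(self, condition, value):
        self.branches.append((condition, value))
        return self

    def otherwise(self, value):
        self.default = value
        return self

expr = CaseWhen().when('key > 1', 0).when('key = 2', 2).otherwise(1)
assert expr.branches == [('key > 1', 0), ('key = 2', 2)]
assert expr.default == 1
{code}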
[jira] [Assigned] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8004:
Assignee: Apache Spark

Spark does not enclose column names when fetching from jdbc sources

Key: SPARK-8004
URL: https://issues.apache.org/jira/browse/SPARK-8004
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Rene Treffer
Assignee: Apache Spark

Spark fails to load tables that have a keyword as a column name. Sample error:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 (TID 4322, localhost): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key,value FROM [XX]'
{code}

A correct query would have been:

{code}
SELECT `key`,`value` FROM ...
{code}
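The fix amounts to quoting each identifier when building the projection. A minimal sketch, assuming MySQL-style backtick quoting (the dialect-specific quote character and the helper name are assumptions):

{code}
# Build "SELECT `key`,`value` FROM kv" from a column list, escaping any
# embedded backticks by doubling them.
def quoted_select(columns, table):
    quoted = ','.join('`%s`' % c.replace('`', '``') for c in columns)
    return 'SELECT %s FROM %s' % (quoted, table)

assert quoted_select(['key', 'value'], 'kv') == 'SELECT `key`,`value` FROM kv'
{code}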
[jira] [Commented] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568666#comment-14568666 ]

Apache Spark commented on SPARK-8004:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6577
[jira] [Assigned] (SPARK-8004) Spark does not enclose column names when fetching from jdbc sources
[ https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8004:
Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-8035) 2
nilone created SPARK-8035:

Summary: 2
Key: SPARK-8035
URL: https://issues.apache.org/jira/browse/SPARK-8035
Project: Spark
Issue Type: Bug
Reporter: nilone
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568726#comment-14568726 ]

Saisai Shao commented on SPARK-4352:

Hi [~sandyr], I have a proposal based on ratios to calculate node locality, which covers all situations, even at run time under dynamic allocation. Say we have 300 tasks: 200 tasks preferring nodes a, b, c and 100 tasks preferring nodes a, b, d. The node locality ratio for a : b : c : d is therefore 300 : 300 : 200 : 100.

Now suppose we need to allocate 10 executors. According to that distribution, we calculate the best placement of the 10 executors from the ratio above: 300 * 10 / 300 : 300 * 10 / 300 : 200 * 10 / 300 : 100 * 10 / 300 = 10 : 10 : 7 : 4, rounding up to integers. We then request:

4 executors with preference (a, b, c, d)
3 executors with preference (a, b, c)
3 executors with preference (a, b)

The probability for a and b is highest and for d is lowest, basically following the distribution of the data. If we request 1 executor, this becomes {{300 * 1 / 300 : 300 * 1 / 300 : 200 * 1 / 300 : 100 * 1 / 300 = 1 : 1 : 1 : 1}}, so each node has an equal chance of receiving the executor.

If {{task number <= executor number * cores}}, which means resources exceed demand, both the method above and this ratio-based method are OK, since they will by chance produce the same result; but the ratio-based implementation does not need to treat this as a special case, as the algorithm is the same in every situation.

If some nodes already have executors allocated, say 3 : 3 : 0 : 0 on nodes a, b, c, d, and we still need to request 10 executors, the original ratio is 3 : 3 : 2 : 1, so with equal probability we would end up with 10 executors distributed 3 : 3 : 2 : 2 across a, b, c, d. Since we already have 3 executors on a and b, we actually only need 4 executors on c and d to satisfy the ratio, leaving 6 for a, b, c, d to increase the executor count equally (since at that point the desired distribution is already satisfied).

What do you think about this algorithm? It is fairly general; one concern is that it does not take core counts into consideration.

Incorporate locality preferences in dynamic allocation requests

Key: SPARK-4352
URL: https://issues.apache.org/jira/browse/SPARK-4352
Project: Spark
Issue Type: Improvement
Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
Attachments: Supportpreferrednodelocationindynamicallocation.pdf

Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager.
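A worked sketch of the arithmetic in the proposal (the function is illustrative only): task counts per node are scaled by the requested executor count over the largest per-node count, rounding up:

{code}
import math

# Reproduce the 300 : 300 : 200 : 100 example from the comment above.
def locality_ratio(task_counts, num_executors):
    top = max(task_counts.values())
    return dict((node, int(math.ceil(float(n) * num_executors / top)))
                for node, n in task_counts.items())

counts = {'a': 300, 'b': 300, 'c': 200, 'd': 100}
assert locality_ratio(counts, 10) == {'a': 10, 'b': 10, 'c': 7, 'd': 4}
assert locality_ratio(counts, 1) == {'a': 1, 'b': 1, 'c': 1, 'd': 1}
{code}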
[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568722#comment-14568722 ]

Animesh Baranawal commented on SPARK-7980:

(Duplicate of the comment above; deleted in the following message.)
[jira] [Issue Comment Deleted] (SPARK-7980) Support SQLContext.range(end)
[ https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Animesh Baranawal updated SPARK-7980:
Comment: was deleted (was: the duplicate comment quoted above)
[jira] [Created] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Cheng Lian created SPARK-8037:

Summary: Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Key: SPARK-8037
URL: https://issues.apache.org/jira/browse/SPARK-8037
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

Temporary files like {{.DS_Store}} generated by the Mac OS X Finder may cause trouble for partition discovery. A directory whose layout looks like the following:

{noformat}
$ find parquet_partitioned
parquet_partitioned
parquet_partitioned/._common_metadata.crc
parquet_partitioned/._metadata.crc
parquet_partitioned/._SUCCESS.crc
parquet_partitioned/_common_metadata
parquet_partitioned/_metadata
parquet_partitioned/_SUCCESS
parquet_partitioned/year=2014/.DS_Store
parquet_partitioned/year=2014/month=9
parquet_partitioned/year=2014/month=9/.DS_Store
parquet_partitioned/year=2014/month=9/day=1/.DS_Store
parquet_partitioned/year=2014/month=9/day=1/.part-r-8.gz.parquet.crc
parquet_partitioned/year=2014/month=9/day=1/part-r-8.gz.parquet
parquet_partitioned/year=2015
parquet_partitioned/year=2015/month=10
parquet_partitioned/year=2015/month=10/day=25
parquet_partitioned/year=2015/month=10/day=25/.part-r-2.gz.parquet.crc
parquet_partitioned/year=2015/month=10/day=25/.part-r-4.gz.parquet.crc
parquet_partitioned/year=2015/month=10/day=25/part-r-2.gz.parquet
parquet_partitioned/year=2015/month=10/day=25/part-r-4.gz.parquet
parquet_partitioned/year=2015/month=10/day=26
parquet_partitioned/year=2015/month=10/day=26/.part-r-5.gz.parquet.crc
parquet_partitioned/year=2015/month=10/day=26/part-r-5.gz.parquet
parquet_partitioned/year=2015/month=9
parquet_partitioned/year=2015/month=9/day=1
parquet_partitioned/year=2015/month=9/day=1/.part-r-7.gz.parquet.crc
parquet_partitioned/year=2015/month=9/day=1/part-r-7.gz.parquet
{noformat}

causes an exception like this:

{noformat}
scala> val df = sqlContext.read.parquet("parquet_partitioned")
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
ArrayBuffer(year, month)
ArrayBuffer(year)
ArrayBuffer(year, month, day)
  at scala.Predef$.assert(Predef.scala:179)
  at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:189)
  at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:87)
  at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:492)
  at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:449)
  at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:448)
{noformat}

This is because {{.DS_Store}} files are treated as data files.
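The filtering rule itself is simple; a sketch in Python terms (the actual patch lives in HadoopFsRelation's Scala file enumeration, and treating underscore-prefixed files as non-data follows the existing Hadoop convention, an assumption here):

{code}
import os

# A path is a data file only if its base name is not hidden (leading '.')
# and not a Hadoop marker/summary file (leading '_').
def is_data_file(path):
    name = os.path.basename(path)
    return not name.startswith('.') and not name.startswith('_')

assert not is_data_file('year=2014/.DS_Store')
assert not is_data_file('parquet_partitioned/_SUCCESS')
assert is_data_file('year=2014/month=9/day=1/part-r-8.gz.parquet')
{code}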
[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568807#comment-14568807 ]

Rick Moritz commented on SPARK-6816:

[~shivaram], I am integrating SparkR into an RStudio server (which I would expect to be a rather common use case), so using bin/sparkR won't work in this case, as far as I can tell. Thanks for the suggestion nonetheless.
[jira] [Assigned] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8032:
Assignee: (was: Apache Spark)
[jira] [Closed] (SPARK-8034) spark-sql security authorization bug
[ https://issues.apache.org/jira/browse/SPARK-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nilone closed SPARK-8034.
Resolution: Invalid
[jira] [Created] (SPARK-8036) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Cheng Lian created SPARK-8036:

Summary: Ignores files whose name starts with . while enumerating files in HadoopFsRelation
Key: SPARK-8036
URL: https://issues.apache.org/jira/browse/SPARK-8036
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

(Description identical to SPARK-8037 above.)
[jira] [Resolved] (SPARK-8033) spark-sql thriftserver security authorization bugs!
[ https://issues.apache.org/jira/browse/SPARK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-8033.
Resolution: Duplicate

spark-sql thriftserver security authorization bugs!

Key: SPARK-8033
URL: https://issues.apache.org/jira/browse/SPARK-8033
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.2.1, 1.3.0, 1.3.1
Reporter: nilone

(Description identical to SPARK-8034 above.)
[jira] [Commented] (SPARK-6988) Fix Spark SQL documentation for 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568705#comment-14568705 ]

Saurabh Santhosh commented on SPARK-6988:

Hey, can someone update the Spark documentation as well (for correct usage of DataFrames)? https://spark.apache.org/docs/latest/sql-programming-guide.html

E.g.:

{code}
DataFrame teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();
{code}

needs teenagers.map changed to teenagers.javaRDD().map.

Fix Spark SQL documentation for 1.3.x

Key: SPARK-6988
URL: https://issues.apache.org/jira/browse/SPARK-6988
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Olivier Girardot
Assignee: Olivier Girardot
Priority: Minor
Fix For: 1.3.2, 1.4.0

There are a few glitches regarding the DataFrame API usage in Java, the most important one being how to map a DataFrame result using the javaRDD method.
[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output
[ https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568717#comment-14568717 ]

Akhil Thatipamula commented on SPARK-7993:

I am planning to check whether the data type of a given column is primitive, and if it turns out to be non-primitive, to modify the string value produced by cell.toString. Is that legitimate?

Improve DataFrame.show() output

Key: SPARK-7993
URL: https://issues.apache.org/jira/browse/SPARK-7993
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Priority: Blocker
Labels: starter

1. Each column should be at minimum 3 characters wide. Right now if the widest value is 1, the column is just 1 char wide, which looks ugly. Example below.

2. If a DataFrame has more than N rows (N = 20 by default for show), we should display a message at the end like "only showing top 20 rows".

{code}
+--+--+-+
| a| b|c|
+--+--+-+
| 1| 2|3|
| 1| 2|1|
| 1| 2|3|
| 3| 6|3|
| 1| 2|3|
| 5|10|1|
| 1| 2|3|
| 7|14|3|
| 1| 2|3|
| 9|18|1|
| 1| 2|3|
|11|22|3|
| 1| 2|3|
|13|26|1|
| 1| 2|3|
|15|30|3|
| 1| 2|3|
|17|34|1|
| 1| 2|3|
|19|38|3|
+--+--+-+
only showing top 20 rows   <- add this at the end
{code}

3. For array values, instead of printing ArrayBuffer, we should just print square brackets:

{code}
+------------------+------------------+-----------------+
|       a_freqItems|       b_freqItems|      c_freqItems|
+------------------+------------------+-----------------+
|ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
+------------------+------------------+-----------------+
{code}

should be

{code}
+-----------+-----------+-----------+
|a_freqItems|b_freqItems|c_freqItems|
+-----------+-----------+-----------+
|    [11, 1]|    [2, 22]|     [1, 3]|
+-----------+-----------+-----------+
{code}
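A sketch of the cell formatting items 1 and 3 describe (illustrative Python, not the Scala showString implementation): bracketed rendering for sequence values plus a 3-character minimum column width:

{code}
# Render sequence cells as "[a, b]" and pad every cell to >= 3 chars.
def format_cell(value, width=3):
    if isinstance(value, (list, tuple)):
        text = '[' + ', '.join(str(v) for v in value) + ']'
    else:
        text = str(value)
    return text.rjust(max(width, len(text)))

assert format_cell([11, 1]) == '[11, 1]'
assert format_cell(1) == '  1'   # 3 chars wide instead of 1
{code}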
[jira] [Commented] (SPARK-8011) DecimalType is not a datatype
[ https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568823#comment-14568823 ]

Liang-Chi Hsieh commented on SPARK-8011:

Try DecimalType.Unlimited?

DecimalType is not a datatype

Key: SPARK-8011
URL: https://issues.apache.org/jira/browse/SPARK-8011
Project: Spark
Issue Type: Bug
Components: Java API, Spark Core
Affects Versions: 1.3.1
Reporter: Bipin Roshan Nag

When I run the following in spark-shell:

{code}
StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))
{code}

I get:

{code}
<console>:50: error: type mismatch;
 found   : org.apache.spark.sql.types.DecimalType.type
 required: org.apache.spark.sql.types.DataType
       StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))
{code}
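The same distinction applies in PySpark; a small sketch for reference (assuming pyspark.sql.types as shipped in the 1.3 line, where DecimalType is instantiated rather than used as a bare class):

{code}
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, DecimalType)

# Note the parentheses: an instance of the type is required, not the class.
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Value", DecimalType(), True),
])
{code}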
[jira] [Assigned] (SPARK-8038) PySpark SQL when function is broken on Column
[ https://issues.apache.org/jira/browse/SPARK-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8038:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-8038) PySpark SQL when function is broken on Column
[ https://issues.apache.org/jira/browse/SPARK-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568836#comment-14568836 ]

Apache Spark commented on SPARK-8038:

User 'ogirardot' has created a pull request for this issue: https://github.com/apache/spark/pull/6580
[jira] [Assigned] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8037:
Assignee: Cheng Lian (was: Apache Spark)
[jira] [Assigned] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8037:
Assignee: Apache Spark (was: Cheng Lian)
[jira] [Commented] (SPARK-8037) Ignores files whose name starts with . while enumerating files in HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568857#comment-14568857 ]

Apache Spark commented on SPARK-8037:

User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6581
[jira] [Commented] (SPARK-6988) Fix Spark SQL documentation for 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568704#comment-14568704 ]

Saurabh Santhosh commented on SPARK-6988:

(Near-verbatim duplicate of the comment above.)
[jira] [Reopened] (SPARK-8033) spark-sql thriftserver security authorization bugs!
[ https://issues.apache.org/jira/browse/SPARK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reopened SPARK-8033:
[jira] [Resolved] (SPARK-8033) spark-sql thriftserver security authorization bugs!
[ https://issues.apache.org/jira/browse/SPARK-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-8033.
Resolution: Fixed

[~nilone] Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark. You opened this twice, and the JIRA isn't quite correct.
[jira] [Created] (SPARK-8038) PySpark SQL when function is broken on Column
Olivier Girardot created SPARK-8038:

Summary: PySpark SQL when function is broken on Column
Key: SPARK-8038
URL: https://issues.apache.org/jira/browse/SPARK-8038
Project: Spark
Issue Type: Bug
Affects Versions: 1.4.0
Environment: Spark 1.4.0 RC3
Reporter: Olivier Girardot
Priority: Blocker

(Description identical to the one quoted under the assignment message above.)
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Huang updated SPARK-7893:

Description:

Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs*_ or _*merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in graphs; performance optimization can be done internally and be transparent to them. A complex graph operator list is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]

* Union of Graphs (G ∪ H)
* Intersection of Graphs (G ∩ H)
* Graph Join
* Difference of Graphs (G – H)
* Graph Complement
* Line Graph (L(G))

This issue will be the index of all these operators.

was: (same text as above, except the final paragraph read: "This issue will focus on two frequently-used operators first: *union* and *join*.")

Complex Operators between Graphs

Key: SPARK-7893
URL: https://issues.apache.org/jira/browse/SPARK-7893
Project: Spark
Issue Type: Improvement
Components: GraphX
Reporter: Andy Huang
Labels: complex, graph, join, operators, union
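As a set-level illustration of the first operator (independent of the GraphX API, which would operate on vertex and edge RDDs): the union of two graphs unions their vertex and edge sets, merging the attributes of vertices present in both graphs:

{code}
# Toy graph union: a graph is (vertex-dict, edge-set). Vertices present in
# both graphs have their attributes merged; edges are a plain set union.
def graph_union(g, h, merge=lambda a, b: a):
    vertices = dict(h[0])
    for v, attr in g[0].items():
        vertices[v] = merge(attr, vertices[v]) if v in vertices else attr
    edges = set(g[1]) | set(h[1])
    return vertices, edges

g = ({1: 'a', 2: 'b'}, {(1, 2)})
h = ({2: 'B', 3: 'c'}, {(2, 3)})
assert graph_union(g, h) == ({1: 'a', 2: 'b', 3: 'c'}, {(1, 2), (2, 3)})
{code}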
[jira] [Commented] (SPARK-7122) KafkaUtils.createDirectStream - unreasonable processing time in absence of load
[ https://issues.apache.org/jira/browse/SPARK-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568854#comment-14568854 ] Nicolas PHUNG commented on SPARK-7122: -- For _KafkaUtils.createStream_, jobs take between 13 ms and 0.3 s. In detail, stages take between 13 ms and 0.3 s and are split into 1 to 3 tasks. From the streaming page in the Spark UI, the processing time at the 75th percentile is 112 ms and the maximum is 358 ms. For _KafkaUtils.createDirectStream_, jobs take between 13 ms and 7 s. In detail, stages take between 13 ms and 7 s and are split into 275 to 400 tasks. My Kafka topic has 400 partitions, which may explain the task split with _KafkaUtils.createDirectStream_. But I don't understand why it falls behind whereas _KafkaUtils.createStream_ can keep up with the same _foreachrdd_ processing (I mean reprocessing everything from the beginning + keeping up with newer/recent events in Kafka). Of course, I'm using the same executor Spark configuration (cores/RAM) for both. Or maybe I'm doing something wrong somewhere. KafkaUtils.createDirectStream - unreasonable processing time in absence of load --- Key: SPARK-7122 URL: https://issues.apache.org/jira/browse/SPARK-7122 Project: Spark Issue Type: Question Components: Streaming Affects Versions: 1.3.1 Environment: Spark Streaming 1.3.1, standalone mode running on just 1 box: Ubuntu 14.04.2 LTS, 4 cores, 8GB RAM, java version 1.8.0_40 Reporter: Platon Potapov Priority: Minor Attachments: 10.second.window.fast.job.txt, 5.second.window.slow.job.txt, SparkStreamingJob.scala Attached is the complete source code of a test Spark job. No external data generators are run - just the presence of a Kafka topic named raw suffices. The Spark job is run with no load whatsoever. http://localhost:4040/streaming is checked to obtain the job processing duration. * in case the test contains the following transformation: {code}
// dummy transformation
val temperature = bytes.filter(_._1 == "abc")
val abc = temperature.window(Seconds(40), Seconds(5))
abc.print()
{code} the median processing time is 3 seconds 80 ms * in case the test contains the following transformation: {code}
// dummy transformation
val temperature = bytes.filter(_._1 == "abc")
val abc = temperature.map(x => (1, x))
abc.print()
{code} the median processing time is just 50 ms. Please explain why the window transformation introduces such a growth in job duration. note: the result is the same regardless of the number of Kafka topic partitions (I've tried 1 and 8) note2: the result is the same regardless of the window parameters (I've tried (20, 2) and (40, 5)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
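[Editor's note] On the partition-to-task relationship discussed above: with the direct stream, each Kafka partition maps to one Spark partition, so a 400-partition topic yields roughly 400 tasks per stage. A hedged sketch (assumes the Spark 1.4 PySpark Kafka API and an existing SparkContext {{sc}}; broker and topic names are placeholders):
{code}
# Sketch only: "broker:9092" and "raw" are placeholder names.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)
direct = KafkaUtils.createDirectStream(
    ssc, ["raw"], {"metadata.broker.list": "broker:9092"})
# When most partitions are empty, coalescing each batch can cut the
# per-stage task count the reporter observed (275 to 400 tasks).
fewer = direct.transform(lambda rdd: rdd.coalesce(8))
fewer.count().pprint()
{code}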
[jira] [Updated] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8032: --- Summary: Make NumPy version checking in mllib/__init__.py (was: Make version checking in mllib/__init__.py) Make NumPy version checking in mllib/__init__.py Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
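[Editor's note] The fix being discussed amounts to comparing numeric components instead of raw strings. A minimal sketch of such a check (standard library only; the function name and the (1, 4) minimum are illustrative, not the actual patch):
{code}
# "1.10" < "1.4" is True as strings, but (1, 10) < (1, 4) is correctly
# False as tuples -- which is the whole bug in miniature.
def _numpy_at_least(version, minimum=(1, 4)):
    major, minor = version.split('.')[:2]
    return (int(major), int(minor)) >= minimum

assert _numpy_at_least("1.4")
assert _numpy_at_least("1.10")      # the case the string compare gets wrong
assert not _numpy_at_least("1.3")
{code}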
[jira] [Commented] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568679#comment-14568679 ] Apache Spark commented on SPARK-8032: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/6579 Make NumPy version checking in mllib/__init__.py Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8032) Make NumPy version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8032: --- Assignee: Apache Spark Make NumPy version checking in mllib/__init__.py Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar Assignee: Apache Spark The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8023) Random Number Generation inconsistent in projections in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8023. Resolution: Fixed Fix Version/s: 1.4.0 Random Number Generation inconsistent in projections in DataFrame - Key: SPARK-8023 URL: https://issues.apache.org/jira/browse/SPARK-8023 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Burak Yavuz Assignee: Yin Huai Priority: Blocker Fix For: 1.4.0 to reproduce (in python): {code}
df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10))
df.select('uniform', 'uniform' + 1)
{code} You should see that the first column + 1 doesn't equal the second column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
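[Editor's note] As quoted, the second line of the snippet is shorthand (adding 1 to the string 'uniform' would not run in plain Python). A runnable variant of the reproduction, plus a possible mitigation, might look like this (PySpark 1.4 names; the cache() step is an assumption about a workaround, not the actual fix):
{code}
from pyspark.sql.functions import rand

df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10))
# Reproduction: before the fix the two projections could re-evaluate
# rand() independently, so these columns may disagree:
df.select(df.uniform, (df.uniform + 1).alias('uniform_plus_1')).show()
# Possible mitigation (assumption): materialise the random column once.
df.cache().count()
df.select(df.uniform, (df.uniform + 1).alias('uniform_plus_1')).show()
{code}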
[jira] [Created] (SPARK-8033) spark-sql thriftserver security authorization bugs!
nilone created SPARK-8033: - Summary: spark-sql thriftserver security authorization bugs! Key: SPARK-8033 URL: https://issues.apache.org/jira/browse/SPARK-8033 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1, 1.3.0, 1.2.1 Reporter: nilone I tried to use Beeline to access the Thrift JDBC server for an authorization test, and these params have been added to hive-site.xml: -- hive.security.authorization.enabled : true hive.security.authorization.createtable.owner.grants : select,alter,drop -- 1. Cannot control the select privilege: anyone can select any table created by other users (true for Spark 1.1, 1.2, and 1.3). 2. When creating tables from different Beeline clients under different user names, the server writes the wrong owner name into the Hive metastore table 'TBLS': it always writes the name of the first user who performed a create table operation. Drop and alter privileges also cannot be controlled between users. (This bug applies to versions after Spark 1.2; Spark 1.1 is OK.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
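[Editor's note] For reference, the reported settings in hive-site.xml form (property names and values exactly as listed in the report):
{code}
<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>select,alter,drop</value>
</property>
{code}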
[jira] [Comment Edited] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568726#comment-14568726 ] Saisai Shao edited comment on SPARK-4352 at 6/2/15 9:23 AM: Hi [~sandyr], I have a proposal based on ratios to calculate node locality which can cover all situations, even at run-time under dynamic allocation. Say we have 300 tasks: 200 tasks prefer nodes a, b, c and 100 tasks prefer nodes a, b, d, so the node locality ratio for a, b, c, d is 300 : 300 : 200 : 100. Now we need to allocate 10 executors, so according to this ratio distribution we can calculate the best placement of the 10 executors: 300 * 10 / 300 : 300 * 10 / 300 : 200 * 10 / 300 : 100 * 10 / 300 = 10 : 10 : 7 : 4, rounding up to get integers, and request: 4 executors: a, b, c, d 3 executors: a, b, c 3 executors: a, b The probability of a and b is highest, and d is lowest, basically following the distribution of the data. If we request 1 executor, this would be {{300 * 1 / 300 : 300 * 1 / 300 : 200 * 1 / 300 : 100 * 1 / 300 = 1 : 1 : 1 : 1}}, so each node has an equal chance to get the executor. If {{task number <= executor number * cores}}, which means more resources are requested than the tasks demand, both the above method and this ratio-based method are OK, since they will by chance be the same; but the ratio-based implementation does not need to consider this special case - the algorithm is the same in every situation. If we already have some nodes with executors allocated, say the current allocation on nodes a, b, c, d is 3 : 3 : 0 : 0, and we still need to request 10 executors, then ideally the ratio changes to 1 : 1 : 7 : 4 by equal probability. And since we already have 3 executors on a and b, we actually only need 4 executors; rounding the ratio to a base of 4 (1 : 1 : 4 : 3), the executor allocation changes to: 1 executor: a, b, c, d 2 executors: c, d 1 executor: c and the remaining 6 executor requests go to a, b, c, d with equal chance. This will keep the ratio close to the optimal 3 : 3 : 2 : 1. What do you think about this algorithm? It's fairly general; one concern is that it does not take core numbers into consideration. was (Author: jerryshao): Hi [~sandyr], I have a proposal based on ratios to calculate node locality which can cover all situations, even at run-time under dynamic allocation. Say we have 300 tasks: 200 tasks prefer nodes a, b, c and 100 tasks prefer nodes a, b, d, so the node locality ratio for a, b, c, d is 300 : 300 : 200 : 100. Now we need to allocate 10 executors, so according to this ratio distribution we can calculate the best placement of the 10 executors: 300 * 10 / 300 : 300 * 10 / 300 : 200 * 10 / 300 : 100 * 10 / 300 = 10 : 10 : 7 : 4, rounding up to get integers, and request: 4 executors: a, b, c, d 3 executors: a, b, c 3 executors: a, b The probability of a and b is highest, and d is lowest, basically following the distribution of the data. If we request 1 executor, this would be {{300 * 1 / 300 : 300 * 1 / 300 : 200 * 1 / 300 : 100 * 1 / 300 = 1 : 1 : 1 : 1}}, so each node has an equal chance to get the executor. If {{task number <= executor number * cores}}, which means more resources are requested than the tasks demand, both the above method and this ratio-based method are OK, since they will by chance be the same; but the ratio-based implementation does not need to consider this special case - the algorithm is the same in every situation. 
If we already have some nodes with executors allocated, say for example the current allocation on nodes a, b, c, d is 3 : 3 : 0 : 0, and we still need to request 10 executors: originally the ratio is 3 : 3 : 2 : 1, so we would get 10 executors on nodes a, b, c, d as 3 : 3 : 2 : 2 by equal probability. And since we already have 3 executors on a and b, we actually only need 4 executors on c and d to satisfy the ratio, and the remaining 6 are finally left for a, b, c, d to increase the executor numbers equally (since by now the probability is already satisfied). What do you think about this algorithm? It's fairly general; one concern is that it does not take core numbers into consideration. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
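[Editor's note] To make the arithmetic in the proposal concrete, here is the 10-executor example computed directly (plain Python; the weights and counts are the ones from the comment, and the rounding is the round-up that the 10 : 10 : 7 : 4 figures imply):
{code}
import math

weights = {'a': 300, 'b': 300, 'c': 200, 'd': 100}   # task-locality counts
executors = 10
top = max(weights.values())                           # 300

alloc = {n: int(math.ceil(w * executors / float(top)))
         for n, w in weights.items()}
print(alloc)   # {'a': 10, 'b': 10, 'c': 7, 'd': 4}
{code}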
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568839#comment-14568839 ] Saisai Shao commented on SPARK-4352: Hi [~steve_l], thanks a lot for your suggestions. I don't have a strong background in YARN, so I will try to understand your suggestions and change the code accordingly :). Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8032) Make version checking in mllib/__init__.py
[ https://issues.apache.org/jira/browse/SPARK-8032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8032: --- Summary: Make version checking in mllib/__init__.py (was: Make version checking in mllib/__init__.py more robust for version NumPy 1.10) Make version checking in mllib/__init__.py -- Key: SPARK-8032 URL: https://issues.apache.org/jira/browse/SPARK-8032 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Manoj Kumar The current check tests whether the version string `1.x` is less than `1.4`; this will fail once x has more than one digit, since then x > 4 numerically but the string `1.x` < `1.4`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7680) Add a fake Receiver that generates random strings, useful for prototyping
[ https://issues.apache.org/jira/browse/SPARK-7680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14564815#comment-14564815 ] Rohith Yeravothula edited comment on SPARK-7680 at 6/2/15 8:03 AM: --- Written a dummy receiver whose onStart and onStop methods do nothing and whose receive method returns a random string on every recursive call. Please mention if anything else needs to be added. was (Author: rohith): can you please give some more details about it? Add a fake Receiver that generates random strings, useful for prototyping - Key: SPARK-7680 URL: https://issues.apache.org/jira/browse/SPARK-7680 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6988) Fix Spark SQL documentation for 1.3.x
[ https://issues.apache.org/jira/browse/SPARK-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568798#comment-14568798 ] Sean Owen commented on SPARK-6988: -- [~Saurabh Santhosh] This isn't how you report new issues. This issue is closed. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You also need to report changes against master. This is already fixed. Fix Spark SQL documentation for 1.3.x - Key: SPARK-6988 URL: https://issues.apache.org/jira/browse/SPARK-6988 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Olivier Girardot Assignee: Olivier Girardot Priority: Minor Fix For: 1.3.2, 1.4.0 There are a few glitches regarding the DataFrame API usage in Java. The most important one being how to map a DataFrame result, using the javaRDD method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8015) flume-sink should not depend on Guava.
[ https://issues.apache.org/jira/browse/SPARK-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-8015. -- Resolution: Fixed Fix Version/s: 1.4.0 flume-sink should not depend on Guava. -- Key: SPARK-8015 URL: https://issues.apache.org/jira/browse/SPARK-8015 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.4.0 The flume-sink module, due to the shared shading code in our build, ends up depending on the {{org.spark-project}} Guava classes. That means users who deploy the sink in Flume will also need to provide those classes somehow, generally by also adding the Spark assembly, which means adding a whole bunch of other libraries to Flume, which may or may not cause other unforeseen problems. It's better to not have that dependency in the flume-sink module instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7894: - Target Version/s: (was: 1.5.0) Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur along the borders of the graphs. For vertices, it's quite natural to just take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
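[Editor's note] Since GraphX is Scala-only, here is a language-neutral toy of the proposed semantics in plain Python, just to illustrate the role of mergeEdges (the dict/set representation is purely illustrative):
{code}
# Toy model: a graph is {'V': set_of_vertex_ids, 'E': {(src, dst): attr}}.
def union(g, h, merge_edges):
    vertices = g['V'] | h['V']                       # VG ∪ VH
    edges = dict(g['E'])                             # start from EG
    for key, attr in h['E'].items():                 # fold in EH
        edges[key] = merge_edges(edges[key], attr) if key in edges else attr
    return {'V': vertices, 'E': edges}

G = {'V': {1, 2}, 'E': {(1, 2): 1.0}}
H = {'V': {2, 3}, 'E': {(1, 2): 2.0, (2, 3): 1.0}}
print(union(G, H, max))   # duplicate edge (1, 2) merged via max -> 2.0
{code}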
[jira] [Created] (SPARK-8048) Explicit partitionning of an RDD with 0 partition will yield empty outer join
Olivier Toupin created SPARK-8048: - Summary: Explicit partitioning of an RDD with 0 partitions will yield an empty outer join Key: SPARK-8048 URL: https://issues.apache.org/jira/browse/SPARK-8048 Project: Spark Issue Type: Bug Reporter: Olivier Toupin Priority: Minor Check this code => https://gist.github.com/anonymous/0f935915f2bc182841f0 Because of this => {{.partitionBy(new HashPartitioner(0))}} the join will return an empty result. The normal expected behaviour here would be for the join to crash, raise an error, or return unjoined results; instead it yields an empty RDD. This is a trivial example, but imagine: {{.partitionBy(new HashPartitioner(previous.partitions.length))}}. You join on an empty previous RDD: the lookup table is empty, and Spark loses all your results instead of returning unjoined results, without warnings or errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
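[Editor's note] A hedged sketch of a defensive pattern that avoids the trap (PySpark; variable names are illustrative, and the clamp is a workaround suggestion, not the eventual fix):
{code}
# Clamp the partition count so a possibly-empty upstream RDD can never
# produce a 0-partition partitioner, which would silently empty the join.
previous = sc.parallelize([(1, 'x')]).filter(lambda kv: False)  # ends up empty
num_parts = max(previous.getNumPartitions(), 1)
lookup = previous.partitionBy(num_parts)
joined = sc.parallelize([(1, 'a'), (3, 'c')]).leftOuterJoin(lookup)
print(joined.collect())   # unjoined keys survive as (key, (value, None))
{code}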
[jira] [Commented] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569495#comment-14569495 ] Joseph K. Bradley commented on SPARK-7893: -- [~andyyehoo] I'm removing the target version; that should really be set by committers, and I think a case needs to be made for each operation separately since they could have very different utility and complexity. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7958) Failed StreamingContext.start() can leak active actors
[ https://issues.apache.org/jira/browse/SPARK-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7958: - Affects Version/s: (was: 1.4.0) 1.1.1 1.2.2 1.3.1 Failed StreamingContext.start() can leak active actors -- Key: SPARK-7958 URL: https://issues.apache.org/jira/browse/SPARK-7958 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.2.2, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 StreamingContext.start() can throw an exception because DStream.validateAtStart() fails (say, the checkpoint directory is not set for a StateDStream). But by then the JobScheduler, JobGenerator, and ReceiverTracker have already started, along with their actors, and those cannot be shut down because the only way to do so is to call StreamingContext.stop(), which cannot be called since the context has not been marked as ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
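[Editor's note] A sketch of the sequence being described, in PySpark terms (the leaked actors live on the Scala side; host/port are placeholders):
{code}
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
counts = (ssc.socketTextStream("localhost", 9999)
             .map(lambda word: (word, 1))
             .updateStateByKey(lambda new, old: sum(new) + (old or 0)))
counts.pprint()
try:
    ssc.start()    # fails: no checkpoint directory for the stateful stream
except Exception as e:
    # By now JobScheduler/JobGenerator/ReceiverTracker have started, and
    # stop() can't clean them up since the context never became ACTIVE.
    print(e)
{code}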
[jira] [Commented] (SPARK-7541) Check model save/load for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569496#comment-14569496 ] Joseph K. Bradley commented on SPARK-7541: -- Oh, I see. That sounds good to do, thanks! Check model save/load for MLlib 1.4 --- Key: SPARK-7541 URL: https://issues.apache.org/jira/browse/SPARK-7541 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Assignee: yuhao yang For each model which supports save/load methods, we need to verify: * These methods are tested in unit tests in Scala and Python (if save/load is supported in Python). * If a model's name, data members, or constructors have changed _at all_, then we likely need to support a new save/load format version. Different versions must be tested in unit tests to ensure backwards compatibility (i.e., verify we can load old model formats). * Examples in the programming guide should include save/load when available. It's important to try running each example in the guide whenever it is modified (since there are no automated tests). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8048) Explicit partitioning of an RDD with 0 partitions will yield an empty outer join
[ https://issues.apache.org/jira/browse/SPARK-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Toupin updated SPARK-8048: -- Affects Version/s: 1.3.1 Explicit partitioning of an RDD with 0 partitions will yield an empty outer join - Key: SPARK-8048 URL: https://issues.apache.org/jira/browse/SPARK-8048 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: Olivier Toupin Priority: Minor Check this code => https://gist.github.com/anonymous/0f935915f2bc182841f0 Because of this => {{.partitionBy(new HashPartitioner(0))}} the join will return an empty result. The normal expected behaviour here would be for the join to crash, raise an error, or return unjoined results; instead it yields an empty RDD. This is a trivial example, but imagine: {{.partitionBy(new HashPartitioner(previous.partitions.length))}}. You join on an empty previous RDD: the lookup table is empty, and Spark loses all your results instead of returning unjoined results, without warnings or errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7958) Failed StreamingContext.start() can leak active actors
[ https://issues.apache.org/jira/browse/SPARK-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7958: - Target Version/s: 1.4.0 (was: 1.4.1) Failed StreamingContext.start() can leak active actors -- Key: SPARK-7958 URL: https://issues.apache.org/jira/browse/SPARK-7958 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.1, 1.2.2, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 StreamingContext.start() can throw an exception because DStream.validateAtStart() fails (say, the checkpoint directory is not set for a StateDStream). But by then the JobScheduler, JobGenerator, and ReceiverTracker have already started, along with their actors, and those cannot be shut down because the only way to do so is to call StreamingContext.stop(), which cannot be called since the context has not been marked as ACTIVE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7985) Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples.
[ https://issues.apache.org/jira/browse/SPARK-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7985. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6514 [https://github.com/apache/spark/pull/6514] Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples. Key: SPARK-7985 URL: https://issues.apache.org/jira/browse/SPARK-7985 Project: Spark Issue Type: Bug Components: Documentation, ML Reporter: Mike Dusenberry Priority: Minor Fix For: 1.4.0 Update the ML Doc's Estimator, Transformer, and Param Scala and Java examples to use model.extractParamMap instead of model.fittingParamMap, which no longer exists. Remove all other references to fittingParamMap throughout Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569623#comment-14569623 ] Sandy Ryza commented on SPARK-4352: --- In the case where the task number <= executor number * cores, I think my earlier argument still stands. Any executor requests beyond the ones needed to satisfy our preferences should be submitted without locality preferences. This means we will be less likely to bunch up requests on particular nodes where executors are not needed. Consider the extreme case where we want to request 100 executors but only have a single task with locality preferences, for data on 3 nodes. Going purely by the ratio approach, we would end up requesting all 100 executors on those three nodes. For the other cases, your approach makes sense to me. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569490#comment-14569490 ] Joseph K. Bradley commented on SPARK-7893: -- I guess I'm OK with keeping an umbrella JIRA since I like organization in JIRA. But we should make sure to justify each operation so that we prioritize the ones which many users really need. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7893: - Issue Type: Umbrella (was: Improvement) Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7890) Document that Spark 2.11 now supports Kafka
[ https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7890: - Fix Version/s: (was: 1.4.1) (was: 1.5.0) 1.4.0 Document that Spark 2.11 now supports Kafka --- Key: SPARK-7890 URL: https://issues.apache.org/jira/browse/SPARK-7890 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical Fix For: 1.4.0 The building-spark.html page needs to be updated. It's a simple fix, just remove the caveat about Kafka. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8015) flume-sink should not depend on Guava.
[ https://issues.apache.org/jira/browse/SPARK-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8015: - Assignee: Marcelo Vanzin flume-sink should not depend on Guava. -- Key: SPARK-8015 URL: https://issues.apache.org/jira/browse/SPARK-8015 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.4.0 The flume-sink module, due to the shared shading code in our build, ends up depending on the {{org.spark-project}} Guava classes. That means users who deploy the sink in Flume will also need to provide those classes somehow, generally by also adding the Spark assembly, which means adding a whole bunch of other libraries to Flume, which may or may not cause other unforeseen problems. It's better to not have that dependency in the flume-sink module instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7985) Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples.
[ https://issues.apache.org/jira/browse/SPARK-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7985: - Assignee: Mike Dusenberry Remove fittingParamMap references. Update ML Doc Estimator, Transformer, and Param examples. Key: SPARK-7985 URL: https://issues.apache.org/jira/browse/SPARK-7985 Project: Spark Issue Type: Bug Components: Documentation, ML Reporter: Mike Dusenberry Assignee: Mike Dusenberry Priority: Minor Fix For: 1.4.0 Update the ML Doc's Estimator, Transformer, and Param Scala and Java examples to use model.extractParamMap instead of model.fittingParamMap, which no longer exists. Remove all other references to fittingParamMap throughout Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7893: - Target Version/s: (was: 1.5.0) Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Umbrella Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators on graphs can help users focus and think in terms of graphs, while performance optimization can be done internally and stay transparent to them. The list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/] * Union of Graphs ( G ∪ H ) * Intersection of Graphs ( G ∩ H ) * Graph Join * Difference of Graphs ( G – H ) * Graph Complement * Line Graph ( L(G) ) This issue will be an index of all these operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5784) Add StatsDSink to MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams closed SPARK-5784. Resolution: Not A Problem Add StatsDSink to MetricsSystem --- Key: SPARK-5784 URL: https://issues.apache.org/jira/browse/SPARK-5784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it would be useful to support sending metrics to StatsD in addition to [the existing Graphite support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala]. [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a StatsD adapter for the [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves {{metrics-statsd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5784) Add StatsDSink to MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569655#comment-14569655 ] Ryan Williams commented on SPARK-5784: -- [~varvind] seems like no; Spark packages would be a reasonable place to index / host a built version of this if someone wanted to do that! I didn't end up doing much with this myself. Add StatsDSink to MetricsSystem --- Key: SPARK-5784 URL: https://issues.apache.org/jira/browse/SPARK-5784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it would be useful to support sending metrics to StatsD in addition to [the existing Graphite support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala]. [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a StatsD adapter for the [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves {{metrics-statsd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8050) Make Savable and Loader Java-friendly.
[ https://issues.apache.org/jira/browse/SPARK-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8050: --- Assignee: Xiangrui Meng (was: Apache Spark) Make Savable and Loader Java-friendly. -- Key: SPARK-8050 URL: https://issues.apache.org/jira/browse/SPARK-8050 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0, 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7991) Python DataFrame: support passing a list into describe
[ https://issues.apache.org/jira/browse/SPARK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569650#comment-14569650 ] Amey Chaugule commented on SPARK-7991: -- [~rxin] : I'd like to work on this in case nobody else is. Python DataFrame: support passing a list into describe -- Key: SPARK-7991 URL: https://issues.apache.org/jira/browse/SPARK-7991 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter DataFrame.describe in Python takes a vararg, i.e. it can be invoked this way: {code} df.describe('col1', 'col2', 'col3') {code} Most of our DataFrame functions accept a list in addition to varargs. describe should do the same, i.e. it should also accept a Python list: {code} df.describe(['col1', 'col2', 'col3']) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
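[Editor's note] The usual PySpark idiom for supporting both call styles is a small vararg normalisation; a sketch of the pattern (illustrative only, not the actual patch):
{code}
def _to_list(*cols):
    # Accept describe('a', 'b') and describe(['a', 'b']) alike.
    if len(cols) == 1 and isinstance(cols[0], list):
        cols = cols[0]
    return list(cols)

assert _to_list('c1', 'c2') == _to_list(['c1', 'c2']) == ['c1', 'c2']
{code}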
[jira] [Assigned] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8049: --- Assignee: Apache Spark (was: Xiangrui Meng) OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569708#comment-14569708 ] Apache Spark commented on SPARK-8049: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6592 OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8049: --- Assignee: Xiangrui Meng (was: Apache Spark) OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8050) Make Savable and Loader Java-friendly.
[ https://issues.apache.org/jira/browse/SPARK-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8050: --- Assignee: Apache Spark (was: Xiangrui Meng) Make Savable and Loader Java-friendly. -- Key: SPARK-8050 URL: https://issues.apache.org/jira/browse/SPARK-8050 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0, 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Minor Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569671#comment-14569671 ] Greg Senia commented on SPARK-5159: --- The SparkSQL Thrift server does not adhere to hive.server2.enable.doAs even though it seems to implement HiveServer2's thrift service. Are there plans to implement this feature? Without it, the SparkSQL Thrift server seems to be of little use in a secure Kerberos environment where the spark/hive user does not have direct access to the data, due to audit reasons. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8051) StringIndexerModel (and other models) shouldn't complain if the input column is missing.
Xiangrui Meng created SPARK-8051: Summary: StringIndexerModel (and other models) shouldn't complain if the input column is missing. Key: SPARK-8051 URL: https://issues.apache.org/jira/browse/SPARK-8051 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If a transformer is not used during transformation, it should keep silent if the input column is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8014) DataFrame.write.mode("error").save(...) should not scan the output folder
[ https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8014. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6583 [https://github.com/apache/spark/pull/6583] DataFrame.write.mode("error").save(...) should not scan the output folder - Key: SPARK-8014 URL: https://issues.apache.org/jira/browse/SPARK-8014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jianshi Huang Assignee: Cheng Lian Fix For: 1.4.0 When saving a DataFrame with {{ErrorIfExists}} as the save mode, we shouldn't do metadata discovery if the destination folder exists. This also applies to {{SaveMode.Overwrite}} and {{SaveMode.Ignore}}. To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave an empty file {{bar}} there, then execute the following code in the Spark shell: {code}
import sqlContext._
import sqlContext.implicits._

Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
{code} From the exception stack trace we can see that the metadata discovery code path is executed: {noformat}
java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
  at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
  at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
  at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
  at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
  at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
  at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
  at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
  at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
  ...
Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
  at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
  at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
  at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
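[Editor's note] The requested behaviour can be summarised as: with mode "error", check the destination before any schema discovery. A tiny sketch of that ordering (plain Python; the names are illustrative, not Spark's internals):
{code}
import os

def discover_metadata_and_write(path):
    pass  # stands in for the expensive Parquet footer reading / writing

def save_error_if_exists(path):
    if os.path.exists(path):                  # cheap existence check first
        raise IOError("path already exists: %s" % path)
    discover_metadata_and_write(path)         # only reached for fresh paths
{code}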
[jira] [Created] (SPARK-8050) Make Savable and Loader Java-friendly.
Xiangrui Meng created SPARK-8050: Summary: Make Savable and Loader Java-friendly. Key: SPARK-8050 URL: https://issues.apache.org/jira/browse/SPARK-8050 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0, 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Should overload save/load to accept JavaSparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8049) OneVsRest's output includes a temp column
Xiangrui Meng created SPARK-8049: Summary: OneVsRest's output includes a temp column Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The temp accumulator column mbc$acc is included in the output which should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6164) CrossValidatorModel should keep stats from fitting
[ https://issues.apache.org/jira/browse/SPARK-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569714#comment-14569714 ] Leah McGuire commented on SPARK-6164: - I fixed the merge conflict. Should be good to go now. CrossValidatorModel should keep stats from fitting -- Key: SPARK-6164 URL: https://issues.apache.org/jira/browse/SPARK-6164 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Priority: Minor CrossValidator computes stats for each (model, fold) pair, but they are thrown out by the created model. CrossValidatorModel should keep this info and expose it to users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
Shixiong Zhu created SPARK-8057: --- Summary: Call TaskAttemptContext.getTaskAttemptID using Reflection Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-8057: Affects Version/s: 1.3.1 Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8057: --- Assignee: (was: Apache Spark) Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570043#comment-14570043 ] Apache Spark commented on SPARK-8057: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6599 Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8057) Call TaskAttemptContext.getTaskAttemptID using Reflection
[ https://issues.apache.org/jira/browse/SPARK-8057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8057: --- Assignee: Apache Spark Call TaskAttemptContext.getTaskAttemptID using Reflection - Key: SPARK-8057 URL: https://issues.apache.org/jira/browse/SPARK-8057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Reporter: Shixiong Zhu Assignee: Apache Spark Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But SparkHadoopMapRedUtil.commitTask broke it recently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8058) Add tests for SPARK-7853 and SPARK-8020
Yin Huai created SPARK-8058: --- Summary: Add tests for SPARK-7853 and SPARK-8020 Key: SPARK-8058 URL: https://issues.apache.org/jira/browse/SPARK-8058 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai This jira is used to track the work of adding tests for SPARK-7853 (make sure {{spark-shell}} with and without {{--jars}} works with the isolated class loader) and SPARK-8020 (we are using correct metastore versions and jars setting to initialize {{metadataHive}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8059) Reduce latency between executor requests and RM heartbeat
Marcelo Vanzin created SPARK-8059: - Summary: Reduce latency between executor requests and RM heartbeat Key: SPARK-8059 URL: https://issues.apache.org/jira/browse/SPARK-8059 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor This is a follow-up to SPARK-7533. On top of the changes made as part of that issue, we could reduce allocation latency by waking up the allocation thread when the driver sends new requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
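For illustration only, the wake-up pattern described above might look like the following sketch; every name here is hypothetical, not Spark's actual allocator code:

{code}
// Hypothetical sketch: the allocation thread sleeps at most one heartbeat
// interval, but a new driver request wakes it immediately.
object AllocatorWakeup {
  private val lock = new Object
  private val heartbeatIntervalMs = 3000L  // assumed default interval

  // Called when the driver sends a new executor request.
  def onDriverRequest(): Unit = lock.synchronized { lock.notifyAll() }

  // Called in the allocation thread's loop before each allocation pass.
  def waitForNextAllocation(): Unit = lock.synchronized { lock.wait(heartbeatIntervalMs) }
}
{code}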
[jira] [Commented] (SPARK-8040) Remove Debian specific loopback address setting code
[ https://issues.apache.org/jira/browse/SPARK-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570093#comment-14570093 ] Yuta Kurosaki commented on SPARK-8040: -- I am sorry that I couldn't explain it well. In that situation, I wrote my-pc.local 127.0.0.1 into /etc/hosts, but it was ignored. Do you think this is correct behavior? I have now found issue [SPARK-4389], which seems related to this. Can I re-create this issue once that one is resolved? Remove Debian specific loopback address setting code Key: SPARK-8040 URL: https://issues.apache.org/jira/browse/SPARK-8040 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Yuta Kurosaki Priority: Minor This issue relates to core/src/main/scala/org/apache/spark/util/Utils.scala. The method findLocalInetAddress should not return a non-loopback address when SPARK_LOCAL_IP is not set. The current implementation may cause errors: mainly in development environments, the interface IP address occasionally changes while Spark is running, but the implementation does not follow the change. So I suggest simpler behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8055) Spark Launcher Improvements
Marcelo Vanzin created SPARK-8055: - Summary: Spark Launcher Improvements Key: SPARK-8055 URL: https://issues.apache.org/jira/browse/SPARK-8055 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Filing a bug to track different enhancements to the Spark launcher library. Please file sub-tasks for each particular enhancement instead of tagging PRs with this bug's number. After some discussion on the mailing list, people have requested different enhancements to the library. I'll try to capture those here but feel free to add more in the comments. - Missing information about the launched Spark application. Currently the library returns an opaque Process object that doesn't have a lot of Spark-related functionality. It would be useful to get at least some information about the underlying process; at the very least the application ID of the actual Spark job. Other useful information could be, for example, the current status of the submitted job. - Ability to control the underlying application. The Process object only allows you to kill the underlying application. It would be better to have application-level APIs that try to stop the application more cleanly (e.g. by asking the cluster manager to kill it, or by stopping the SparkContext in client mode). - Ability to run Spark applications in the same JVM. This could potentially be done today for cluster mode apps without getting bit by the limitations of SparkContext. In the long run, it would be nice to also support client mode apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8055) Spark Launcher Improvements
[ https://issues.apache.org/jira/browse/SPARK-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1457#comment-1457 ] Marcelo Vanzin commented on SPARK-8055: --- /cc [~klmarkey] [~chester.c...@webwarecorp.com] Spark Launcher Improvements --- Key: SPARK-8055 URL: https://issues.apache.org/jira/browse/SPARK-8055 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Filing a bug to track different enhancements to the Spark launcher library. Please file sub-tasks for each particular enhancement instead of tagging PRs with this bug's number. After some discussion on the mailing list, people have requested different enhancements to the library. I'll try to capture those here but feel free to add more in the comments. - Missing information about the launched Spark application. Currently the library returns an opaque Process object that doesn't have a lot of Spark-related functionality. It would be useful to get at least some information about the underlying process; at the very least the application ID of the actual Spark job. Other useful information could be, for example, the current status of the submitted job. - Ability to control the underlying application. The Process object only allows you to kill the underlying application. It would be better to have application-level APIs that try to stop the application more cleanly (e.g. by asking the cluster manager to kill it, or by stopping the SparkContext in client mode). - Ability to run Spark applications in the same JVM. This could potentially be done today for cluster mode apps without getting bit by the limitations of SparkContext. In the long run, it would be nice to also support client mode apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
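For context, here is roughly what using the launcher library looks like today (the builder methods and launch() are the existing API; the jar path and class name are placeholders). The caller gets back only a java.lang.Process, which is the limitation the first two bullets describe:

{code}
import org.apache.spark.launcher.SparkLauncher

// launch() returns a plain java.lang.Process: no application ID, no job
// status, and the only control available is Process.destroy().
val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")  // placeholder path
  .setMainClass("com.example.MyApp")   // placeholder class
  .setMaster("yarn-cluster")
  .launch()
process.waitFor()
{code}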
[jira] [Resolved] (SPARK-8026) Add Column.alias to Scala/Java API
[ https://issues.apache.org/jira/browse/SPARK-8026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8026. Resolution: Fixed Fix Version/s: 1.4.0 Add Column.alias to Scala/Java API -- Key: SPARK-8026 URL: https://issues.apache.org/jira/browse/SPARK-8026 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 To be consistent with the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
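Usage is a one-liner; a minimal sketch assuming an existing DataFrame df with a column named colA:

{code}
df.select(df("colA").alias("a"))  // equivalent to the existing .as("a")
{code}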
[jira] [Commented] (SPARK-8059) Reduce latency between executor requests and RM heartbeat
[ https://issues.apache.org/jira/browse/SPARK-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570073#comment-14570073 ] Apache Spark commented on SPARK-8059: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/6600 Reduce latency between executor requests and RM heartbeat - Key: SPARK-8059 URL: https://issues.apache.org/jira/browse/SPARK-8059 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor This is a follow-up to SPARK-7533. On top of the changes made as part of that issue, we could reduce allocation latency by waking up the allocation thread when the driver sends new requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8056) Design an easier way to construct schema for both Scala and Python
Reynold Xin created SPARK-8056: -- Summary: Design an easier way to construct schema for both Scala and Python Key: SPARK-8056 URL: https://issues.apache.org/jira/browse/SPARK-8056 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin StructType is fairly hard to construct, especially in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
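To illustrate the verbosity, here is what a two-field schema currently takes in Scala (field names are arbitrary); the Python equivalent is even more ceremony:

{code}
import org.apache.spark.sql.types._

// Every field needs an explicit StructField with a type object and a
// nullability flag.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
{code}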
[jira] [Updated] (SPARK-8049) OneVsRest's output includes a temp column
[ https://issues.apache.org/jira/browse/SPARK-8049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8049: - Fix Version/s: 1.5.0 1.4.1 OneVsRest's output includes a temp column - Key: SPARK-8049 URL: https://issues.apache.org/jira/browse/SPARK-8049 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.1, 1.5.0 The temp accumulator column mbc$acc is included in the output; it should be removed with withoutColumn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7558) Log test name when starting and finishing each test
[ https://issues.apache.org/jira/browse/SPARK-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570033#comment-14570033 ] Apache Spark commented on SPARK-7558: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6598 Log test name when starting and finishing each test --- Key: SPARK-7558 URL: https://issues.apache.org/jira/browse/SPARK-7558 Project: Spark Issue Type: Sub-task Components: Tests Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.5.0 Right now it's really tough to interpret testing output because logs for different tests are interspersed in the same unit-tests.log file. This makes it particularly hard to diagnose flaky tests. This would be much easier if we logged the test name before and after every test (e.g. Starting test XX, Finished test XX). Then you could get right to the logs. I think one way to do this might be to create a custom test fixture that logs the test class name and then mix that into every test suite. /cc [~joshrosen] for his superb knowledge of ScalaTest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
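One possible shape for such a fixture (an illustrative sketch, not necessarily the eventual Spark implementation), using ScalaTest's withFixture hook:

{code}
import org.scalatest.{FunSuite, Outcome}

// Mix this into a suite to bracket every test with start/finish log lines.
trait LogTestName extends FunSuite {
  override protected def withFixture(test: NoArgTest): Outcome = {
    val name = s"${getClass.getSimpleName}.${test.name}"
    println(s"Starting test $name")  // a real version would use the logging framework
    try super.withFixture(test)
    finally println(s"Finished test $name")
  }
}
{code}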
[jira] [Assigned] (SPARK-8059) Reduce latency between executor requests and RM heartbeat
[ https://issues.apache.org/jira/browse/SPARK-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8059: --- Assignee: (was: Apache Spark) Reduce latency between executor requests and RM heartbeat - Key: SPARK-8059 URL: https://issues.apache.org/jira/browse/SPARK-8059 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Minor This is a follow-up to SPARK-7533. On top of the changes made as part of that issue, we could reduce allocation latency by waking up the allocation thread when the driver sends new requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7879) KMeans API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570089#comment-14570089 ] Yu Ishikawa commented on SPARK-7879: I will implement it. KMeans API for spark.ml Pipelines - Key: SPARK-7879 URL: https://issues.apache.org/jira/browse/SPARK-7879 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Create a K-Means API for the spark.ml Pipelines API. This should wrap the existing KMeans implementation in spark.mllib. This should be the first clustering method added to Pipelines, and it will be important to consider [SPARK-7610] and think about designing the clustering API. We do not have to have abstractions from the beginning (and probably should not) but should think far enough ahead so we can add abstractions later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7991) Python DataFrame: support passing a list into describe
[ https://issues.apache.org/jira/browse/SPARK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569788#comment-14569788 ] Reynold Xin commented on SPARK-7991: Please go ahead. This one should be simple. Python DataFrame: support passing a list into describe -- Key: SPARK-7991 URL: https://issues.apache.org/jira/browse/SPARK-7991 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter DataFrame.describe in Python takes a vararg, i.e. it can be invoked this way: {code} df.describe('col1', 'col2', 'col3') {code} Most of our DataFrame functions accept a list in addition to varargs. describe should do the same, i.e. it should also accept a Python list: {code} df.describe(['col1', 'col2', 'col3']) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8053) ElementwiseProduct scalingVec param name should match between ml,mllib
Joseph K. Bradley created SPARK-8053: Summary: ElementwiseProduct scalingVec param name should match between ml,mllib Key: SPARK-8053 URL: https://issues.apache.org/jira/browse/SPARK-8053 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor spark.mllib's ElementwiseProduct uses scalingVector. spark.ml's ElementwiseProduct uses scalingVec. We should make them match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
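Concretely, the mismatch looks like this (a sketch assuming the 1.4 signatures the issue describes; the renamed imports only avoid the class-name clash):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.{ElementwiseProduct => MllibEP}
import org.apache.spark.ml.feature.{ElementwiseProduct => MlEP}

val v = Vectors.dense(0.0, 1.0, 2.0)
val mllibTransformer = new MllibEP(v)            // constructor param: scalingVector
val mlTransformer = new MlEP().setScalingVec(v)  // Param name: scalingVec
{code}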
[jira] [Commented] (SPARK-8054) Java compatibility fixes for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-8054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569900#comment-14569900 ] Apache Spark commented on SPARK-8054: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/6562 Java compatibility fixes for MLlib 1.4 -- Key: SPARK-8054 URL: https://issues.apache.org/jira/browse/SPARK-8054 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley See [SPARK-7529] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7529) Java compatibility check for MLlib 1.4
[ https://issues.apache.org/jira/browse/SPARK-7529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7529. -- Resolution: Fixed Fix Version/s: 1.4.0 I'm marking this Fixed. Since the PR with the fixes will go into 1.4.1 and 1.5, this JIRA is complete. Java compatibility check for MLlib 1.4 -- Key: SPARK-7529 URL: https://issues.apache.org/jira/browse/SPARK-7529 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Joseph K. Bradley Fix For: 1.4.0 Check Java compatibility for MLlib 1.4. We should create separate JIRAs for each possible issue. Checking compatibility means: * comparing with the Scala doc * verifying that Java docs are not messed up by Scala type incompatibilities. Some items to look out for are: ** Check for generic Object types where Java cannot understand complex Scala types. ** Check Scala objects (especially with nesting!) carefully. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. * If needed for complex issues, create small Java unit tests which execute each method. (The correctness can be checked in Scala.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2551) Cleanup FilteringParquetRowInputFormat
[ https://issues.apache.org/jira/browse/SPARK-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569912#comment-14569912 ] Thomas Omans commented on SPARK-2551: - Wanted to chime in since I upgraded Parquet re: SPARK-7743. After looking at the PARQUET-16 issue, it looks like pull request https://github.com/apache/parquet-mr/pull/17, opened by [~liancheng] (the reporter of PARQUET-16), was closed as resolved by https://github.com/apache/parquet-mr/pull/45 (which is included in the 1.7.0 upgrade). That means these reflection hacks should be ready for removal, or at the very least that the PARQUET-16 ticket should be closed ;) Cleanup FilteringParquetRowInputFormat -- Key: SPARK-2551 URL: https://issues.apache.org/jira/browse/SPARK-2551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Priority: Minor To work around [PARQUET-16|https://issues.apache.org/jira/browse/PARQUET-16] and fix [SPARK-2119|https://issues.apache.org/jira/browse/SPARK-2119], we did some reflection hacks in {{FilteringParquetRowInputFormat}}. This should be cleaned up once PARQUET-16 is fixed. A PR for PARQUET-16 is [here|https://github.com/apache/incubator-parquet-mr/pull/17]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8051) StringIndexerModel (and other models) shouldn't complain if the input column is missing.
[ https://issues.apache.org/jira/browse/SPARK-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8051: --- Assignee: Apache Spark (was: Xiangrui Meng) StringIndexerModel (and other models) shouldn't complain if the input column is missing. Key: SPARK-8051 URL: https://issues.apache.org/jira/browse/SPARK-8051 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark If a transformer is not used during transformation, it should keep silent if the input column is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8051) StringIndexerModel (and other models) shouldn't complain if the input column is missing.
[ https://issues.apache.org/jira/browse/SPARK-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569796#comment-14569796 ] Apache Spark commented on SPARK-8051: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6595 StringIndexerModel (and other models) shouldn't complain if the input column is missing. Key: SPARK-8051 URL: https://issues.apache.org/jira/browse/SPARK-8051 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng If a transformer is not used during transformation, it should keep silent if the input column is missing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
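A hypothetical sketch of the proposed behavior (purely illustrative; the actual fix may differ): during schema transformation, pass the schema through unchanged when the input column is absent instead of throwing.

{code}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: validate only when the input column exists;
// otherwise keep silent and return the schema unchanged.
def transformSchemaIfPresent(schema: StructType, inputCol: String)(
    validate: StructType => StructType): StructType = {
  if (schema.fieldNames.contains(inputCol)) validate(schema) else schema
}
{code}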