[jira] [Commented] (SPARK-10180) JDBCRDD does not process EqualNullSafe filter.
[ https://issues.apache.org/jira/browse/SPARK-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743339#comment-14743339 ] Apache Spark commented on SPARK-10180: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/8743
> JDBCRDD does not process EqualNullSafe filter.
> --
>
> Key: SPARK-10180
> URL: https://issues.apache.org/jira/browse/SPARK-10180
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{JDBCRelation}} simply passes the EqualNullSafe (source.filter) predicate through, but
> {{compileFilter()}} in {{JDBCRDD}} does not handle it.
> It would be a single-line update.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
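For readers following the PR, a minimal sketch of the missing case in {{compileFilter()}}, assuming {{compileValue}} is the existing literal-quoting helper used by the EqualTo case; the null-safe equality is spelled out in portable ANSI SQL since {{<=>}} is not supported by every JDBC dialect:
{code}
// Hypothetical one-case addition; the actual PR (#8743) may differ.
case EqualNullSafe(attr, value) =>
  s"(NOT ($attr != ${compileValue(value)} OR $attr IS NULL OR " +
    s"${compileValue(value)} IS NULL) OR ($attr IS NULL AND ${compileValue(value)} IS NULL))"
{code}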
[jira] [Created] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work
Cheng Lian created SPARK-10588:
--
Summary: Saving a DataFrame containing only nulls to JSON doesn't work
Key: SPARK-10588
URL: https://issues.apache.org/jira/browse/SPARK-10588
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian

Snippets to reproduce this issue:
{noformat}
val path = "file:///tmp/spark/null"

// A single row containing a single null double, saving to JSON, wrong
sqlContext.
  range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
  write.mode("overwrite").json(path)
sqlContext.read.json(path).show()
++
||
++
||
++

// Two rows each containing a single null double, saving to JSON, wrong
sqlContext.
  range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
  write.mode("overwrite").json(path)
sqlContext.read.json(path).show()
++
||
++
||
||
++

// A single row containing two null doubles, saving to JSON, wrong
sqlContext.
  range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS c1").
  write.mode("overwrite").json(path)
sqlContext.read.json(path).show()
++
||
++
||
++

// A single row containing a single null double, saving to Parquet, OK
sqlContext.
  range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
  write.mode("overwrite").parquet(path)
sqlContext.read.parquet(path).show()
+----+
|  c0|
+----+
|null|
+----+

// Two rows, one containing a single null double, one containing non-null double, saving to JSON, OK
sqlContext.
  range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
  write.mode("overwrite").json(path)
sqlContext.read.json(path).show()
+----+
|  c0|
+----+
|null|
| 1.0|
+----+
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10587) In pyspark, toDF() doesn't exist in RDD object
[ https://issues.apache.org/jira/browse/SPARK-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743334#comment-14743334 ] Sean Owen commented on SPARK-10587: --- It's in {{python/pyspark/sql/context.py}}. Are you sure your imports are in order? This is probably a question for user@, not a JIRA, at this point. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> In pyspark, toDF() doesn't exist in RDD object
> ---
>
> Key: SPARK-10587
> URL: https://issues.apache.org/jira/browse/SPARK-10587
> Project: Spark
> Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: SemiCoder
>
> I can't find the toDF() function in RDD.
> In pyspark.mllib.linalg.distributed, the IndexedRowMatrix.__init__()
> requires that rows be an RDD and executes rows.toDF(), but actually the
> RDD in pyspark doesn't have a toDF() function
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10589) Add defense against external site framing
[ https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10589: Assignee: Apache Spark (was: Sean Owen) > Add defense against external site framing > - > > Key: SPARK-10589 > URL: https://issues.apache.org/jira/browse/SPARK-10589 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > This came up as a minor point during a security audit using a common scanning > tool: It's best if Spark UIs try to actively defend against certain types of > frame-related vulnerabilities by setting X-Frame-Options. See > https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet > Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743415#comment-14743415 ] Danil Mironov commented on SPARK-2960: -- The title of the issue is not that misleading: when one doesn't get spark-something running after typing 'spark-something', that is commonly known as 'spark executables fail to start'. Following the symlinks does fix the issue at hand. Having executables {quote}configured by {{SPARK_HOME}} and/or {{SPARK_CONF_DIR}}{quote} would be a nice solution; I'd vote for that. This implies scripts treating those configurations as read-only and quitting early and loudly if the latter is missing or crippled. That's some rework, though, not a bug to fix.
> Spark executables fail to start via symlinks
>
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
> Issue Type: Bug
> Components: Deploy
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via
> symlinks. A common Linux scenario would be to have Spark installed somewhere
> (e.g. /opt) and have a symlink to it in /usr/bin.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10589) Add defense against external site framing
Sean Owen created SPARK-10589: - Summary: Add defense against external site framing Key: SPARK-10589 URL: https://issues.apache.org/jira/browse/SPARK-10589 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor This came up as a minor point during a security audit using a common scanning tool: It's best if Spark UIs try to actively defend against certain types of frame-related vulnerabilities by setting X-Frame-Options. See https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
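For context, a minimal sketch of one way to set the header: a plain servlet filter wrapped around the UI handlers. The class name and the SAMEORIGIN policy here are illustrative, not necessarily what the coming PR does:
{code}
import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}
import javax.servlet.http.HttpServletResponse

// Stamps X-Frame-Options on every response passing through the filter chain,
// telling browsers not to render the UI inside third-party frames.
class XFrameOptionsFilter extends Filter {
  override def init(config: FilterConfig): Unit = ()
  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    res.asInstanceOf[HttpServletResponse].setHeader("X-Frame-Options", "SAMEORIGIN")
    chain.doFilter(req, res)
  }
  override def destroy(): Unit = ()
}
{code}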
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743421#comment-14743421 ] Apache Spark commented on SPARK-1537: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/8744 > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Vanzin > Attachments: SPARK-1537.txt, spark-1573.patch > > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join
[ https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743426#comment-14743426 ] Jian Feng Zhang commented on SPARK-10577: - I'd like to take this and create a pull request.
> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
> Labels: starter
>
> As in https://issues.apache.org/jira/browse/SPARK-8300,
> there should be a way to add a hint for a broadcast join in:
> - PySpark
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
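For reference, the Scala-side hint added under SPARK-8300, which this issue asks to mirror in PySpark ({{largeDF}}, {{smallDF}} and {{"key"}} are placeholders):
{code}
import org.apache.spark.sql.functions.broadcast

// Hints the planner to broadcast the smaller side of the join.
largeDF.join(broadcast(smallDF), "key")
{code}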
[jira] [Assigned] (SPARK-10589) Add defense against external site framing
[ https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10589: Assignee: Sean Owen (was: Apache Spark) > Add defense against external site framing > - > > Key: SPARK-10589 > URL: https://issues.apache.org/jira/browse/SPARK-10589 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > This came up as a minor point during a security audit using a common scanning > tool: It's best if Spark UIs try to actively defend against certain types of > frame-related vulnerabilities by setting X-Frame-Options. See > https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet > Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10589) Add defense against external site framing
[ https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743470#comment-14743470 ] Apache Spark commented on SPARK-10589: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/8745 > Add defense against external site framing > - > > Key: SPARK-10589 > URL: https://issues.apache.org/jira/browse/SPARK-10589 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > This came up as a minor point during a security audit using a common scanning > tool: It's best if Spark UIs try to actively defend against certain types of > frame-related vulnerabilities by setting X-Frame-Options. See > https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet > Easy PR coming ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rustam Aliyev updated SPARK-7442: - Comment: was deleted (was: Hit this bug today. It basically makes Spark on AWS useless for many scenarios. Please prioritise.)
> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive
[ https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743555#comment-14743555 ] Joseph Fourny commented on SPARK-4815: -- Is this really fixed? I am on Spark 1.5.0 (rc3) and I see very little isolation between JDBC connections to the ThriftServer. For example, "SET X=Y" or "USE DATABASE X" on one connection immediately affects all other connections. This is extremely undesirable behavior. Was there a regression at some point?
> ThriftServer use only one SessionState to run sql using hive
> -
>
> Key: SPARK-4815
> URL: https://issues.apache.org/jira/browse/SPARK-4815
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.3.0
>Reporter: guowei
>
> ThriftServer uses only one SessionState to run SQL via Hive, even though requests
> come from different Hive sessions.
> This causes mistakes:
> for example, when one user runs "use database" in one beeline client, the database
> changes in the other beeline clients too.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743531#comment-14743531 ] Steve Loughran commented on SPARK-2356: --- The original JIRA here is just that there's an error being printed out; in that specific example it is just noise. You can configure log4j not to log anything from {{org.apache.hadoop.util.Shell}} and you won't see this text. The other issues people are finding are actual problems: Hadoop and the libraries underneath are trying to load WINUTILS.EXE for real work, and failing.
> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
> Issue Type: Bug
> Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on a cluster
> (YARN, Linux machines). However, when I try to run them on a local machine
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
> at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
> at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
> at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
> at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
> at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
> at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
> at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
> at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
> at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
> at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark context
> is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is required
> (or starting this configuration manually)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
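A sketch of the log4j suppression Steve describes, done programmatically with the log4j 1.x API that Spark 1.x ships (the {{conf/log4j.properties}} equivalent would be a {{log4j.logger.org.apache.hadoop.util.Shell=OFF}} line):
{code}
import org.apache.log4j.{Level, Logger}

// Silence the winutils noise emitted by Hadoop's Shell class.
Logger.getLogger("org.apache.hadoop.util.Shell").setLevel(Level.OFF)
{code}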
[jira] [Commented] (SPARK-6961) Cannot save data to parquet files when executing from Windows from a Maven Project
[ https://issues.apache.org/jira/browse/SPARK-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743537#comment-14743537 ] Steve Loughran commented on SPARK-6961: --- Well, it's an installation-side issue in that "if it isn't there you can fix it with a re-installation". But the fact that things are failing with an utterly useless error message is very much a code-side issue. HADOOP-10775 is going to add extra checks and a link to a wiki entry (https://wiki.apache.org/hadoop/WindowsProblems) with some advice. One trouble spot there is that code is often just referencing a field (which is set to null on a load failure); the patch will have to make sure we switch to exception-raising getters as needed, and that the callers handle the raised exceptions properly.
> Cannot save data to parquet files when executing from Windows from a Maven
> Project
> --
>
> Key: SPARK-6961
> URL: https://issues.apache.org/jira/browse/SPARK-6961
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.3.0
>Reporter: Bogdan Niculescu
>Priority: Critical
>
> I have set up a project where I am trying to save a DataFrame into a parquet
> file. My project is a Maven one with Spark 1.3.0 and Scala 2.11.5:
> {code:xml}
> <spark.version>1.3.0</spark.version>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.11</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql_2.11</artifactId>
>   <version>${spark.version}</version>
> </dependency>
> {code}
> A simple version of my code that consistently reproduces the problem I am
> seeing is:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> case class Person(name: String, age: Int)
> object DataFrameTest extends App {
>   val conf = new SparkConf().setMaster("local[4]").setAppName("DataFrameTest")
>   val sc = new SparkContext(conf)
>   val sqlContext = new SQLContext(sc)
>   val persons = List(Person("a", 1), Person("b", 2))
>   val rdd = sc.parallelize(persons)
>   val dataFrame = sqlContext.createDataFrame(rdd)
>   dataFrame.saveAsParquetFile("test.parquet")
> }
> {code}
> All the time the exception that I am getting is:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
> at org.apache.hadoop.util.Shell.run(Shell.java:379)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
> at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
> at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
> at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
> at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772)
> at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:409)
> at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:401)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:443)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:240)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:256)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
> at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
> at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
> at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
> at
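A commonly reported workaround for this class of failure, separate from the HADOOP-10775 work above: point Hadoop at a directory whose {{bin}} subdirectory contains {{winutils.exe}} before the SparkContext is created. The path below is illustrative:
{code}
// Must run before any Hadoop/Spark class triggers Shell's static initializer.
System.setProperty("hadoop.home.dir", "C:\\hadoop")
{code}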
[jira] [Commented] (SPARK-10550) SQLListener error constructing extended SQLContext
[ https://issues.apache.org/jira/browse/SPARK-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743572#comment-14743572 ] shao lo commented on SPARK-10550: - There are parts that are marked as experimental. This is not in that category. The reason to make a class have protected access is exactly to promote extension.
> SQLListener error constructing extended SQLContext
> ---
>
> Key: SPARK-10550
> URL: https://issues.apache.org/jira/browse/SPARK-10550
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.5.0
>Reporter: shao lo
>Priority: Minor
>
> With Spark 1.4.1 I was able to create a custom SQLContext class. With Spark
> 1.5.0, I now get an error calling the superclass constructor. The problem
> is related to this new code that was added between 1.4.1 and 1.5.0:
> {code}
> // `listener` should be only used in the driver
> @transient private[sql] val listener = new SQLListener(this)
> sparkContext.addSparkListener(listener)
> sparkContext.ui.foreach(new SQLTab(this, _))
> {code}
> ...which generates:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.spark.sql.execution.ui.SQLListener.<init>(SQLListener.scala:34)
> at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743539#comment-14743539 ] Rustam Aliyev commented on SPARK-7442: -- Hit this bug today. It basically makes Spark on AWS useless for many scenarios. Please prioritise.
> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
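One workaround that has been reported for the Hadoop 2.6 build, assuming the {{hadoop-aws}} jar and a matching {{aws-java-sdk}} jar are already on the classpath (Hadoop 2.6 moved the S3 filesystems out of hadoop-common, so the mapping may no longer be registered by default):
{code}
// Register the s3n implementation explicitly on the active SparkContext.
sc.hadoopConfiguration.set(
  "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
{code}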
[jira] [Created] (SPARK-10590) Spark with YARN build is broken
Kevin Tsai created SPARK-10590:
--
Summary: Spark with YARN build is broken
Key: SPARK-10590
URL: https://issues.apache.org/jira/browse/SPARK-10590
Project: Spark
Issue Type: Bug
Affects Versions: 1.5.0
Environment: CentOS 6.5
Maven 3.3.3
Hadoop 2.6.0
Spark 1.5.0
Reporter: Kevin Tsai

Hi, after upgrading to v1.5.0 and trying to build it, it shows:
[ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'
It was working on Spark 1.4.1.
Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
Hope it helps. Regards, Kevin
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser
[ https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743500#comment-14743500 ] Apache Spark commented on SPARK-7012: - User 'sabhyankar' has created a pull request for this issue: https://github.com/apache/spark/pull/8746 > Add support for NOT NULL modifier for column definitions on DDLParser > - > > Key: SPARK-7012 > URL: https://issues.apache.org/jira/browse/SPARK-7012 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Santiago M. Mola >Priority: Minor > Labels: easyfix > > Add support for NOT NULL modifier for column definitions on DDLParser. This > would add support for the following syntax: > CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser
[ https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7012: --- Assignee: Apache Spark > Add support for NOT NULL modifier for column definitions on DDLParser > - > > Key: SPARK-7012 > URL: https://issues.apache.org/jira/browse/SPARK-7012 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Santiago M. Mola >Assignee: Apache Spark >Priority: Minor > Labels: easyfix > > Add support for NOT NULL modifier for column definitions on DDLParser. This > would add support for the following syntax: > CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser
[ https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7012: --- Assignee: (was: Apache Spark) > Add support for NOT NULL modifier for column definitions on DDLParser > - > > Key: SPARK-7012 > URL: https://issues.apache.org/jira/browse/SPARK-7012 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Santiago M. Mola >Priority: Minor > Labels: easyfix > > Add support for NOT NULL modifier for column definitions on DDLParser. This > would add support for the following syntax: > CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10590) Spark with YARN build is broken
[ https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Tsai updated SPARK-10590: --- Environment: CentOS 6.5 Oracle JDK 1.7.0_75 Maven 3.3.3 Hadoop 2.6.0 Spark 1.5.0 was: CentOS 6.5 Maven 3.3.3 Hadoop 2.6.0 Spark 1.5.0 > Spark with YARN build is broken > --- > > Key: SPARK-10590 > URL: https://issues.apache.org/jira/browse/SPARK-10590 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 > Environment: CentOS 6.5 > Oracle JDK 1.7.0_75 > Maven 3.3.3 > Hadoop 2.6.0 > Spark 1.5.0 >Reporter: Kevin Tsai > > Hi, After upgrade to v1.5.0 and trying to build it. > It shows: > [ERROR] missing or invalid dependency detected while loading class file > 'WebUI.class' > It was working on Spark 1.4.1 > Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive > -Phive-thriftserver -Dscala-2.11 -DskipTests clean package > Hope it helps. > Regards, > Kevin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping
[ https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743777#comment-14743777 ] Apache Spark commented on SPARK-10458: -- User 'kmadhugit' has created a pull request for this issue: https://github.com/apache/spark/pull/8749 > Would like to know if a given Spark Context is stopped or currently stopping > > > Key: SPARK-10458 > URL: https://issues.apache.org/jira/browse/SPARK-10458 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matt Cheah >Priority: Minor > > I ran into a case where a thread stopped a Spark Context, specifically when I > hit the "kill" link from the Spark standalone UI. There was no real way for > another thread to know that the context had stopped and thus should have > handled that accordingly. > Checking that the SparkEnv is null is one way, but that doesn't handle the > case where the context is in the midst of stopping, and stopping the context > may actually not be instantaneous - in my case for some reason the > DAGScheduler was taking a non-trivial amount of time to stop. > Implementation wise I'm more or less requesting the boolean value returned > from SparkContext.stopped.get() to be visible in some way. As long as we > return the value and not the Atomic Boolean itself (we wouldn't want anyone > to be setting this, after all!) it would help client applications check the > context's liveliness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
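A minimal sketch of the requested accessor inside SparkContext; the actual PR (#8749) may expose it differently:
{code}
// Returns the flag's current value without leaking the AtomicBoolean itself,
// so callers can poll liveness but can never flip the bit.
def isStopped: Boolean = stopped.get()
{code}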
[jira] [Assigned] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping
[ https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10458: Assignee: Apache Spark > Would like to know if a given Spark Context is stopped or currently stopping > > > Key: SPARK-10458 > URL: https://issues.apache.org/jira/browse/SPARK-10458 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matt Cheah >Assignee: Apache Spark >Priority: Minor > > I ran into a case where a thread stopped a Spark Context, specifically when I > hit the "kill" link from the Spark standalone UI. There was no real way for > another thread to know that the context had stopped and thus should have > handled that accordingly. > Checking that the SparkEnv is null is one way, but that doesn't handle the > case where the context is in the midst of stopping, and stopping the context > may actually not be instantaneous - in my case for some reason the > DAGScheduler was taking a non-trivial amount of time to stop. > Implementation wise I'm more or less requesting the boolean value returned > from SparkContext.stopped.get() to be visible in some way. As long as we > return the value and not the Atomic Boolean itself (we wouldn't want anyone > to be setting this, after all!) it would help client applications check the > context's liveliness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping
[ https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10458: Assignee: (was: Apache Spark) > Would like to know if a given Spark Context is stopped or currently stopping > > > Key: SPARK-10458 > URL: https://issues.apache.org/jira/browse/SPARK-10458 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matt Cheah >Priority: Minor > > I ran into a case where a thread stopped a Spark Context, specifically when I > hit the "kill" link from the Spark standalone UI. There was no real way for > another thread to know that the context had stopped and thus should have > handled that accordingly. > Checking that the SparkEnv is null is one way, but that doesn't handle the > case where the context is in the midst of stopping, and stopping the context > may actually not be instantaneous - in my case for some reason the > DAGScheduler was taking a non-trivial amount of time to stop. > Implementation wise I'm more or less requesting the boolean value returned > from SparkContext.stopped.get() to be visible in some way. As long as we > return the value and not the Atomic Boolean itself (we wouldn't want anyone > to be setting this, after all!) it would help client applications check the > context's liveliness. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10550) SQLListener error constructing extended SQLContext
[ https://issues.apache.org/jira/browse/SPARK-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743642#comment-14743642 ] Sean Owen commented on SPARK-10550: --- It's marked {{protected[sql]}}, which means it is not accessible outside {{org.apache.spark.sql}}. It can't be an API as such, not even 'experimental'. You're kind of at your own risk if you're trying to access things like this, as they may change from version to version. (It ends up being merely "protected" in the bytecode, since the JVM has no similar notion of "protected with respect to a package", though.) This is why I'm not sure this can be considered a 'bug', as I understand what you're trying to do.
> SQLListener error constructing extended SQLContext
> ---
>
> Key: SPARK-10550
> URL: https://issues.apache.org/jira/browse/SPARK-10550
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.5.0
>Reporter: shao lo
>Priority: Minor
>
> With Spark 1.4.1 I was able to create a custom SQLContext class. With Spark
> 1.5.0, I now get an error calling the superclass constructor. The problem
> is related to this new code that was added between 1.4.1 and 1.5.0:
> {code}
> // `listener` should be only used in the driver
> @transient private[sql] val listener = new SQLListener(this)
> sparkContext.addSparkListener(listener)
> sparkContext.ui.foreach(new SQLTab(this, _))
> {code}
> ...which generates:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.spark.sql.execution.ui.SQLListener.<init>(SQLListener.scala:34)
> at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
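To illustrate the visibility point with a hypothetical member:
{code}
package org.apache.spark.sql {
  class Internal {
    protected[sql] val listener = "driver-only" // plain `protected` in bytecode
  }
}
// Scala code outside org.apache.spark.sql cannot reference `listener`: the Scala
// compiler rejects the access, even though the JVM itself only sees an ordinary
// `protected` member with no package qualification.
{code}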
[jira] [Commented] (SPARK-10590) Spark with YARN build is broken
[ https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743666#comment-14743666 ] Sean Owen commented on SPARK-10590: --- Did you run the script to set up the build for Scala 2.11 first? Otherwise this probably won't work, indeed.
> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
> Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, After upgrade to v1.5.0 and trying to build it.
> It shows:
> [ERROR] missing or invalid dependency detected while loading class file
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
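For Spark 1.5, the setup script being referred to is, to the best of my knowledge (older releases named it {{dev/change-version-to-2.11.sh}}):
{noformat}
./dev/change-scala-version.sh 2.11
{noformat}
After running it, re-run the same mvn command with {{-Dscala-2.11}}.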
[jira] [Commented] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743770#comment-14743770 ] Yin Huai commented on SPARK-10588: -- This is expected behavior. When we write a row out, we skip null values, which is pretty useful for saving space when writing sparse data to JSON. One possible way to address this issue is to write null values only for the first row generated by a writer.
> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> +----+
> |  c0|
> +----+
> |null|
> +----+
> // Two rows, one containing a single null double, one containing non-null double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> +----+
> |  c0|
> +----+
> |null|
> | 1.0|
> +----+
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
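A hedged sketch of the fix Yin suggests above; the names here ({{writeRow}}, the double-only value writer) are illustrative, not the actual JacksonGenerator internals:
{code}
import com.fasterxml.jackson.core.JsonGenerator
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

var firstRecord = true

def writeRow(gen: JsonGenerator, schema: StructType, row: Row): Unit = {
  gen.writeStartObject()
  schema.fields.zipWithIndex.foreach { case (field, i) =>
    if (!row.isNullAt(i)) {
      gen.writeFieldName(field.name)
      gen.writeNumber(row.getDouble(i)) // doubles only, for brevity
    } else if (firstRecord) {
      gen.writeNullField(field.name)    // keep the column recoverable on read
    }
  }
  gen.writeEndObject()
  firstRecord = false
}
{code}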
[jira] [Updated] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10588: - Priority: Minor (was: Major)
> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> +----+
> |  c0|
> +----+
> |null|
> +----+
> // Two rows, one containing a single null double, one containing non-null double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> +----+
> |  c0|
> +----+
> |null|
> | 1.0|
> +----+
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10585) only copy data once when generate unsafe projection
[ https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743677#comment-14743677 ] Apache Spark commented on SPARK-10585: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/8747 > only copy data once when generate unsafe projection > --- > > Key: SPARK-10585 > URL: https://issues.apache.org/jira/browse/SPARK-10585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > When we have nested struct, array or map, we will create a byte buffer for > each of them, and copy data to the buffer first, then copy them to the final > row buffer. We can save the first copy and directly copy data to final row > buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10585) only copy data once when generate unsafe projection
[ https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10585: Assignee: Apache Spark > only copy data once when generate unsafe projection > --- > > Key: SPARK-10585 > URL: https://issues.apache.org/jira/browse/SPARK-10585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > > When we have nested struct, array or map, we will create a byte buffer for > each of them, and copy data to the buffer first, then copy them to the final > row buffer. We can save the first copy and directly copy data to final row > buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10585) only copy data once when generate unsafe projection
[ https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10585: Assignee: (was: Apache Spark) > only copy data once when generate unsafe projection > --- > > Key: SPARK-10585 > URL: https://issues.apache.org/jira/browse/SPARK-10585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > When we have nested struct, array or map, we will create a byte buffer for > each of them, and copy data to the buffer first, then copy them to the final > row buffer. We can save the first copy and directly copy data to final row > buffer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743759#comment-14743759 ] Shivaram Venkataraman commented on SPARK-9325: -- Thanks [~felixcheung] for investigating this. I see the problem: we need a handle to the DataFrame in order to be able to collect a column. I can think of a couple of ways to solve this: One is to save an optional handle to the DataFrame on the R side; then, if the handle is available, we will support collect (i.e. if the column was created using some other method, say col("name"), then we won't support collect). The other is to add a method on the Scala side which can return the data frame handle or do the selection for us if the column is resolved -- [~davies] or [~rxin] might be able to comment more on this.
> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6417) Add Linear Programming algorithm
[ https://issues.apache.org/jira/browse/SPARK-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743896#comment-14743896 ] Ehsan Mohyedin Kermani commented on SPARK-6417: --- Thank you Joseph for the advice! I have started with the starter kit and fixed some annotations to get a sense of contributing to Spark. I am going to work on the LP implementations and perhaps submit it as a package. Regards
> Add Linear Programming algorithm
> -
>
> Key: SPARK-6417
> URL: https://issues.apache.org/jira/browse/SPARK-6417
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
>Reporter: Fan Jiang
> Labels: features
>
> Linear programming is the problem of finding a vector x that minimizes a
> linear function f^T x subject to linear constraints:
> min_x f^T x
> such that one or more of the following hold: A·x ≤ b, Aeq·x = beq, l ≤ x ≤ u.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
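For readers, the quoted formulation in standard LaTeX notation:
{noformat}
\min_x f^T x \quad \text{such that} \quad A x \le b, \quad A_{eq} x = b_{eq}, \quad l \le x \le u
{noformat}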
[jira] [Commented] (SPARK-10590) Spark with YARN build is broken
[ https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743813#comment-14743813 ] Kevin Tsai commented on SPARK-10590: Hi Sean, the result is the same as before when I build after installing Scala 2.11.7. Here is the result:
...
[ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'. Could not access term jetty in value org.eclipse, because it (or its dependencies) are missing. Check your build definition for missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.) A full rebuild may help if 'WebUI.class' was compiled against an incompatible version of org.eclipse.
[WARNING] 22 warnings found
[ERROR] two errors found.
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 7.360 s]
[INFO] Spark Project Core . SUCCESS [05:41 min]
[INFO] Spark Project Bagel SUCCESS [ 40.951 s]
[INFO] Spark Project GraphX ... SUCCESS [01:41 min]
[INFO] Spark Project ML Library ... SUCCESS [04:05 min]
[INFO] Spark Project Tools SUCCESS [ 20.053 s]
[INFO] Spark Project Networking ... SUCCESS [ 10.914 s]
[INFO] Spark Project Shuffle Streaming Service SUCCESS [ 6.852 s]
[INFO] Spark Project Streaming SUCCESS [02:38 min]
[INFO] Spark Project Catalyst . SUCCESS [03:16 min]
[INFO] Spark Project SQL .. FAILURE [01:22 min]
> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
> Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, After upgrade to v1.5.0 and trying to build it.
> It shows:
> [ERROR] missing or invalid dependency detected while loading class file
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10579) Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in Statistics , e.g. for columns
[ https://issues.apache.org/jira/browse/SPARK-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743819#comment-14743819 ] Joseph K. Bradley commented on SPARK-10579: --- A lot of this functionality is being added to DataFrames instead. I'd recommend examining what DataFrames provides (and what JIRAs are there) & opening up JIRAs as needed for each function you're interested in. I'll close this for now but will keep watching. Thanks!
> Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in
> Statistics , e.g. for columns
> -
>
> Key: SPARK-10579
> URL: https://issues.apache.org/jira/browse/SPARK-10579
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
>Reporter: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> Hi everyone,
> I think it would be good to extend statistical functions in the mllib package, by
> adding Cardinality/Quantiles/Quartiles/Median for the columns, as many other
> ml and statistical libraries already have it. I couldn't find it in the mllib
> package, hence would like to suggest it.
> Since this is my first time working with jira, I'd truly appreciate if
> someone could review this and let me know what you think.
> Also, I'd really like to work on it and am looking forward to hearing from you!
> Thanks,
> Narine
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
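Some of the DataFrame-side equivalents that already exist as of 1.5, for anyone mapping these requests onto JIRAs ({{df}} is any DataFrame with the named numeric columns):
{code}
df.describe("age").show()      // count, mean, stddev, min, max
df.stat.freqItems(Seq("age"))  // approximate frequent items
df.stat.corr("age", "height")  // Pearson correlation of two columns
{code}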
[jira] [Closed] (SPARK-10579) Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in Statistics , e.g. for columns
[ https://issues.apache.org/jira/browse/SPARK-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-10579. - Resolution: Won't Fix > Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in > Statistics , e.g. for columns > - > > Key: SPARK-10579 > URL: https://issues.apache.org/jira/browse/SPARK-10579 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Narine Kokhlikyan >Priority: Minor > Fix For: 1.6.0 > > Original Estimate: 120h > Remaining Estimate: 120h > > Hi everyone, > I think it would be good to extend statistical functions in mllib package, by > adding Cardinality/Quantiles/Quartiles/Median for the columns, as many other > ml and statistical libraries already have it. I couldn't find it in mllib > package, hence would like to suggest it. > Since this is my first time working with jira, I'd truly appreciate if > someone could review this and let me know what do you think. > Also, I'd really like to work on it and looking forward to hearing from you! > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743840#comment-14743840 ] Apache Spark commented on SPARK-10588: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/8750
> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> +----+
> |  c0|
> +----+
> |null|
> +----+
> // Two rows, one containing a single null double, one containing non-null double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> +----+
> |  c0|
> +----+
> |null|
> | 1.0|
> +----+
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10588: Assignee: (was: Apache Spark)
> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> +----+
> |  c0|
> +----+
> |null|
> +----+
> // Two rows, one containing a single null double, one containing non-null double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> +----+
> |  c0|
> +----+
> |null|
> | 1.0|
> +----+
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10588: Assignee: Apache Spark > Saving a DataFrame containing only nulls to JSON doesn't work > - > > Key: SPARK-10588 > URL: https://issues.apache.org/jira/browse/SPARK-10588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Minor > > Snippets to reproduce this issue: > {noformat} > val path = "file:///tmp/spark/null" > // A single row containing a single null double, saving to JSON, wrong > sqlContext. > range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0"). > write.mode("overwrite").json(path) > sqlContext.read.json(path).show() > ++ > || > ++ > || > ++ > // Two rows each containing a single null double, saving to JSON, wrong > sqlContext. > range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0"). > write.mode("overwrite").json(path) > sqlContext.read.json(path).show() > ++ > || > ++ > || > || > ++ > // A single row containing two null doubles, saving to JSON, wrong > sqlContext. > range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS > c1"). > write.mode("overwrite").json(path) > sqlContext.read.json(path).show() > ++ > || > ++ > || > ++ > // A single row containing a single null double, saving to Parquet, OK > sqlContext. > range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0"). > write.mode("overwrite").parquet(path) > sqlContext.read.parquet(path).show() > ++ > | d| > ++ > |null| > ++ > // Two rows, one containing a single null double, one containing non-null > double, saving to JSON, OK > sqlContext. > range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0"). > write.mode("overwrite").json(path) > sqlContext.read.json(path).show() > ++ > | d| > ++ > |null| > | 1.0| > ++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
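The likely mechanism behind SPARK-10588: the JSON writer emits nothing for null fields, so a row whose columns are all null serializes as an empty object, and schema inference on read then recovers no columns at all. A minimal workaround sketch, assuming the Spark 1.5 {{DataFrameReader}} API and reusing {{sqlContext}} and {{path}} from the snippet above — supplying the schema explicitly sidesteps inference:

{code}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Workaround sketch (assumed 1.5 API, not part of the fix itself): reading
// back with an explicit schema means schema inference never has to guess
// from the empty objects ({}) the writer produced, so the null column and
// the row count both survive the round trip.
val schema = StructType(StructField("c0", DoubleType, nullable = true) :: Nil)
sqlContext.read.schema(schema).json(path).show()
{code}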
[jira] [Created] (SPARK-10591) False negative in QueryTest.checkAnswer
Cheng Lian created SPARK-10591: -- Summary: False negative in QueryTest.checkAnswer Key: SPARK-10591 URL: https://issues.apache.org/jira/browse/SPARK-10591 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.5.0, 1.4.1, 1.3.1, 1.2.2, 1.1.1, 1.0.2 Reporter: Cheng Lian # For double and float, `NaN == NaN` is always `false` # `checkAnswer` doesn't handle `Map` properly. For example: {noformat} scala> Map(1 -> 2, 2 -> 1).toString res0: String = Map(1 -> 2, 2 -> 1) scala> Map(2 -> 1, 1 -> 2).toString res1: String = Map(2 -> 1, 1 -> 2) {noformat} We can't rely on `toString` to compare `Map` instances. Need to update `checkAnswer` to special case `NaN` and `Map`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
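Both failure modes come from comparing answers through {{toString}}. A minimal sketch of a structural comparison that special-cases {{NaN}} and compares {{Map}} contents key-by-key (a hypothetical helper, not the actual {{checkAnswer}} patch):

{code}
// Illustrative helper: compare values structurally instead of via toString.
// NaN == NaN is false for Double/Float, and Map's toString depends on
// internal ordering, so both need special cases.
def sameValue(a: Any, b: Any): Boolean = (a, b) match {
  case (x: Double, y: Double) => (x.isNaN && y.isNaN) || x == y
  case (x: Float, y: Float)   => (x.isNaN && y.isNaN) || x == y
  case (x: Map[_, _], y: Map[_, _]) =>  // order-insensitive, key-by-key
    x.size == y.size && x.asInstanceOf[Map[Any, Any]].forall { case (k, v) =>
      y.asInstanceOf[Map[Any, Any]].get(k).exists(sameValue(v, _))
    }
  case _ => a == b
}
{code}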
[jira] [Commented] (SPARK-10573) IndexToString transformSchema adds output field as DoubleType
[ https://issues.apache.org/jira/browse/SPARK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743825#comment-14743825 ] Joseph K. Bradley commented on SPARK-10573: --- I think your assessment is correct. Would you mind sending a PR? Thanks! > IndexToString transformSchema adds output field as DoubleType > - > > Key: SPARK-10573 > URL: https://issues.apache.org/jira/browse/SPARK-10573 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.0 >Reporter: Nick Pritchard > > Reproducible example: > {code} > val stage = new IndexToString().setInputCol("input").setOutputCol("output") > val inSchema = StructType(Seq(StructField("input", DoubleType))) > val outSchema = stage.transformSchema(inSchema) > assert(outSchema("output").dataType == StringType) > {code} > The root cause seems to be that it uses {{NominalAttribute.toStructField}} > which assumes {{DoubleType}}. It would probably be better to just use > {{SchemaUtils.appendColumn}} and explicitly set the data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
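A minimal sketch of the direction the description suggests: build the appended field with an explicit {{StringType}} rather than via {{NominalAttribute.toStructField}}. The helper below is illustrative only and mirrors what {{SchemaUtils.appendColumn}} does:

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative sketch of the suggested fix: append the output column with
// an explicit data type, instead of NominalAttribute.toStructField, which
// hard-codes DoubleType for the new field.
def appendStringColumn(schema: StructType, colName: String): StructType = {
  require(!schema.fieldNames.contains(colName), s"Column $colName already exists.")
  StructType(schema.fields :+ StructField(colName, StringType, nullable = true))
}
{code}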
[jira] [Resolved] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column
[ https://issues.apache.org/jira/browse/SPARK-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10578. --- Resolution: Fixed Assignee: Joseph K. Bradley Fix Version/s: 1.5.0 [~viirya] Yep, thanks for pointing out the right link! > pyspark.ml.classification.RandomForestClassifer does not return > `rawPrediction` column > -- > > Key: SPARK-10578 > URL: https://issues.apache.org/jira/browse/SPARK-10578 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.4.0, 1.4.1 > Environment: CentOS, PySpark 1.4.1, Scala 2.10 >Reporter: Karen Yin-Yee Ng >Assignee: Joseph K. Bradley > Fix For: 1.5.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > To use `pyspark.ml.classification.RandomForestClassifer` with > `BinaryClassificationEvaluator`, a column called `rawPrediction` needs to be > returned by the `RandomForestClassifer`. > The PySpark documentation example of `logisticsRegression`outputs the > `rawPrediction` column but not `RandomForestClassifier`. > Therefore, one is unable to use `RandomForestClassifier` with the evaluator > nor put it in a pipeline with cross validation. > A relevant piece of code showing how to reproduce the bug can be found at: > https://gist.github.com/karenyyng/cf61ae655b032f754bfb > A relevant post due to this possible bug can also be found at: > http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
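For context, the usage pattern that requires the column — shown in Scala for illustration, since the 1.4.x gap was only in the PySpark wrapper; {{training}} and {{test}} are assumed DataFrames with {{label}} and {{features}} columns:

{code}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// BinaryClassificationEvaluator reads the rawPrediction column by default,
// so the classifier must emit it for evaluation (or cross-validation) to
// work. `training` and `test` are assumed DataFrames, not defined here.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = rf.fit(training)
val predictions = model.transform(test)
val auc = new BinaryClassificationEvaluator()
  .setRawPredictionCol("rawPrediction")
  .evaluate(predictions)
{code}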
[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743834#comment-14743834 ] Joseph K. Bradley commented on SPARK-10574: --- I agree that switching to MurmurHash3 is a good idea. As far as backwards compatibility, I feel like the best thing we can do is to provide a new parameter which lets the user choose the hashing method. I would vote for having it default to MurmurHash3, with an option to switch to the old hashing method (but with proper warnings). We have not really made promises about backwards compatibility for HashingTF, but we will need to start making such promises after adding save/load for Pipelines. We can include a release note about this change. > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Priority: Critical > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
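A minimal sketch of the proposed switch, assuming HashingTF-style bucketing into {{numFeatures}} (the {{nonNegativeMod}} helper name is illustrative):

{code}
import scala.util.hashing.MurmurHash3

// Contrast of the two hash choices for term bucketing. The helper mirrors
// the non-negative modulo that HashingTF applies to pick a feature index.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numFeatures = 1 << 20
val term = "some feature string"
val nativeBucket = nonNegativeMod(term.##, numFeatures)  // platform-specific per Scala docs
val murmurBucket = nonNegativeMod(MurmurHash3.stringHash(term), numFeatures)  // stable across platforms
{code}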
[jira] [Commented] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column
[ https://issues.apache.org/jira/browse/SPARK-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743831#comment-14743831 ] Karen Yin-Yee Ng commented on SPARK-10578: -- Thanks [~josephkb] and [~viirya] for the quick response. > pyspark.ml.classification.RandomForestClassifer does not return > `rawPrediction` column > -- > > Key: SPARK-10578 > URL: https://issues.apache.org/jira/browse/SPARK-10578 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.4.0, 1.4.1 > Environment: CentOS, PySpark 1.4.1, Scala 2.10 >Reporter: Karen Yin-Yee Ng >Assignee: Joseph K. Bradley > Fix For: 1.5.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > To use `pyspark.ml.classification.RandomForestClassifer` with > `BinaryClassificationEvaluator`, a column called `rawPrediction` needs to be > returned by the `RandomForestClassifer`. > The PySpark documentation example of `logisticsRegression`outputs the > `rawPrediction` column but not `RandomForestClassifier`. > Therefore, one is unable to use `RandomForestClassifier` with the evaluator > nor put it in a pipeline with cross validation. > A relevant piece of code showing how to reproduce the bug can be found at: > https://gist.github.com/karenyyng/cf61ae655b032f754bfb > A relevant post due to this possible bug can also be found at: > http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance
[ https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-10599: Description: The BlockMatrix multiply sends each block to all the corresponding columns of the right BlockMatrix, even though there might not be any corresponding block to multiply with. Some optimizations we can perform are: - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled - Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition > Decrease communication in BlockMatrix multiply and increase performance > --- > > Key: SPARK-10599 > URL: https://issues.apache.org/jira/browse/SPARK-10599 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz > > The BlockMatrix multiply sends each block to all the corresponding columns of > the right BlockMatrix, even though there might not be any corresponding block > to multiply with. > Some optimizations we can perform are: > - Simulate the multiplication on the driver, and figure out which blocks > actually need to be shuffled > - Send the block once to a partition, and join inside the partition rather > than sending multiple copies to the same partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
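A minimal sketch of the first optimization, working purely on block indices on the driver (all names are illustrative, not {{BlockMatrix}} internals):

{code}
// "Simulate the multiplication on the driver": using only block coordinates
// (i, k) of the left matrix and (k, j) of the right matrix, find which pairs
// actually have a matching inner index k. Left blocks with no partner never
// need to be shuffled at all.
def neededProducts(
    leftBlocks: Seq[(Int, Int)],   // (rowBlockIndex, colBlockIndex) of A
    rightBlocks: Seq[(Int, Int)]   // (rowBlockIndex, colBlockIndex) of B
  ): Seq[((Int, Int), (Int, Int))] = {
  val rightByRow = rightBlocks.groupBy(_._1)  // inner index k -> B blocks (k, j)
  for {
    (i, k) <- leftBlocks
    (_, j) <- rightByRow.getOrElse(k, Seq.empty)
  } yield ((i, k), (k, j))  // only these pairs must meet in a partition
}
{code}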
[jira] [Created] (SPARK-10600) SparkSQL - Support for Not Exists in a Correlated Subquery
Richard Garris created SPARK-10600: -- Summary: SparkSQL - Support for Not Exists in a Correlated Subquery Key: SPARK-10600 URL: https://issues.apache.org/jira/browse/SPARK-10600 Project: Spark Issue Type: Improvement Reporter: Richard Garris Spark SQL currently does not support NOT EXISTS clauses (e.g. SELECT * FROM TABLE_A WHERE NOT EXISTS ( SELECT 1 FROM TABLE_B where TABLE_B.id = TABLE_A.id) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
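Until such support lands, the usual rewrite is an anti-join. A workaround sketch in 1.5-era SQL, assuming {{TABLE_A}} and {{TABLE_B}} are registered temporary tables:

{code}
// Workaround sketch: NOT EXISTS rewritten as a left outer join followed by
// an IS NULL filter, which 1.5-era Spark SQL does support.
val rewritten = sqlContext.sql("""
  SELECT a.*
  FROM TABLE_A a
  LEFT OUTER JOIN TABLE_B b ON b.id = a.id
  WHERE b.id IS NULL
""")
{code}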
[jira] [Updated] (SPARK-10597) MultivariateOnlineSummarizer for weighted instances
[ https://issues.apache.org/jira/browse/SPARK-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-10597: Description: MultivariateOnlineSummarizer for weighted instances is implemented as private API for SPARK-7685. In SPARK-7685, the online numerical stable version of unbiased estimation of variance defined by the reliability weights: [[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] is implemented, but we would like to make it as public api since there are different use-cases. Currently, `count` will return the actual number of instances, and ignores instance weights, but `numNonzeros` will return the weighted # of nonzeros. We need to decide the behavior of them before making it public. was: MultivariateOnlineSummarizer for weighted instances is implemented as private API for #SPARK-7685. In #SPARK-7685, the online numerical stable version of unbiased estimation of variance defined by the reliability weights: [[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] is implemented, but we would like to make it as public api since there are different use-cases. Currently, `count` will return the actual number of instances, and ignores instance weights, but `numNonzeros` will return the weighted # of nonzeros. We need to decide the behavior of them before making it public. > MultivariateOnlineSummarizer for weighted instances > --- > > Key: SPARK-10597 > URL: https://issues.apache.org/jira/browse/SPARK-10597 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.0 >Reporter: DB Tsai > > MultivariateOnlineSummarizer for weighted instances is implemented as private > API for SPARK-7685. > In SPARK-7685, the online numerical stable version of unbiased estimation of > variance defined by the reliability weights: > [[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] > is implemented, but we would like to make it as public api since there are > different use-cases. > Currently, `count` will return the actual number of instances, and ignores > instance weights, but `numNonzeros` will return the weighted # of nonzeros. > We need to decide the behavior of them before making it public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
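For reference, the reliability-weights estimator linked above has a compact one-pass form. A minimal single-variable sketch (West-style update; names are illustrative, not the private MultivariateOnlineSummarizer API):

{code}
// One-pass, numerically stable weighted mean/variance in the West style.
// The variance below is the unbiased estimate under reliability weights.
class WeightedVariance {
  private var weightSum = 0.0        // sum of w_i
  private var weightSquareSum = 0.0  // sum of w_i^2
  private var mean = 0.0
  private var m2 = 0.0               // sum of w_i * (x_i - mean)^2

  def add(x: Double, w: Double): this.type = {
    weightSum += w
    weightSquareSum += w * w
    val delta = x - mean
    mean += (w / weightSum) * delta
    m2 += w * delta * (x - mean)
    this
  }

  def variance: Double = m2 / (weightSum - weightSquareSum / weightSum)
}
{code}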
[jira] [Created] (SPARK-10597) MultivariateOnlineSummarizer for weighted instances
DB Tsai created SPARK-10597: --- Summary: MultivariateOnlineSummarizer for weighted instances Key: SPARK-10597 URL: https://issues.apache.org/jira/browse/SPARK-10597 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.5.0 Reporter: DB Tsai MultivariateOnlineSummarizer for weighted instances is implemented as private API for #SPARK-7685. In #SPARK-7685, the online numerical stable version of unbiased estimation of variance defined by the reliability weights: [[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] is implemented, but we would like to make it as public api since there are different use-cases. Currently, `count` will return the actual number of instances, and ignores instance weights, but `numNonzeros` will return the weighted # of nonzeros. We need to decide the behavior of them before making it public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option
[ https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10594: Assignee: (was: Apache Spark) > ApplicationMaster "--help" references the removed "--num-executors" option > -- > > Key: SPARK-10594 > URL: https://issues.apache.org/jira/browse/SPARK-10594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Erick Tryzelaar >Priority: Trivial > Attachments: > 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, > 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch > > > The issue SPARK-9092 and commit > [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79] > removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, > but it's help message still references the > [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10593) sql lateral view same name gives wrong value
[ https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-10593: -- Assignee: Davies Liu > sql lateral view same name gives wrong value > > > Key: SPARK-10593 > URL: https://issues.apache.org/jira/browse/SPARK-10593 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > This query will return wrong result: > {code} > select > insideLayer1.json as json_insideLayer1, > insideLayer2.json as json_insideLayer2 > from (select '1' id) creatives > lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', > 'layer1') insideLayer1 as json > lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json > {code} > It got > {code} > ( {"layer2": "text inside layer 2"}, {"layer2": "text inside layer 2"}) > {code} > instead of > {code} > ( {"layer2": "text inside layer 2"}, text inside layer 2) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10598: -- Description: (was: (Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please -- a number of these fields weren't quite right)) (Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please -- a number of these fields weren't quite right) > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.1, 1.5.0 >Reporter: Robin East >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10598: -- Affects Version/s: (was: 1.4.0) Target Version/s: (was: 1.5.0) Priority: Trivial (was: Minor) Fix Version/s: (was: 1.5.1) Description: (Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please -- a number of these fields weren't quite right) > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.1, 1.5.0 >Reporter: Robin East >Priority: Trivial > > (Have a look at > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > please -- a number of these fields weren't quite right) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744170#comment-14744170 ] Reynold Xin commented on SPARK-9325: Do you want to support collect(df$Age + 1)? > Support `collect` on DataFrame columns > -- > > Key: SPARK-9325 > URL: https://issues.apache.org/jira/browse/SPARK-9325 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > This is to support code of the form > ``` > ages <- collect(df$Age) > ``` > Right now `df$Age` returns a Column, which has no functions supported. > Similarly we might consider supporting `head(df$Age)` etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10563) SparkContext's local properties should be cloned when inherited
[ https://issues.apache.org/jira/browse/SPARK-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10563: -- Target Version/s: 1.6.0, 1.5.1 (was: 1.6.0) > SparkContext's local properties should be cloned when inherited > --- > > Key: SPARK-10563 > URL: https://issues.apache.org/jira/browse/SPARK-10563 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Currently, when a child thread inherits local properties from the parent > thread, it gets a reference of the parent's local properties and uses them as > default values. > The problem, however, is that when the parent changes the value of an > existing property, the changes are reflected in the child thread! This has > very confusing semantics, especially in streaming. > {code} > private val localProperties = new InheritableThreadLocal[Properties] { > override protected def childValue(parent: Properties): Properties = new > Properties(parent) > override protected def initialValue(): Properties = new Properties() > } > {code} > Instead, we should make a clone of the parent properties rather than passing > in a mutable reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
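A minimal sketch of the cloning direction described, taking a snapshot copy instead of chaining a live reference as defaults (the actual patch may differ in detail):

{code}
import java.util.Properties
import scala.collection.JavaConverters._

// Snapshot copy: unlike `new Properties(parent)`, which keeps a live
// reference to the parent as its defaults, this copies the entries once,
// so later mutations of `parent` are not visible to the child thread.
def snapshot(parent: Properties): Properties = {
  val cloned = new Properties()
  parent.stringPropertyNames().asScala.foreach { key =>
    cloned.setProperty(key, parent.getProperty(key))
  }
  cloned
}
{code}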
[jira] [Created] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance
Burak Yavuz created SPARK-10599: --- Summary: Decrease communication in BlockMatrix multiply and increase performance Key: SPARK-10599 URL: https://issues.apache.org/jira/browse/SPARK-10599 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance
[ https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10599: Assignee: Apache Spark > Decrease communication in BlockMatrix multiply and increase performance > --- > > Key: SPARK-10599 > URL: https://issues.apache.org/jira/browse/SPARK-10599 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz >Assignee: Apache Spark > > The BlockMatrix multiply sends each block to all the corresponding columns of > the right BlockMatrix, even though there might not be any corresponding block > to multiply with. > Some optimizations we can perform are: > - Simulate the multiplication on the driver, and figure out which blocks > actually need to be shuffled > - Send the block once to a partition, and join inside the partition rather > than sending multiple copies to the same partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10598: -- Assignee: Robin East > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.1, 1.5.0 >Reporter: Robin East >Assignee: Robin East >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6981) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6981. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 6356 [https://github.com/apache/spark/pull/6356] > [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext > > > Key: SPARK-6981 > URL: https://issues.apache.org/jira/browse/SPARK-6981 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0, 1.4.0 >Reporter: Edoardo Vacchi >Priority: Minor > Fix For: 1.6.0 > > > In order to simplify extensibility with new strategies from third-parties, it > should be better to factor SparkPlanner and QueryExecution in their own > classes. Dependent types add additional, unnecessary complexity; besides, > HiveContext would benefit from this change as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7040) Explore receiver-less DStream for Flume
[ https://issues.apache.org/jira/browse/SPARK-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744162#comment-14744162 ] Tathagata Das commented on SPARK-7040: -- I am not sure how a Direct API can be built for Flume, as Flume does not have any offsets or sequence numbers (correct me if I am wrong about this) to refer to the exact ranges of records / events. I am closing this JIRA for now; please reopen it if you think this is still relevant. > Explore receiver-less DStream for Flume > --- > > Key: SPARK-7040 > URL: https://issues.apache.org/jira/browse/SPARK-7040 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Hari Shreedharan > > I am thinking about repurposing the FlumePollingInputDStream to make it more > parallel and pull in data like the DirectKafkaDStream. Since Flume does not > have a unique way of identifying specific event offsets, this will not be > once-only. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10593) sql lateral view same name gives wrong value
[ https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10593: Assignee: Apache Spark > sql lateral view same name gives wrong value > > > Key: SPARK-10593 > URL: https://issues.apache.org/jira/browse/SPARK-10593 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > This query will return wrong result: > {code} > select > insideLayer1.json as json_insideLayer1, > insideLayer2.json as json_insideLayer2 > from (select '1' id) creatives > lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', > 'layer1') insideLayer1 as json > lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json > {code} > It got > {code} > ( {"layer2": "text inside layer 2"}, {"layer2": "text inside layer 2"}) > {code} > instead of > {code} > ( {"layer2": "text inside layer 2"}, text inside layer 2) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10593) sql lateral view same name gives wrong value
[ https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10593: Assignee: (was: Apache Spark) > sql lateral view same name gives wrong value > > > Key: SPARK-10593 > URL: https://issues.apache.org/jira/browse/SPARK-10593 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > This query will return wrong result: > {code} > select > insideLayer1.json as json_insideLayer1, > insideLayer2.json as json_insideLayer2 > from (select '1' id) creatives > lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', > 'layer1') insideLayer1 as json > lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json > {code} > It got > {code} > ( {"layer2": "text inside layer 2"}, {"layer2": "text inside layer 2"}) > {code} > instead of > {code} > ( {"layer2": "text inside layer 2"}, text inside layer 2) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10593) sql lateral view same name gives wrong value
[ https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744242#comment-14744242 ] Apache Spark commented on SPARK-10593: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8755 > sql lateral view same name gives wrong value > > > Key: SPARK-10593 > URL: https://issues.apache.org/jira/browse/SPARK-10593 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > This query will return wrong result: > {code} > select > insideLayer1.json as json_insideLayer1, > insideLayer2.json as json_insideLayer2 > from (select '1' id) creatives > lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', > 'layer1') insideLayer1 as json > lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json > {code} > It got > {code} > ( {"layer2": "text inside layer 2"}, {"layer2": "text inside layer 2"}) > {code} > instead of > {code} > ( {"layer2": "text inside layer 2"}, text inside layer 2) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
Robin East created SPARK-10598: -- Summary: RoutingTablePartition toMessage method refers to bytes instead of bits Key: SPARK-10598 URL: https://issues.apache.org/jira/browse/SPARK-10598 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.5.0, 1.4.1, 1.4.0 Reporter: Robin East Priority: Minor Fix For: 1.5.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet
[ https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10522. Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Issue resolved by pull request 8674 [https://github.com/apache/spark/pull/8674] > Nanoseconds part of Timestamp should be positive in parquet > --- > > Key: SPARK-10522 > URL: https://issues.apache.org/jira/browse/SPARK-10522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu > Fix For: 1.6.0, 1.5.1 > > > If Timestamp is before unix epoch, the nanosecond part will be negative, Hive > can't read that back correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
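Spark writes timestamps to Parquet as a Julian day plus nanoseconds within the day. A minimal sketch of the normalization idea, borrowing one day whenever the nanosecond part goes negative (names illustrative, not the exact patched code):

{code}
// Keep nanos in [0, NANOS_PER_DAY) by borrowing one Julian day, so that
// pre-epoch timestamps never carry a negative nanosecond component that
// Hive cannot read back.
val NANOS_PER_DAY = 24L * 60L * 60L * 1000L * 1000L * 1000L

def normalize(julianDay: Int, nanos: Long): (Int, Long) = {
  if (nanos < 0) (julianDay - 1, nanos + NANOS_PER_DAY)
  else (julianDay, nanos)
}
{code}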
[jira] [Updated] (SPARK-6981) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6981: - Assignee: Edoardo Vacchi > [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext > > > Key: SPARK-6981 > URL: https://issues.apache.org/jira/browse/SPARK-6981 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0, 1.4.0 >Reporter: Edoardo Vacchi >Assignee: Edoardo Vacchi >Priority: Minor > Fix For: 1.6.0 > > > In order to simplify extensibility with new strategies from third-parties, it > should be better to factor SparkPlanner and QueryExecution in their own > classes. Dependent types add additional, unnecessary complexity; besides, > HiveContext would benefit from this change as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10543) Peak Execution Memory Quantile should be Per-task Basis
[ https://issues.apache.org/jira/browse/SPARK-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10543. --- Resolution: Fixed Assignee: Sen Fang Fix Version/s: 1.5.1 1.6.0 Target Version/s: 1.6.0, 1.5.1 > Peak Execution Memory Quantile should be Per-task Basis > --- > > Key: SPARK-10543 > URL: https://issues.apache.org/jira/browse/SPARK-10543 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Sen Fang >Assignee: Sen Fang >Priority: Minor > Fix For: 1.6.0, 1.5.1 > > > Currently the Peak Execution Memory quantiles seem to be cumulative rather > than per task basis. For example, I have seen a value of 2TB in one of my > jobs on the quantile metric but each individual task shows less than 1GB on > the bottom table. > [~andrewor14] In your PR https://github.com/apache/spark/pull/7770, the > screenshot shows the Max Peak Execution Memory of 792.5KB while in the bottom > it's about 50KB per task (unless your workload is skewed) > The fix seems straightforward that we use the `update` rather than `value` > from the accumulable. I'm happy to provide a PR if people agree this is the > right behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing
[ https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744451#comment-14744451 ] Apache Spark commented on SPARK-10317: -- User 'rekhajoshm' has created a pull request for this issue: https://github.com/apache/spark/pull/8758 > start-history-server.sh CLI parsing incompatible with HistoryServer's arg > parsing > - > > Key: SPARK-10317 > URL: https://issues.apache.org/jira/browse/SPARK-10317 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Steve Loughran >Priority: Trivial > > The history server has its argument parsing class in > {{HistoryServerArguments}}. However, this doesn't get involved in the > {{start-history-server.sh}} codepath where the $0 arg is assigned to > {{spark.history.fs.logDirectory}} and all other arguments discarded (e.g > {{--property-file}}. > This stops the other options being usable from this script -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing
[ https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10317: Assignee: (was: Apache Spark) > start-history-server.sh CLI parsing incompatible with HistoryServer's arg > parsing > - > > Key: SPARK-10317 > URL: https://issues.apache.org/jira/browse/SPARK-10317 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Steve Loughran >Priority: Trivial > > The history server has its argument parsing class in > {{HistoryServerArguments}}. However, this doesn't get involved in the > {{start-history-server.sh}} codepath where the $0 arg is assigned to > {{spark.history.fs.logDirectory}} and all other arguments discarded (e.g > {{--property-file}}. > This stops the other options being usable from this script -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing
[ https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10317: Assignee: Apache Spark > start-history-server.sh CLI parsing incompatible with HistoryServer's arg > parsing > - > > Key: SPARK-10317 > URL: https://issues.apache.org/jira/browse/SPARK-10317 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Trivial > > The history server has its argument parsing class in > {{HistoryServerArguments}}. However, this doesn't get involved in the > {{start-history-server.sh}} codepath where the $0 arg is assigned to > {{spark.history.fs.logDirectory}} and all other arguments discarded (e.g > {{--property-file}}. > This stops the other options being usable from this script -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10603) Univariate statistics as UDAFs: multi-pass continuous stats
Joseph K. Bradley created SPARK-10603: - Summary: Univariate statistics as UDAFs: multi-pass continuous stats Key: SPARK-10603 URL: https://issues.apache.org/jira/browse/SPARK-10603 Project: Spark Issue Type: Sub-task Components: ML, SQL Reporter: Joseph K. Bradley See parent JIRA for more details. This subtask covers statistics for continuous values requiring multiple passes over the data, such as median and quantiles. This JIRA is an umbrella. For individual stats, please create and link a new JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10573) IndexToString transformSchema adds output field as DoubleType
[ https://issues.apache.org/jira/browse/SPARK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10573. --- Resolution: Fixed Fix Version/s: 1.6.0 > IndexToString transformSchema adds output field as DoubleType > - > > Key: SPARK-10573 > URL: https://issues.apache.org/jira/browse/SPARK-10573 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.0 >Reporter: Nick Pritchard >Assignee: Nick Pritchard > Fix For: 1.6.0 > > > Reproducible example: > {code} > val stage = new IndexToString().setInputCol("input").setOutputCol("output") > val inSchema = StructType(Seq(StructField("input", DoubleType))) > val outSchema = stage.transformSchema(inSchema) > assert(outSchema("output").dataType == StringType) > {code} > The root cause seems to be that it uses {{NominalAttribute.toStructField}} > which assumes {{DoubleType}}. It would probably be better to just use > {{SchemaUtils.appendColumn}} and explicitly set the data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744254#comment-14744254 ] Davies Liu commented on SPARK-9325: --- I would -1 on this. I'm worried that once we have collect(Column)/head(Column), users will ask for count(Column)/first(Column)/Sum(Column)/Avg(Column), then it's hard to tell which one should be in or not. Adding APIs in R is harder than Scala/Java/Python (because of namespace), we should be more careful on it. > Support `collect` on DataFrame columns > -- > > Key: SPARK-9325 > URL: https://issues.apache.org/jira/browse/SPARK-9325 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > This is to support code of the form > ``` > ages <- collect(df$Age) > ``` > Right now `df$Age` returns a Column, which has no functions supported. > Similarly we might consider supporting `head(df$Age)` etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10587) In pyspark, toDF() dosen't exsist in RDD object
[ https://issues.apache.org/jira/browse/SPARK-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10587. --- Resolution: Not A Problem > In pyspark, toDF() dosen't exsist in RDD object > --- > > Key: SPARK-10587 > URL: https://issues.apache.org/jira/browse/SPARK-10587 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: SemiCoder > > I can't find toDF() function in RDD. > In pyspark.mllib.linalg.distributed , the IndexedRowMatrix.__init__() > require the rows should be an RDD and execute rows.toDF() but actually the > RDD in pyspark dosen't have toDF() function -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats
Joseph K. Bradley created SPARK-10602: - Summary: Univariate statistics as UDAFs: single-pass continuous stats Key: SPARK-10602 URL: https://issues.apache.org/jira/browse/SPARK-10602 Project: Spark Issue Type: Sub-task Components: ML, SQL Reporter: Joseph K. Bradley See parent JIRA for more details. This subtask covers statistics for continuous values requiring a single pass over the data, such as min and max. This JIRA is an umbrella. For individual stats, please create and link a new JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
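To make the shape concrete, here is a minimal single-pass "max" sketched against the UDAF API introduced in 1.5 (illustrative only; where the real statistics end up living is up to the linked JIRAs):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A single-pass statistic expressed as a 1.5 UDAF: one Double buffer slot,
// updated per row and merged across partitions.
class DoubleMax extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("max", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Double.NegativeInfinity
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = math.max(buffer.getDouble(0), input.getDouble(0))
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = math.max(buffer1.getDouble(0), buffer2.getDouble(0))
  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}
{code}

Registered via {{sqlContext.udf.register("dmax", new DoubleMax)}}, it can then be used from SQL or the DataFrame API like any built-in aggregate.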
[jira] [Updated] (SPARK-10591) False negative in QueryTest.checkAnswer
[ https://issues.apache.org/jira/browse/SPARK-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10591: --- Description: # For double and float, {{NaN == NaN}} is always {{false}} # {{checkAnswer}} doesn't handle {{Map\[K, V\]}} properly. For example: {noformat} scala> Map(1 -> 2, 2 -> 1).toString res0: String = Map(1 -> 2, 2 -> 1) scala> Map(2 -> 1, 1 -> 2).toString res1: String = Map(2 -> 1, 1 -> 2) {noformat} We can't rely on {{toString}} to compare {{Map\[K, V\]}} instances. Need to update {{checkAnswer}} to special case {{NaN}} and {{Map\[K, V\]}}. was: # For double and float, `NaN == NaN` is always `false` # `checkAnswer` doesn't handle `Map` properly. For example: {noformat} scala> Map(1 -> 2, 2 -> 1).toString res0: String = Map(1 -> 2, 2 -> 1) scala> Map(2 -> 1, 1 -> 2).toString res1: String = Map(2 -> 1, 1 -> 2) {noformat} We can't rely on `toString` to compare `Map` instances. Need to update `checkAnswer` to special case `NaN` and `Map`. > False negative in QueryTest.checkAnswer > --- > > Key: SPARK-10591 > URL: https://issues.apache.org/jira/browse/SPARK-10591 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.0 >Reporter: Cheng Lian > > # For double and float, {{NaN == NaN}} is always {{false}} > # {{checkAnswer}} doesn't handle {{Map\[K, V\]}} properly. For example: > {noformat} > scala> Map(1 -> 2, 2 -> 1).toString > res0: String = Map(1 -> 2, 2 -> 1) > scala> Map(2 -> 1, 1 -> 2).toString > res1: String = Map(2 -> 1, 1 -> 2) > {noformat} > We can't rely on {{toString}} to compare {{Map\[K, V\]}} instances. > Need to update {{checkAnswer}} to special case {{NaN}} and {{Map\[K, V\]}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744167#comment-14744167 ] Shivaram Venkataraman commented on SPARK-9325: -- Just `collect` and maybe `head`. This is just to show / preview what is in a column and also to convert columns to local vectors / lists that can be used as a vector. I don't think we want to support other functions on this. > Support `collect` on DataFrame columns > -- > > Key: SPARK-9325 > URL: https://issues.apache.org/jira/browse/SPARK-9325 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > This is to support code of the form > ``` > ages <- collect(df$Age) > ``` > Right now `df$Age` returns a Column, which has no functions supported. > Similarly we might consider supporting `head(df$Age)` etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option
[ https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10594: Assignee: Apache Spark > ApplicationMaster "--help" references the removed "--num-executors" option > -- > > Key: SPARK-10594 > URL: https://issues.apache.org/jira/browse/SPARK-10594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Erick Tryzelaar >Assignee: Apache Spark >Priority: Trivial > Attachments: > 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, > 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch > > > The issue SPARK-9092 and commit > [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79] > removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, > but it's help message still references the > [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option
[ https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744209#comment-14744209 ] Apache Spark commented on SPARK-10594: -- User 'erickt' has created a pull request for this issue: https://github.com/apache/spark/pull/8754 > ApplicationMaster "--help" references the removed "--num-executors" option > -- > > Key: SPARK-10594 > URL: https://issues.apache.org/jira/browse/SPARK-10594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Erick Tryzelaar >Priority: Trivial > Attachments: > 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, > 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch > > > The issue SPARK-9092 and commit > [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79] > removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, > but it's help message still references the > [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744275#comment-14744275 ] Apache Spark commented on SPARK-10598: -- User 'insidedctm' has created a pull request for this issue: https://github.com/apache/spark/pull/8756 > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Robin East >Priority: Minor > Fix For: 1.5.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10598: Assignee: (was: Apache Spark) > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Robin East >Priority: Minor > Fix For: 1.5.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10598: Assignee: Apache Spark > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Robin East >Assignee: Apache Spark >Priority: Minor > Fix For: 1.5.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10575) Wrap RDD.takeSample with scope
[ https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10575: -- Affects Version/s: 1.4.0 Target Version/s: 1.6.0 > Wrap RDD.takeSample with scope > -- > > Key: SPARK-10575 > URL: https://issues.apache.org/jira/browse/SPARK-10575 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Vinod KC >Priority: Minor > > Remove return statements in RDD.takeSample and wrap it withScope -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
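The reason the {{return}} statements have to go first: {{withScope}}-style wrappers take the body as a closure, and {{return}} inside a closure compiles to a non-local return ({{NonLocalReturnControl}}). A standalone illustration under that assumption — the {{withScope}} below is a stand-in, not the private {{RDD.withScope}}:

{code}
// Stand-in for the private RDD.withScope: takes the body as a by-name
// closure, which is exactly why early `return`s inside it are unsafe.
def withScope[T](body: => T): T = body

// Early exits rewritten as expression-valued branches instead of `return`.
def takeUpTo(xs: Seq[Int], num: Int): Seq[Int] = withScope {
  if (num <= 0) Seq.empty
  else xs.take(num)
}
{code}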
[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits
[ https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744359#comment-14744359 ] Robin East commented on SPARK-10598: Apologies - I have checked it out. You're referring to the Fix and Target Version fields, right? > RoutingTablePartition toMessage method refers to bytes instead of bits > -- > > Key: SPARK-10598 > URL: https://issues.apache.org/jira/browse/SPARK-10598 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.1, 1.5.0 >Reporter: Robin East >Assignee: Robin East >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet
[ https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10522: -- Assignee: Davies Liu > Nanoseconds part of Timestamp should be positive in parquet > --- > > Key: SPARK-10522 > URL: https://issues.apache.org/jira/browse/SPARK-10522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.6.0, 1.5.1 > > > If Timestamp is before unix epoch, the nanosecond part will be negative, Hive > can't read that back correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10549) scala 2.11 spark on yarn with security - Repl doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10549. --- Resolution: Fixed Fix Version/s: 1.5.1 1.6.0 Target Version/s: 1.6.0, 1.5.1 > scala 2.11 spark on yarn with security - Repl doesn't work > -- > > Key: SPARK-10549 > URL: https://issues.apache.org/jira/browse/SPARK-10549 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Fix For: 1.6.0, 1.5.1 > > > Trying to run spark on secure yarn built with scala 2.11 results in failure > when trying to launch the spark shell. > ./bin/spark-shell --master yarn-client > Exception in thread "main" java.lang.ExceptionInInitializerError > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.Exception: Error: a secret key must be specified via the > spark.authenticate.secret config > at > org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:395) > at org.apache.spark.SecurityManager.(SecurityManager.scala:218) > at org.apache.spark.repl.Main$.(Main.scala:38) > at org.apache.spark.repl.Main$.(Main.scala) > The reason is because it create the SecurityManager before is sets the > SPARK_YARN_MODE to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7040) Explore receiver-less DStream for Flume
[ https://issues.apache.org/jira/browse/SPARK-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7040. -- Resolution: Invalid > Explore receiver-less DStream for Flume > --- > > Key: SPARK-7040 > URL: https://issues.apache.org/jira/browse/SPARK-7040 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Hari Shreedharan > > I am thinking about repurposing the FlumePollingInputDStream to make it more > parallel and pull in data like the DirectKafkaDStream. Since Flume does not > have a unique way of identifying specific event offsets, this will not be > exactly-once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10573) IndexToString transformSchema adds output field as DoubleType
[ https://issues.apache.org/jira/browse/SPARK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10573: -- Fix Version/s: 1.5.1 > IndexToString transformSchema adds output field as DoubleType > - > > Key: SPARK-10573 > URL: https://issues.apache.org/jira/browse/SPARK-10573 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.0 >Reporter: Nick Pritchard >Assignee: Nick Pritchard > Fix For: 1.6.0, 1.5.1 > > > Reproducible example: > {code} > import org.apache.spark.ml.feature.IndexToString > import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType} > val stage = new IndexToString().setInputCol("input").setOutputCol("output") > val inSchema = StructType(Seq(StructField("input", DoubleType))) > val outSchema = stage.transformSchema(inSchema) > assert(outSchema("output").dataType == StringType) > {code} > The root cause seems to be that it uses {{NominalAttribute.toStructField}} > which assumes {{DoubleType}}. It would probably be better to just use > {{SchemaUtils.appendColumn}} and explicitly set the data type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance
[ https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744322#comment-14744322 ] Apache Spark commented on SPARK-10599: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/8757 > Decrease communication in BlockMatrix multiply and increase performance > --- > > Key: SPARK-10599 > URL: https://issues.apache.org/jira/browse/SPARK-10599 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz > > The BlockMatrix multiply sends each block to all the corresponding columns of > the right BlockMatrix, even though there might not be any corresponding block > to multiply with. > Some optimizations we can perform are: > - Simulate the multiplication on the driver, and figure out which blocks > actually need to be shuffled > - Send the block once to a partition, and join inside the partition rather > than sending multiple copies to the same partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
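A sketch of the first optimization bullet above, the driver-side simulation (shapes and names are assumptions for illustration, not Spark's actual code): given only the block coordinates of both matrices, the driver can compute which result columns each block of A actually contributes to, so blocks with no matching partner block are never shuffled.

{code}
// Illustrative driver-side simulation: A(i, k) is only needed for result
// blocks (i, j) where a non-empty B(k, j) exists, so compute those j's up
// front instead of sending A(i, k) to every column of the result.
def destinationsForA(
    aBlocks: Seq[(Int, Int)],   // (rowBlockIndex, colBlockIndex) present in A
    bBlocks: Set[(Int, Int)]    // (rowBlockIndex, colBlockIndex) present in B
  ): Map[(Int, Int), Seq[Int]] =
  aBlocks.map { case (i, k) =>
    (i, k) -> bBlocks.collect { case (`k`, j) => j }.toSeq.sorted
  }.toMap
{code}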
[jira] [Assigned] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance
[ https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10599: Assignee: (was: Apache Spark) > Decrease communication in BlockMatrix multiply and increase performance > --- > > Key: SPARK-10599 > URL: https://issues.apache.org/jira/browse/SPARK-10599 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz > > The BlockMatrix multiply sends each block to all the corresponding columns of > the right BlockMatrix, even though there might not be any corresponding block > to multiply with. > Some optimizations we can perform are: > - Simulate the multiplication on the driver, and figure out which blocks > actually need to be shuffled > - Send the block once to a partition, and join inside the partition rather > than sending multiple copies to the same partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option
[ https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10594. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > ApplicationMaster "--help" references the removed "--num-executors" option > -- > > Key: SPARK-10594 > URL: https://issues.apache.org/jira/browse/SPARK-10594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Erick Tryzelaar >Priority: Trivial > Fix For: 1.6.0 > > Attachments: > 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, > 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch > > > The issue SPARK-9092 and commit > [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79] > removed the {{ApplicationMaster}} command-line argument {{--num-executors}}, > but its help message still references the > [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option
[ https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10594: -- Assignee: Erick Tryzelaar > ApplicationMaster "--help" references the removed "--num-executors" option > -- > > Key: SPARK-10594 > URL: https://issues.apache.org/jira/browse/SPARK-10594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Erick Tryzelaar >Assignee: Erick Tryzelaar >Priority: Trivial > Fix For: 1.6.0 > > Attachments: > 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, > 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch > > > The issue SPARK-9092 and commit > [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79] > removed the {{ApplicationMaster}} command-line argument {{--num-executors}}, > but its help message still references the > [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9996) Create local nested loop join operator
[ https://issues.apache.org/jira/browse/SPARK-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9996. -- Resolution: Fixed Fix Version/s: 1.6.0 > Create local nested loop join operator > -- > > Key: SPARK-9996 > URL: https://issues.apache.org/jira/browse/SPARK-9996 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9997) Create local Expand operator
[ https://issues.apache.org/jira/browse/SPARK-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-9997. -- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Create local Expand operator > > > Key: SPARK-9997 > URL: https://issues.apache.org/jira/browse/SPARK-9997 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744508#comment-14744508 ] Joseph K. Bradley commented on SPARK-8418: -- Apologies for being AWOL! I'd definitely appreciate help with designing this improvement. For API (Vector vs. Map): I prefer sticking with a Vector API. I see the appeal of keeping columns separate, but DataFrames are not yet meant to handle too many columns (hundreds at most, I'd say). We can still keep feature names and metadata using ML attributes (which describe each feature in Vector columns in DataFrames). For sharing code, we should definitely do option 2. For backwards compatibility, we should not modify current Params, but we could add a new one for multiple inputs (and check for conflicting settings when running). I would hope we could share code in this multi-value transformation so that each transformer only needs to specify how to transform a single value. I hope we can do this, rather than implementing option 1 as the default. Would you mind sketching up a quick design doc? That should help clarify the different options and help us choose a simple but flexible API. If you'd like to follow existing examples, here are some you could look at: * Classification threshold (shorter doc): [https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing] * R-like stats for model (long doc): [https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing] The items we've discussed can be sketched out in the doc. After you link it from this JIRA, others can give you feedback on this JIRA (better than on the doc, since some people have trouble viewing Google docs). Thanks very much! > Add single- and multi-value support to ML Transformers > -- > > Key: SPARK-8418 > URL: https://issues.apache.org/jira/browse/SPARK-8418 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > It would be convenient if all feature transformers supported transforming > columns of single values and multiple values, specifically: > * one column with one value (e.g., type {{Double}}) > * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}}) > We could go as far as supporting multiple columns, but that may not be > necessary since VectorAssembler could be used to handle that. > Estimators under {{ml.feature}} should also support this. > This will likely require a short design doc to describe: > * how input and output columns will be specified > * schema validation > * code sharing to reduce duplication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
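A rough sketch of the backwards-compatible Param idea from the comment above (the trait and method names are invented for illustration and are not a proposed final API): leave the existing single-column Param untouched, add a separate array-valued one, and reject conflicting settings at runtime.

{code}
import org.apache.spark.ml.param.{Param, Params, StringArrayParam}

// Hypothetical trait; only the shape of the API matters here.
trait HasSingleOrMultiInput extends Params {
  final val inputCol = new Param[String](this, "inputCol", "single input column")
  final val inputCols = new StringArrayParam(this, "inputCols", "multiple input columns")

  // "Check for conflicting settings when running", per the discussion above.
  protected def validateInputParams(): Unit =
    require(!(isSet(inputCol) && isSet(inputCols)),
      "Only one of inputCol and inputCols may be set.")
}
{code}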
[jira] [Created] (SPARK-10604) Univariate statistics as UDAFs: categorical stats
Joseph K. Bradley created SPARK-10604: - Summary: Univariate statistics as UDAFs: categorical stats Key: SPARK-10604 URL: https://issues.apache.org/jira/browse/SPARK-10604 Project: Spark Issue Type: Sub-task Components: ML, SQL Reporter: Joseph K. Bradley See parent JIRA for more details. This subtask covers statistics for categorical values, such as number of categories or mode. This JIRA is an umbrella. For individual stats, please create and link a new JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744175#comment-14744175 ] Shivaram Venkataraman commented on SPARK-9325: -- Hmm, not necessarily. If `df$newAge <- df$Age + 1; collect(df$newAge)` works, that is fine. (The first line already works, btw!) > Support `collect` on DataFrame columns > -- > > Key: SPARK-9325 > URL: https://issues.apache.org/jira/browse/SPARK-9325 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > This is to support code of the form > ``` > ages <- collect(df$Age) > ``` > Right now `df$Age` returns a Column, which has no functions supported. > Similarly we might consider supporting `head(df$Age)` etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10563) SparkContext's local properties should be cloned when inherited
[ https://issues.apache.org/jira/browse/SPARK-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744291#comment-14744291 ] Apache Spark commented on SPARK-10563: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8721 > SparkContext's local properties should be cloned when inherited > --- > > Key: SPARK-10563 > URL: https://issues.apache.org/jira/browse/SPARK-10563 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > Currently, when a child thread inherits local properties from the parent > thread, it gets a reference of the parent's local properties and uses them as > default values. > The problem, however, is that when the parent changes the value of an > existing property, the changes are reflected in the child thread! This has > very confusing semantics, especially in streaming. > {code} > private val localProperties = new InheritableThreadLocal[Properties] { > override protected def childValue(parent: Properties): Properties = new > Properties(parent) > override protected def initialValue(): Properties = new Properties() > } > {code} > Instead, we should make a clone of the parent properties rather than passing > in a mutable reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
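A hedged sketch of the clone-instead-of-reference proposal above (one possible implementation, not necessarily what was merged): snapshot the parent's entries into a fresh Properties so later parent-side mutations stay invisible to the child thread.

{code}
import java.util.Properties
import scala.collection.JavaConverters._

private val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties = {
    // Copy every visible key (including inherited defaults) rather than
    // keeping a live reference to the parent as a defaults table.
    val child = new Properties()
    parent.stringPropertyNames().asScala.foreach { key =>
      child.setProperty(key, parent.getProperty(key))
    }
    child
  }
  override protected def initialValue(): Properties = new Properties()
}
{code}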
[jira] [Updated] (SPARK-10575) Wrap RDD.takeSample with scope
[ https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-10575: -- Assignee: Vinod KC > Wrap RDD.takeSample with scope > -- > > Key: SPARK-10575 > URL: https://issues.apache.org/jira/browse/SPARK-10575 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Vinod KC >Assignee: Vinod KC >Priority: Minor > > Remove the return statements in RDD.takeSample and wrap it in {{withScope}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
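For background on why the two changes likely go together, a self-contained sketch (withScope is stood in by a trivial wrapper here; the real one is a private method on RDD): a `return` inside a closure passed to a wrapper compiles into a NonLocalReturnControl-based non-local return, so the early returns have to become plain expressions before the body can be wrapped.

{code}
// Trivial stand-in for RDD.withScope, which wraps an action's body so it
// shows up as one operation scope in the UI.
def withScope[T](body: => T): T = body

// Expression-style branches instead of early `return` statements:
def takeSampleLike(xs: Seq[Int], num: Int): Seq[Int] = withScope {
  if (num <= 0) Seq.empty
  else xs.take(num)
}
{code}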
[jira] [Created] (SPARK-10601) Spark SQL - Support for MINUS
Richard Garris created SPARK-10601: -- Summary: Spark SQL - Support for MINUS Key: SPARK-10601 URL: https://issues.apache.org/jira/browse/SPARK-10601 Project: Spark Issue Type: Improvement Reporter: Richard Garris Spark SQL does not currently support the SQL MINUS operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
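For reference, a hedged illustration (table names t1/t2 are made up): Oracle-style MINUS computes the same result as standard SQL EXCEPT, which Spark SQL already parses, so the request amounts to accepting MINUS as an alternate spelling.

{code}
sqlContext.range(5).registerTempTable("t1")
sqlContext.range(3).registerTempTable("t2")

// Works today: standard SQL set difference.
sqlContext.sql("SELECT id FROM t1 EXCEPT SELECT id FROM t2").show()

// What this issue asks to also accept:
//   SELECT id FROM t1 MINUS SELECT id FROM t2
{code}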
[jira] [Commented] (SPARK-10587) In pyspark, toDF() doesn't exist in RDD object
[ https://issues.apache.org/jira/browse/SPARK-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744643#comment-14744643 ] SemiCoder commented on SPARK-10587: --- It's not my code; it's code in the latest released version. When I create an IndexedRowMatrix, I pass the "rows" parameter, and the __init__ method checks whether it is an RDD; if it is, it calls a Java function, and one argument of that call is "rows.toDF()". However, toDF() doesn't exist on RDD. I know it exists via SQLContext. So I think this is an error in python/pyspark/mllib/linalg/distributed.py. Otherwise, could you tell me how to create an RDD that has a toDF() function, to avoid this situation? > In pyspark, toDF() doesn't exist in RDD object > --- > > Key: SPARK-10587 > URL: https://issues.apache.org/jira/browse/SPARK-10587 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: SemiCoder > > I can't find the toDF() function on RDD. > In pyspark.mllib.linalg.distributed, IndexedRowMatrix.__init__() > requires that rows be an RDD and executes rows.toDF(), but the > RDD in pyspark doesn't have a toDF() function -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
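A hedged, minimal example (the app name and data are illustrative): in PySpark, toDF() is attached to RDD as a side effect of constructing a SQLContext (in python/pyspark/sql/context.py), so creating one before building the matrix lets the internal rows.toDF() call resolve.

{code}
# RDD.toDF is monkey-patched onto RDD when a SQLContext is constructed,
# so create one before instantiating IndexedRowMatrix.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

sc = SparkContext(appName="toDF-example")
sqlContext = SQLContext(sc)  # side effect: defines toDF() on RDD

rows = sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                       IndexedRow(1, [3.0, 4.0])])
mat = IndexedRowMatrix(rows)  # rows.toDF() now works internally
print(mat.numRows(), mat.numCols())
{code}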