[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491820#comment-14491820 ] Yijie Shen commented on SPARK-6859: --- I opened a JIRA ticket in Parquet: [PARQUET-251|https://issues.apache.org/jira/browse/PARQUET-251] Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD[Row], where each row is a GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused among rows, but its content changes each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max/min) for the Binary column are wrong. Here is the reason: in Parquet, BinaryStatistic keeps max/min only as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row: max: Binary --(references)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]. Therefore, each time Parquet updates the row group's statistics, max and min still refer to the same Array[Byte], which holds new content each time. When Parquet writes them to the file, the last row's content is saved as both max and min. It seems to be a Parquet bug, since it is Parquet's responsibility to update statistics correctly, but I am not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
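To make the aliasing concrete, here is a minimal, self-contained Scala sketch of the failure mode described above (plain Scala, no Spark/Parquet APIs; all names are illustrative):
{code}
object StatsAliasingDemo {
  class BinaryStats { var max: Array[Byte] = _ } // stands in for Parquet's BinaryStatistics

  def main(args: Array[String]): Unit = {
    val stats = new BinaryStats
    val buf = new Array[Byte](3) // one buffer reused across "rows"

    for (row <- Seq("aaa", "zzz", "mmm")) {
      row.getBytes("UTF-8").copyToArray(buf) // new content, same object
      if (stats.max == null || new String(buf) > new String(stats.max)) {
        stats.max = buf // BUG: stores the reference; a defensive copy (buf.clone()) would be correct
      }
    }
    println(new String(stats.max)) // prints "mmm" (the last row), not "zzz" (the true max)
  }
}
{code}
Because {{stats.max}} aliases the reused buffer, every later comparison is against the buffer's current content, so whatever the last row held is reported as the maximum.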
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6199: --- Assignee: (was: Cheng Hao) Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Fix For: 1.4.0 Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
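For illustration, a CTE query through SQLContext could look like the following once this lands (the {{people}} table and its columns are hypothetical):
{code}
val df = sqlContext.sql(
  """WITH young AS (SELECT name, age FROM people WHERE age < 30)
    |SELECT name FROM young ORDER BY age""".stripMargin)
{code}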
[jira] [Created] (SPARK-6875) Add support for Joda-time types
Patrick Grandjean created SPARK-6875: Summary: Add support for Joda-time types Key: SPARK-6875 URL: https://issues.apache.org/jira/browse/SPARK-6875 Project: Spark Issue Type: Improvement Components: SQL Reporter: Patrick Grandjean The need comes from the following use case:
{code}
val objs: RDD[MyClass] = [...]
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._
objs.saveAsParquetFile("parquet")
{code}
MyClass contains Joda-time fields. When saving to a Parquet file, an exception is thrown (a MatchError in ScalaReflection.scala). Spark SQL supports the java.sql date/time types; this request is to add support for Joda-time types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
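Until such support exists, one possible workaround is to convert Joda-time fields to the already-supported java.sql types before building the RDD that gets saved. A sketch under assumptions: MyClass's actual fields are unknown, so the names below are hypothetical.
{code}
import java.sql.Timestamp
import org.joda.time.DateTime

// Mirror of the hypothetical MyClass with its Joda-time field converted to
// java.sql.Timestamp, which ScalaReflection already understands.
case class MyClassSql(id: Int, created: Timestamp)

def toSqlFriendly(id: Int, created: DateTime): MyClassSql =
  MyClassSql(id, new Timestamp(created.getMillis))
{code}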
[jira] [Commented] (SPARK-6849) The constructor of GradientDescent should be public
[ https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491861#comment-14491861 ] Guoqiang Li commented on SPARK-6849: [~srowen] https://github.com/cloudml/zen The constructor of GradientDescent should be public --- Key: SPARK-6849 URL: https://issues.apache.org/jira/browse/SPARK-6849 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: Guoqiang Li Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6849) The constructor of GradientDescent should be public
[ https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491887#comment-14491887 ] Joseph K. Bradley commented on SPARK-6849: -- It would be great to open up the optimization APIs, but I think we should clean them up before making them public. (Alternatively, we could make them public but mark them all as Experimental.) I hope we can figure out what cleanups are needed here: [https://issues.apache.org/jira/browse/SPARK-5256] The constructor of GradientDescent should be public --- Key: SPARK-6849 URL: https://issues.apache.org/jira/browse/SPARK-6849 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: Guoqiang Li Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6545) Minor changes for CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-6545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491914#comment-14491914 ] Cheng Hao commented on SPARK-6545: -- Thank you [~srowen], we should close this for now; I will reopen it when I have a more general idea for the update. Minor changes for CompactBuffer --- Key: SPARK-6545 URL: https://issues.apache.org/jira/browse/SPARK-6545 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor HashedRelation should always return a non-null CompactBuffer, which will be helpful for the further improvement of multi-way join -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6643) Python API for StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6643: - Assignee: Kai Sasaki Python API for StandardScalerModel -- Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor Labels: mllib, python Fix For: 1.4.0 This is the sub-task of SPARK-6254. Wrap missing method for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6643) Python API for StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6643. -- Resolution: Fixed Issue resolved by pull request 5310 [https://github.com/apache/spark/pull/5310] Python API for StandardScalerModel -- Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Labels: mllib, python Fix For: 1.4.0 This is the sub-task of SPARK-6254. Wrap missing method for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491941#comment-14491941 ] Yu Ishikawa commented on SPARK-765: --- [~joshrosen] sorry, one more thing. Are we allowed to add test suites for spark.examples? We are discussing deprecating the static train() methods in Scala/Java on SPARK-6682. I think it is a good time to add test suites to spark.examples. Test suite should run Spark example programs Key: SPARK-765 URL: https://issues.apache.org/jira/browse/SPARK-765 Project: Spark Issue Type: New Feature Components: Examples Reporter: Josh Rosen The Spark test suite should also run each of the Spark example programs (the PySpark suite should do the same). This should be done through a shell script or other mechanism to simulate the environment setup used by end users that run those scripts. This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6765) Turn scalastyle on for test code
[ https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491765#comment-14491765 ] Apache Spark commented on SPARK-6765: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5484 Turn scalastyle on for test code Key: SPARK-6765 URL: https://issues.apache.org/jira/browse/SPARK-6765 Project: Spark Issue Type: Improvement Components: Project Infra, Tests Reporter: Reynold Xin Assignee: Reynold Xin We should turn scalastyle on for test code. Test code should be as important as main code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864 ] Yi Zhou edited comment on SPARK-5791 at 4/13/15 2:57 AM: - We changed the file format from ORC to Parquet and tested based on the latest Spark code (1.4.0-SNAPSHOT). Got the result below: Spark SQL (2m28s) vs. Hive (3m12s) was (Author: jameszhouyi): We changed file format from ORC to Parquet. Got the result like below: Spark SQL(2m28s) vs. Hive (3m12s) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables are joined -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864 ] Yi Zhou commented on SPARK-5791: We changed the file format from ORC to Parquet. Got the result below: Spark SQL (2m28s) vs. Hive (3m12s) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables are joined -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491873#comment-14491873 ] Jack Hu commented on SPARK-6847: Hi, [~sowen] I tested more cases: # only change the {{newlist.headOption.orElse(oldstate)}} to {{Some("a")}}, the issue still exists # only change the streaming batch interval to {{2 seconds}}, keep the {{newlist.headOption.orElse(oldstate)}} and checkpoint interval 10 seconds, the issue does not exist. So this issue may be related to the checkpoint interval and batch interval. Stack overflow on updateStateByKey which followed by a dstream with checkpoint set -- Key: SPARK-6847 URL: https://issues.apache.org/jira/browse/SPARK-6847 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Jack Hu Labels: StackOverflowError, Streaming The issue happens with the following sample code: it uses {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 seconds
{code}
val sparkConf = new SparkConf().setAppName("test")
val streamingContext = new StreamingContext(sparkConf, Seconds(10))
streamingContext.checkpoint("checkpoint")
val source = streamingContext.socketTextStream("localhost", )
val updatedResult = source.map((1, _)).updateStateByKey(
  (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
updatedResult.map(_._2)
  .checkpoint(Seconds(10))
  .foreachRDD((rdd, t) => {
    println("Deep: " + rdd.toDebugString.split("\n").length)
    println(t.toString() + ": " + rdd.collect.length)
  })
streamingContext.start()
streamingContext.awaitTermination()
{code}
From the output, we can see that the dependency chain keeps growing over time, the {{updateStateByKey}} state never gets checkpointed, and finally the stack overflow happens. Note: * The RDD in {{updatedResult.map(_._2)}} does get checkpointed in this case, but not the {{updateStateByKey}} state * If the {{checkpoint(Seconds(10))}} on the map result ( {{updatedResult.map(_._2)}} ) is removed, the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491873#comment-14491873 ] Jack Hu edited comment on SPARK-6847 at 4/13/15 3:34 AM: - Hi, [~sowen] I tested more cases: # only change the {{newlist.headOption.orElse(oldstate)}} to {{Some("a")}}, the issue still exists # only change the streaming batch interval to {{2 seconds}}, keep the {{newlist.headOption.orElse(oldstate)}} and checkpoint interval 10 seconds, the issue does not exist. So this issue may be related to the checkpoint interval and batch interval. was (Author: jhu): Hi, [~sowen] I tested more cases: # only change the {{newlist.headOption.orElse(oldstate)}} to {{Some(a)}}, the issue still exists # only change the streaming batch interval to {{2 seconds}}, keep the {{newlist.headOption.orElse(oldstate)}} and checkpoint interval 10 seconds, the issue does not exist. So this issue may related to the checkpoint interval and batch interval. Stack overflow on updateStateByKey which followed by a dstream with checkpoint set -- Key: SPARK-6847 URL: https://issues.apache.org/jira/browse/SPARK-6847 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Jack Hu Labels: StackOverflowError, Streaming The issue happens with the following sample code: it uses {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 seconds
{code}
val sparkConf = new SparkConf().setAppName("test")
val streamingContext = new StreamingContext(sparkConf, Seconds(10))
streamingContext.checkpoint("checkpoint")
val source = streamingContext.socketTextStream("localhost", )
val updatedResult = source.map((1, _)).updateStateByKey(
  (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
updatedResult.map(_._2)
  .checkpoint(Seconds(10))
  .foreachRDD((rdd, t) => {
    println("Deep: " + rdd.toDebugString.split("\n").length)
    println(t.toString() + ": " + rdd.collect.length)
  })
streamingContext.start()
streamingContext.awaitTermination()
{code}
From the output, we can see that the dependency chain keeps growing over time, the {{updateStateByKey}} state never gets checkpointed, and finally the stack overflow happens. Note: * The RDD in {{updatedResult.map(_._2)}} does get checkpointed in this case, but not the {{updateStateByKey}} state * If the {{checkpoint(Seconds(10))}} on the map result ( {{updatedResult.map(_._2)}} ) is removed, the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6151) schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size
[ https://issues.apache.org/jira/browse/SPARK-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491881#comment-14491881 ] Littlestar commented on SPARK-6151: --- The HDFS block size is set once when you first install Hadoop, but the block size can also be specified per file when the file is created: {{FSDataOutputStream org.apache.hadoop.fs.FileSystem.create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException}} schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size --- Key: SPARK-6151 URL: https://issues.apache.org/jira/browse/SPARK-6151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Littlestar Priority: Trivial How can a schemaRDD written with saveAsParquetFile control the HDFS block size? Maybe a Configuration option is needed. Related questions by others: http://apache-spark-user-list.1001560.n3.nabble.com/HDFS-block-size-for-parquet-output-tt21183.html http://qnalist.com/questions/5054892/spark-sql-parquet-and-impala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
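One possible workaround, sketched under the assumption that the Parquet writer picks up the block size from the job's Hadoop configuration ({{dfs.blocksize}} is the standard HDFS client-side property; the output path is hypothetical):
{code}
// Raise the HDFS block size for files written by this job to 256 MB.
sc.hadoopConfiguration.setLong("dfs.blocksize", 256L * 1024 * 1024)
schemaRDD.saveAsParquetFile("/tmp/out.parquet")
{code}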
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491905#comment-14491905 ] Yu Ishikawa commented on SPARK-6682: [~josephkb] sounds great. As you're suggesting, we should gradually tackle each algorithm one by one. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6562. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Reynold Xin DataFrame.na.replace value support -- Key: SPARK-6562 URL: https://issues.apache.org/jira/browse/SPARK-6562 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 Support replacing a set of values with another set of values (i.e. map join), similar to Pandas' replace. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
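For reference, the resolved Scala API looks roughly like this (column names and values below are illustrative):
{code}
// Replace values in one column, or apply the same mapping across several columns.
val df2 = df.na.replace("height", Map(0.0 -> Double.NaN))
val df3 = df.na.replace(Seq("firstname", "lastname"), Map("Alice" -> "A"))
{code}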
[jira] [Updated] (SPARK-6858) Register Java HashMap for SparkSqlSerializer
[ https://issues.apache.org/jira/browse/SPARK-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6858: --- Assignee: Liang-Chi Hsieh Register Java HashMap for SparkSqlSerializer Key: SPARK-6858 URL: https://issues.apache.org/jira/browse/SPARK-6858 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Trivial Fix For: 1.4.0 Since the Kryo serializer is now used for {{GeneralHashedRelation}} whether or not Kryo is enabled, it is better to register Java {{HashMap}} in {{SparkSqlSerializer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
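The gist of the change, as a minimal sketch (the surrounding registrator plumbing is assumed, not shown in this issue):
{code}
import com.esotericsoftware.kryo.Kryo

// In SparkSqlSerializer's Kryo setup: register java.util.HashMap explicitly so
// Kryo doesn't fall back to slower generic serialization for it.
def registerSqlClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[java.util.HashMap[_, _]])
}
{code}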
[jira] [Resolved] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4760. Resolution: Not A Problem ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get the estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} And I've tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
[ https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6611: --- Assignee: Santiago M. Mola Add support for INTEGER as synonym of INT to DDLParser -- Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Priority: Minor Fix For: 1.4.0 Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
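Once this is in, both spellings should parse in data source DDL; an illustrative statement (the table, data source, and path below are hypothetical):
{code}
sqlContext.sql(
  """CREATE TEMPORARY TABLE people (id INTEGER, age INT)
    |USING org.apache.spark.sql.json
    |OPTIONS (path 'people.json')""".stripMargin)
{code}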
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491834#comment-14491834 ] Kannan Rajah commented on SPARK-1529: - Thanks. FYI, I have pushed a few more commits to my repo to handle all the TODOs and bug fixes, so you can track this branch for all the changes: https://github.com/rkannan82/spark/commits/dfs_shuffle Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1227) Diagnostics for ClassificationRegression
[ https://issues.apache.org/jira/browse/SPARK-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491895#comment-14491895 ] Joseph K. Bradley commented on SPARK-1227: -- I agree it will be nice to provide loss classes. Even though *Metrics classes exist already, loss classes might be nice as we provide more functionality for diagnosis during learning (e.g., for early stopping, model selection, etc.). Added link to related JIRA on optimization APIs. Diagnostics for ClassificationRegression - Key: SPARK-1227 URL: https://issues.apache.org/jira/browse/SPARK-1227 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Martin Jaggi Assignee: Martin Jaggi Currently, the attained objective function is not computed (for efficiency reasons, as one evaluation requires one full pass through the data). For diagnostics and comparing different algorithms, we should however provide this as a separate function (one MR). Doing this requires the loss and regularizer functions themselves, not only their gradients (which are currently in the Gradient class). How about adding the new function directly on the corresponding models in classification/* and regression/* ? Any thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
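As a sketch of what such loss classes could look like (the names here are illustrative, not an agreed-upon API):
{code}
trait Loss extends Serializable {
  def loss(prediction: Double, label: Double): Double     // objective value, for diagnostics
  def gradient(prediction: Double, label: Double): Double // what the Gradient class already covers
}

object SquaredLoss extends Loss {
  def loss(p: Double, y: Double): Double = 0.5 * (p - y) * (p - y)
  def gradient(p: Double, y: Double): Double = p - y
}
{code}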
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491919#comment-14491919 ] Michael Kuhlen commented on SPARK-3727: --- Hello! I've implemented predictWithProbabilities() methods for DecisionTreeModel and treeEnsembleModels in scala. These methods return both the most likely class as well as the probabilities of each of the classes. As in scikit-learn, the probabilities are defined as the mean predicted class probabilities of the trees in the forest\[, where the\] class probability of a single tree is the fraction of samples of the same class in a leaf. ([sklearn.ensemble.RandomForestClassifier.predict_proba|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba]) My approach was to modify the Predict class to hold the class probabilities for all classes (as opposed to just of the majority class), and I utilize these probabilities to determine the means over all trees. I believe this should work for GBTrees as well, since I'm taking care to weight the probabilities by the weight of each tree (=1.0 for RandomForest). Here's a [link to my fork|https://github.com/apache/spark/compare/master...mqk:master] showing my modifications. I would be happy to issue a pull request for these changes, if that would be of interest to the community. Although I haven't done so yet, I believe it should be straightforward to extend this to also calculate the variance of estimates for regression algorithms, as suggested in this ticket. Best, Mike DecisionTree, RandomForest: More prediction functionality - Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful. For classification: estimated probability of each possible label For regression: variance of estimate RandomForest could also create aggregate predictions in multiple ways: * Predict mean or median value for regression. * Compute variance of estimates (across all trees) for both classification and regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
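The averaging described in the comment above can be sketched as follows (illustrative only, not the code in the linked fork): each tree contributes a per-class probability vector derived from leaf class fractions, and the ensemble prediction is their weighted mean (weights are 1.0 for RandomForest).
{code}
def averageProbabilities(
    treeProbs: Seq[Array[Double]], // per-tree class probabilities
    treeWeights: Seq[Double]): (Int, Array[Double]) = {
  val numClasses = treeProbs.head.length
  val avg = new Array[Double](numClasses)
  for ((probs, w) <- treeProbs.zip(treeWeights); c <- 0 until numClasses)
    avg(c) += w * probs(c)
  val totalWeight = treeWeights.sum
  for (c <- 0 until numClasses) avg(c) /= totalWeight
  (avg.indices.maxBy(avg(_)), avg) // (most likely class, class probabilities)
}
{code}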
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491863#comment-14491863 ] Patrick Wendell commented on SPARK-1529: Hey Kannan, We originally considered doing something like you are proposing, where we would change our filesystem interactions to all use a Hadoop FileSystem class and then we'd use Hadoop's LocalFileSystem. However, there were two issues: 1. We used POSIX APIs that are not present in Hadoop. For instance, we use memory mapping in various places, FileChannel in the BlockObjectWriter, etc. 2. Using LocalFileSystem has substantial performance overhead compared with our current code. So we'd have to write our own implementation of a local filesystem. For this reason, we decided that our current shuffle machinery was fundamentally not usable for non-POSIX environments. So we decided that instead, we'd let people customize shuffle behavior at a higher level, and we implemented the pluggable shuffle components. So you can create a shuffle manager that is specifically optimized for a particular environment (e.g. MapR). Did you consider implementing a MapR shuffle using that mechanism instead? You'd have to operate at a higher level, where you reason about shuffle records, etc. But you'd have a lot of flexibility to optimize within that. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4081) Categorical feature indexing
[ https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4081. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 3000 [https://github.com/apache/spark/pull/3000] Categorical feature indexing Key: SPARK-4081 URL: https://issues.apache.org/jira/browse/SPARK-4081 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.4.0 DecisionTree and RandomForest require that categorical features and labels be indexed 0, 1, 2, .... There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Double, Int] = datasetIndexer.getCategoricalFeatureIndexes()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6869: - Component/s: PySpark Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6870: - Component/s: YARN Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Components: YARN Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
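A minimal sketch of the fix described above (the shape of the monitor thread is assumed, not taken from the PR):
{code}
val monitorThread = new Thread("YARN application state monitor") {
  override def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        // ... poll the YARN application state ...
        Thread.sleep(1000)
      }
    } catch {
      // Expected when the thread is interrupted on shutdown: exit quietly
      // instead of letting the stack trace reach the log.
      case _: InterruptedException =>
    }
  }
}
{code}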
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491869#comment-14491869 ] Kannan Rajah commented on SPARK-1529: - [~pwendell] The default code path still uses the FileChannel and memory mapping techniques. I just provided an abstraction called FileSystem.scala (not Hadoop's FileSystem.java). LocalFileSystem.scala delegates the call to the existing Spark code path that uses FileChannel. I am using Hadoop's RawLocalFileSystem class just to get an InputStream and OutputStream, and this internally also uses FileChannel. Please see RawLocalFileSystem.LocalFSFileInputStream; it is just a wrapper on java.io.FileInputStream. Going back to why I considered this approach: it will allow us to reuse all the logic currently used by the SortShuffle code path. Otherwise, we would have to reimplement pretty much everything Spark already does to do the shuffle on HDFS. We are in the process of running some performance tests to understand the impact of the change. One of the main things we will be verifying is whether any performance degradation has been introduced in the default code path, and fixing it if there is any. Is this acceptable? Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6765) Turn scalastyle on for test code
[ https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491893#comment-14491893 ] Apache Spark commented on SPARK-6765: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5486 Turn scalastyle on for test code Key: SPARK-6765 URL: https://issues.apache.org/jira/browse/SPARK-6765 Project: Spark Issue Type: Improvement Components: Project Infra, Tests Reporter: Reynold Xin Assignee: Reynold Xin We should turn scalastyle on for test code. Test code should be as important as main code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491891#comment-14491891 ] Joseph K. Bradley commented on SPARK-5256: -- Added link to [SPARK-1227], which discusses ML diagnostics and brings up the question of what loss functions should be provided as Loss classes rather than via the ClassificationMetrics and RegressionMetrics classes. Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491906#comment-14491906 ] Yu Ishikawa commented on SPARK-6682: [~avulanov] thank you for your answer. And I understand SPARK-5256 blocks this issue. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4760: ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get the estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} And I've tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6179) Support SHOW PRINCIPALS role_name;
[ https://issues.apache.org/jira/browse/SPARK-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6179: --- Assignee: Zhongshuai Pei Support SHOW PRINCIPALS role_name; Key: SPARK-6179 URL: https://issues.apache.org/jira/browse/SPARK-6179 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Zhongshuai Pei Assignee: Zhongshuai Pei Fix For: 1.4.0 SHOW PRINCIPALS role_name; Lists all roles and users who belong to this role. Only the admin role has privilege for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6199: --- Assignee: Cheng Hao Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Assignee: Cheng Hao Fix For: 1.4.0 Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491877#comment-14491877 ] Ilya Ganelin commented on SPARK-6703: - Patrick - I can look into this. Thank you. Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc. where there is a shared SparkContext. It would be nice to provide a rendez-vous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
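A minimal sketch of the proposed rendez-vous point (names follow the description above; this is not a merged implementation):
{code}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextRegistry {
  private val active = new AtomicReference[SparkContext]()

  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    Option(active.get()).getOrElse {
      val sc = new SparkContext(conf)
      active.set(sc)
      sc
    }
  }

  // Setter so an outer framework/server can share one context with
  // multiple downstream applications.
  def setActive(sc: SparkContext): Unit = active.set(sc)
}
{code}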
[jira] [Updated] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6865: --- Summary: Decide on semantics for string identifiers in DataFrame API (was: Decide on semantics for string identifiers in DataSource API) Decide on semantics for string identifiers in DataFrame API --- Key: SPARK-6865 URL: https://issues.apache.org/jira/browse/SPARK-6865 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker There are two options:
- Quoted identifiers: meaning that the strings are treated as though they were in backticks in SQL. Any weird characters (spaces, etc.) are considered part of the identifier. Kind of weird given that `*` is already a special identifier explicitly allowed by the API.
- Unquoted parsed identifiers: would allow users to specify things like tableAlias.*. However, this would also require explicit use of `backticks` for identifiers with weird characters in them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6876) DataFrame.na.replace value support for Python
Reynold Xin created SPARK-6876: -- Summary: DataFrame.na.replace value support for Python Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6863) Formatted list broken on Hive compatibility section of SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6863: --- Assignee: Santiago M. Mola Formatted list broken on Hive compatibility section of SQL programming guide Key: SPARK-6863 URL: https://issues.apache.org/jira/browse/SPARK-6863 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Priority: Trivial Fix For: 1.3.1, 1.4.0 Formatted list broken on Hive compatibility section of SQL programming guide. It does not appear as a list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491857#comment-14491857 ] Guoqiang Li commented on SPARK-3937: Get data:
{code:none}
wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2
{code}
Get code and build it:
{code:none}
git clone https://github.com/cloudml/zen.git
mvn -DskipTests clean package
{code}
spark-defaults.conf:
{code:none}
spark.yarn.dist.archives hdfs://ns1:8020/input/lbs/recommend/toona/spark/conf
spark.yarn.user.classpath.first true
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.cleanCheckpoints true
spark.cleaner.referenceTracking.blocking.shuffle true
spark.yarn.historyServer.address 10dian71:18080
spark.executor.cores 2
spark.yarn.executor.memoryOverhead 1
spark.yarn.driver.memoryOverhead 1
spark.executor.instances 36
spark.rdd.compress true
spark.executor.memory 4g
spark.akka.frameSize 20
spark.akka.askTimeout 120
spark.akka.timeout 120
spark.default.parallelism 72
spark.locality.wait 1
spark.core.connection.ack.wait.timeout 360
spark.storage.memoryFraction 0.1
spark.broadcast.factory org.apache.spark.broadcast.TorrentBroadcastFactory
spark.driver.maxResultSize 4000
#spark.shuffle.blockTransferService nio
#spark.akka.heartbeat.interval 100
#spark.kryoserializer.buffer.max.mb 128
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.graphx.GraphKryoRegistrator
#spark.kryo.registrator com.github.cloudml.zen.ml.clustering.LDAKryoRegistrator
{code}
Reproduce:
{code:none}
./bin/spark-shell --master yarn-client --driver-memory 8g --jars /opt/spark/classes/zen-assembly.jar
{code}
{code:none}
import com.github.cloudml.zen.ml.regression.LogisticRegression
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint

val dataSet = MLUtils.loadLibSVMFile(sc, "/input/lbs/recommend/kdda/*").repartition(72).cache()
val numIterations = 150
val stepSize = 0.1
val l1 = 0.0
val epsilon = 1e-6
val useAdaGrad = false
LogisticRegression.trainMIS(dataSet, numIterations, stepSize, l1, epsilon, useAdaGrad)
{code}
Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough that I figured I should post it.
{code}
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
{code}
[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
[ https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491885#comment-14491885 ] Joseph K. Bradley commented on SPARK-6823: -- This sounds like it would be covered by the OneHotEncoder + VectorAssembler feature transformers: * [https://issues.apache.org/jira/browse/SPARK-5888] * [https://issues.apache.org/jira/browse/SPARK-5885] Do you think these belong within DataFrame (and that this JIRA should be for SQL instead of ML)? Add a model.matrix like capability to DataFrames (modelDataFrame) - Key: SPARK-6823 URL: https://issues.apache.org/jira/browse/SPARK-6823 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Shivaram Venkataraman Currently Mllib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R), that need to be converted to a set of 0/1 indicator variables (making the data actually used in a modeling algorithm completely numeric). In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
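If those two transformers cover it, the flow could look like this (a sketch assuming the 1.4-era spark.ml feature API; column names are illustrative):
{code}
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

val encoder = new OneHotEncoder()
  .setInputCol("countryIndex") // indexed categorical column
  .setOutputCol("countryVec")  // 0/1 indicator vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "countryVec"))
  .setOutputCol("features")    // fully numeric feature vector
{code}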
[jira] [Resolved] (SPARK-5885) Add VectorAssembler
[ https://issues.apache.org/jira/browse/SPARK-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5885. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5196 [https://github.com/apache/spark/pull/5196] Add VectorAssembler --- Key: SPARK-5885 URL: https://issues.apache.org/jira/browse/SPARK-5885 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 `VectorAssembler` takes a list of columns (of type double/int/vector) and merges them into a single vector column.
{code}
val va = new VectorAssembler()
  .setInputCols("userFeatures", "dayOfWeek", "timeOfDay")
  .setOutputCol("features")
{code}
In the first version, it should be okay if it doesn't handle ML attributes (SPARK-4588). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5886. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4735 [https://github.com/apache/spark/pull/4735] Add LabelIndexer Key: SPARK-5886 URL: https://issues.apache.org/jira/browse/SPARK-5886 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 `LabelIndexer` takes a column of labels (raw categories) and outputs an integer column with labels indexed by their frequency.
{code}
val li = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity. We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dean Chen updated SPARK-6868: - Comment: was deleted (was: https://github.com/apache/spark/pull/5477) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
Dean Chen created SPARK-6868: Summary: Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1, 1.2.0, 1.1.1, 1.1.0 Reporter: Dean Chen The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
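A sketch of the proposed scheme selection, mirroring the JobBlock logic linked above. This is illustrative only, not the actual patch; the helper name is invented, though YarnConfiguration.useHttps is a real Hadoop API:
{code}
// Illustrative only: choose the link prefix from the YARN http policy.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration

def logUriScheme(conf: Configuration): String =
  if (YarnConfiguration.useHttps(conf)) "https://" else "http://"
{code}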
[jira] [Assigned] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6869: --- Assignee: Apache Spark Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Assignee: Apache Spark Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
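A minimal sketch of the idea, with illustrative names (the real change lives in the YARN executor launch path, and the map shown here stands in for the executor's launch environment):
{code}
// Illustrative: forward the driver-side PYTHONPATH (e.g. set in spark-env.sh)
// into the executor's launch environment, so the executor's Python worker can
// import pyspark from the local file system instead of from the assembly jar.
val executorEnv = collection.mutable.Map[String, String]()
sys.env.get("PYTHONPATH").foreach(p => executorEnv("PYTHONPATH") = p)
{code}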
[jira] [Commented] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491373#comment-14491373 ] Apache Spark commented on SPARK-6869: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5478 Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6869: --- Assignee: (was: Apache Spark) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
Weizhong created SPARK-6870: --- Summary: Catch InterruptedException when yarn application state monitor thread been interrupted Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
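A sketch of the intended handling, assuming a monitor thread shaped like the one added in PR #5305 (the body of the loop is a placeholder):
{code}
// Illustrative: exit quietly when the monitor thread is interrupted,
// instead of letting the InterruptedException stack trace hit the logs.
val monitorThread = new Thread("yarn-state-monitor") {
  override def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        // ... poll the YARN application state here ...
        Thread.sleep(1000)
      }
    } catch {
      case _: InterruptedException => // expected on shutdown; exit quietly
    }
  }
}
{code}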
[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dean Chen updated SPARK-6868: - Attachment: Screen Shot 2015-04-11 at 11.49.21 PM.png Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6868: --- Assignee: Apache Spark Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Assignee: Apache Spark Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491354#comment-14491354 ] Dean Chen commented on SPARK-6868: -- https://github.com/apache/spark/pull/5477 Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6868: --- Assignee: (was: Apache Spark) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491355#comment-14491355 ] Apache Spark commented on SPARK-6868: - User 'deanchen' has created a pull request for this issue: https://github.com/apache/spark/pull/5477 Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dean Chen updated SPARK-6868: - Component/s: (was: Spark Core) YARN Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
Weizhong created SPARK-6869: --- Summary: Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6870: --- Assignee: Apache Spark Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Assignee: Apache Spark Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491380#comment-14491380 ] Apache Spark commented on SPARK-6870: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5479 Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6870: --- Assignee: (was: Apache Spark) Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6866) Cleanup duplicated dependency in pom.xml
[ https://issues.apache.org/jira/browse/SPARK-6866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6866: - Due Date: (was: 15/Apr/15) Priority: Trivial (was: Minor) Assignee: Guancheng Chen Cleanup duplicated dependency in pom.xml Key: SPARK-6866 URL: https://issues.apache.org/jira/browse/SPARK-6866 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Guancheng Chen Assignee: Guancheng Chen Priority: Trivial Labels: build, maven Fix For: 1.4.0 It turns out launcher/pom.xml has a duplicated scalatest dependency. We should remove it from this child pom.xml since it already inherits the dependency from the parent pom.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6866) Cleanup duplicated dependency in pom.xml
[ https://issues.apache.org/jira/browse/SPARK-6866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6866. -- Resolution: Fixed Issue resolved by pull request 5476 [https://github.com/apache/spark/pull/5476] Cleanup duplicated dependency in pom.xml Key: SPARK-6866 URL: https://issues.apache.org/jira/browse/SPARK-6866 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Guancheng Chen Priority: Minor Labels: build, maven Fix For: 1.4.0 It turns out launcher/pom.xml has a duplicated scalatest dependency. We should remove it from this child pom.xml since it already inherits the dependency from the parent pom.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491420#comment-14491420 ] Harsh Gupta commented on SPARK-761: --- [~aash] How do I do a compatibility check on the API over which they talk? Can you give a bit more specific detail on how to proceed? I can do it as a starter task to understand the core of Spark's functioning, and that will get me going. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client-server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6842) mvn -DskipTests clean package fails
[ https://issues.apache.org/jira/browse/SPARK-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sree Vaddi closed SPARK-6842. - Build successful. mvn -DskipTests clean package fails --- Key: SPARK-6842 URL: https://issues.apache.org/jira/browse/SPARK-6842 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: CentOS v7 Oracle JDK 8 w/ Unlimited Strength Crypto jars Reporter: Sree Vaddi Priority: Blocker Attachments: mvn.clean.package.log Fork on github $ git clone https://github.com/userid/spark.git $ cd spark $ mvn -DskipTests clean package ... ... wait 39 minutes === My diagnosis: By default, I am on the 'master' branch. Usually, 'master' branches are highly volatile. Maybe I should try 'branch-1.3'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6871) WITH clause in CTE can not following another WITH clause
Liang-Chi Hsieh created SPARK-6871: -- Summary: WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
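For contrast, the standard way to chain CTEs is a single WITH clause with comma-separated definitions. A sketch, assuming a SQLContext or HiveContext named sqlContext and the same testData table:
{code}
// Valid: one WITH clause, multiple comma-separated CTE definitions.
val df = sqlContext.sql("""
  WITH q1 AS (SELECT * FROM testData),
       q2 AS (SELECT * FROM q1)
  SELECT * FROM q2
""")
{code}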
[jira] [Commented] (SPARK-6871) WITH clause in CTE can not following another WITH clause
[ https://issues.apache.org/jira/browse/SPARK-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491401#comment-14491401 ] Apache Spark commented on SPARK-6871: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5480 WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6871) WITH clause in CTE can not following another WITH clause
[ https://issues.apache.org/jira/browse/SPARK-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6871: --- Assignee: (was: Apache Spark) WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6545) Minor changes for CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-6545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6545. -- Resolution: Won't Fix I think this is WontFix given https://github.com/apache/spark/pull/5199 but reopen if I misunderstood. Minor changes for CompactBuffer --- Key: SPARK-6545 URL: https://issues.apache.org/jira/browse/SPARK-6545 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor HashedRelation should always return a non-null CompactBuffer, which will be helpful for the further improvement of multi-way join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1303) Added discretization capability to MLlib.
[ https://issues.apache.org/jira/browse/SPARK-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1303. -- Resolution: Won't Fix Sounds like this should start outside MLlib: https://github.com/apache/spark/pull/216 Added discretization capability to MLlib. - Key: SPARK-1303 URL: https://issues.apache.org/jira/browse/SPARK-1303 Project: Spark Issue Type: New Feature Components: MLlib Reporter: LIDIAgroup Some time ago, we discussed with Ameet Talwalkar the possibility of including both Feature Selection and Discretization algorithms in MLlib. In this patch we've implemented Entropy Minimization Discretization, following the algorithm described in the paper Multi-interval discretization of continuous-valued attributes for classification learning by Fayyad and Irani (1993). This is one of the most used discretizers and is already included in most libraries, like Weka. This can be used as a base for FS algorithms and the NaiveBayes already included in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6864) Spark's Multilabel Classifier runs out of memory on small datasets
[ https://issues.apache.org/jira/browse/SPARK-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491437#comment-14491437 ] Sean Owen commented on SPARK-6864: -- I believe this is the *driver* process running out of memory. You have massive executors but the driver is probably still on 512MB of RAM. Try increasing that. I think everything else like your executors and data size is irrelevant then and orders of magnitude larger than is needed for this data set. Spark's Multilabel Classifier runs out of memory on small datasets -- Key: SPARK-6864 URL: https://issues.apache.org/jira/browse/SPARK-6864 Project: Spark Issue Type: Test Components: MLlib Affects Versions: 1.2.1 Environment: EC2 with 8-96 instances up to r3.4xlarge The test fails on every configuration Reporter: John Canny Fix For: 1.2.1 When trying to run Spark's MultiLabel classifier (LogisticRegressionWithLBFGS) on the RCV1 V2 dataset (about 0.5GB, 100 labels), the classifier runs out of memory. The number of tasks per executor doesn't seem to matter. It happens even with a single task per 120 GB executor. The dataset is the concatenation of the test files from the rcv1v2 (topics; full sets) group here: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html Here's the code: import org.apache.spark.SparkContext import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.optimization.L1Updater import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.util.MLUtils import scala.compat.Platform._ val nnodes = 8 val t0=currentTime // Load training data in LIBSVM format. val train = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1train.libsvm", true, 276544, nnodes) val test = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1test.libsvm", true, 276544, nnodes) val t1=currentTime; val lrAlg = new LogisticRegressionWithLBFGS() lrAlg.setNumClasses(100).optimizer. setNumIterations(10). setRegParam(1e-10). setUpdater(new L1Updater) // Run training algorithm to build the model val model = lrAlg.run(train) val t2=currentTime -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
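If the driver-memory diagnosis holds, this is a one-setting change; 8g below is an arbitrary example size, not a recommendation from the thread:
{code}
import org.apache.spark.SparkConf

// Note: spark.driver.memory must be set before the driver JVM starts, so in
// client mode pass --driver-memory 8g to spark-submit instead of using SparkConf.
val conf = new SparkConf().set("spark.driver.memory", "8g")
{code}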
[jira] [Assigned] (SPARK-6871) WITH clause in CTE can not following another WITH clause
[ https://issues.apache.org/jira/browse/SPARK-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6871: --- Assignee: Apache Spark WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491459#comment-14491459 ] Stefano Parmesan commented on SPARK-6677: - Glad it helped! We're very eager to try it out. pyspark.sql nondeterministic issue with row fields -- Key: SPARK-6677 URL: https://issues.apache.org/jira/browse/SPARK-6677 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Environment: spark version: spark-1.3.0-bin-hadoop2.4 python version: Python 2.7.6 operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux Reporter: Stefano Parmesan Assignee: Davies Liu Labels: pyspark, row, sql Fix For: 1.3.1, 1.4.0 The following issue happens only when running pyspark in the python interpreter; it works correctly with spark-submit. Reading two json files containing objects with different structures sometimes leads to the definition of wrong Rows, where the fields of one file are used for the other one. I was able to write sample code that reproduces this issue one out of three times; the code snippet is available at the following link, together with some (very simple) data samples: https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6867) Dropout regularization
[ https://issues.apache.org/jira/browse/SPARK-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6867: - Target Version/s: (was: 1.4.0) Dropout regularization -- Key: SPARK-6867 URL: https://issues.apache.org/jira/browse/SPARK-6867 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Rakesh Chalasani Priority: Minor Linear models in MLlib so far support no regularization, L1, and L2. Another, more recently popularized, method for regularization is dropout [http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf]. Dropout regularization basically randomly omits some of the input features at each iteration. Though this approach is particularly used in training deep networks, it could also be very useful for linear models, as it promotes adaptive regularization. This approach is particularly useful in NLP [http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf] and, because of its simplicity, can be easily adopted for streaming linear models as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
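A toy sketch of the core idea, not tied to any MLlib API: each feature is dropped with probability p at every iteration, and the inverted scaling by 1/(1-p) keeps the expected feature value unchanged:
{code}
import scala.util.Random

// Zero out each feature with probability p; scale survivors so that
// E[output] == input (the usual "inverted dropout" convention).
def dropout(features: Array[Double], p: Double, rng: Random): Array[Double] =
  features.map(x => if (rng.nextDouble() < p) 0.0 else x / (1.0 - p))
{code}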
[jira] [Resolved] (SPARK-6843) Potential visibility problem for the state of Executor
[ https://issues.apache.org/jira/browse/SPARK-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6843. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5448 [https://github.com/apache/spark/pull/5448] Potential visibility problem for the state of Executor Key: SPARK-6843 URL: https://issues.apache.org/jira/browse/SPARK-6843 Project: Spark Issue Type: Bug Components: Spark Core Reporter: zhichao-li Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6843) Potential visibility problem for the state of Executor
[ https://issues.apache.org/jira/browse/SPARK-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6843: - Priority: Trivial (was: Minor) Assignee: zhichao-li Potential visibility problem for the state of Executor Key: SPARK-6843 URL: https://issues.apache.org/jira/browse/SPARK-6843 Project: Spark Issue Type: Bug Components: Spark Core Reporter: zhichao-li Assignee: zhichao-li Priority: Trivial Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6842) mvn -DskipTests clean package fails
[ https://issues.apache.org/jira/browse/SPARK-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sree Vaddi updated SPARK-6842: -- Attachment: mvn.clean.package.log mvn package is successful on my machine now. Previously, I was working in a VM with the code on a shared file system from the local host. Instead, I checked out the code to the VM's local file system; no other changes. Attached the build log. High-level steps (useful for newbies): install VirtualBox, create a new VM, use CentOS v7, git clone, mvn package. mvn -DskipTests clean package fails --- Key: SPARK-6842 URL: https://issues.apache.org/jira/browse/SPARK-6842 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: CentOS v7 Oracle JDK 8 w/ Unlimited Strength Crypto jars Reporter: Sree Vaddi Priority: Blocker Attachments: mvn.clean.package.log Fork on github $ git clone https://github.com/userid/spark.git $ cd spark $ mvn -DskipTests clean package ... ... wait 39 minutes === My diagnosis: By default, I am on the 'master' branch. Usually, 'master' branches are highly volatile. Maybe I should try 'branch-1.3'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6151) schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size
[ https://issues.apache.org/jira/browse/SPARK-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491502#comment-14491502 ] Sree Vaddi commented on SPARK-6151: --- [~cnstar9988] The HDFS block size is set once, when you first install Hadoop. It is possible to change the HDFS block size in your Hadoop configuration and restart Hadoop for the change to take effect (read the literature and make sure you are comfortable before you make this change). Then you can run saveAsParquetFile(), which will now use the new HDFS block size. schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size --- Key: SPARK-6151 URL: https://issues.apache.org/jira/browse/SPARK-6151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Littlestar Priority: Trivial How can a schemaRDD written to a parquet file with saveAsParquetFile control the HDFS block size? Maybe a Configuration is needed. Related questions by others: http://apache-spark-user-list.1001560.n3.nabble.com/HDFS-block-size-for-parquet-output-tt21183.html http://qnalist.com/questions/5054892/spark-sql-parquet-and-impala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
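A possible per-job alternative, under the assumption that the Hadoop client honors dfs.blocksize on the job's configuration rather than only cluster-wide (sc and schemaRDD are assumed to exist, as in the question):
{code}
// Assumption: dfs.blocksize is read from the client-side Hadoop configuration
// at write time, so setting it on the job's configuration affects new files.
sc.hadoopConfiguration.set("dfs.blocksize", (256L << 20).toString) // 256 MB

schemaRDD.saveAsParquetFile("hdfs:///path/out.parquet")
{code}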
[jira] [Created] (SPARK-6872) external sort need to copy
Adrian Wang created SPARK-6872: -- Summary: external sort need to copy Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
Sean Owen created SPARK-6873: Summary: Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and actually it still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, would probably be an easy fix and good to fix. Specifically, here are four tests and the bit that fails the comparison, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms? {code} - show_tblproperties *** FAILED *** Results do not match for show_tblproperties: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == !tmp truebar bar value !bar bar value tmp true (HiveComparisonTest.scala:391) {code} {code} - show_create_table_serde *** FAILED *** Results do not match for show_create_table_serde: ... WITH SERDEPROPERTIES ( WITH SERDEPROPERTIES ( ! 'serialization.format'='$', 'field.delim'=',', ! 'field.delim'=',') 'serialization.format'='$') {code} {code} - udf_std *** FAILED *** Results do not match for udf_std: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == std(x) - Returns the standard deviation of a set of numbers std(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stddev Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391) {code} {code} - udf_stddev *** FAILED *** Results do not match for udf_stddev: ... !== HIVE - 2 row(s) ==== CATALYST - 2 row(s) == stddev(x) - Returns the standard deviation of a set of numbers stddev(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stdSynonyms: std, stddev_pop (HiveComparisonTest.scala:391) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
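One low-risk direction, if the differences really are only ordering: canonicalize the printed output before comparing. A purely illustrative sketch, not necessarily where the real fix belongs; it sorts rows (covering show_tblproperties) and sorts synonym tokens within a line (covering udf_std/udf_stddev):
{code}
// Illustrative normalization for the failures above.
def canonicalize(lines: Seq[String]): Seq[String] =
  lines.map { line =>
    if (line.startsWith("Synonyms: "))
      "Synonyms: " + line.stripPrefix("Synonyms: ").split(",\\s*").sorted.mkString(", ")
    else line
  }.sorted
{code}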
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491575#comment-14491575 ] Cheng Lian commented on SPARK-6859: --- A better way could be to make a defensive copy while inserting byte arrays into Parquet, so that we don't suffer a read performance regression. Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD\[Row\], and each row is GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is reused among rows but has different content each time. When I convert it to a dataFrame and save it as a Parquet file, the file's row group statistics (max/min) of the Binary column are wrong. \\ \\ Here is the reason: in Parquet, BinaryStatistic just keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array\[Byte\] passed from the row: max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array\[Byte\]. Therefore, each time Parquet updates the row group's statistics, max/min always refer to the same Array\[Byte\], which has new content each time. When Parquet decides to save them into the file, the last row's content is saved as both max and min. \\ \\ It seems to be a Parquet bug, because it's Parquet's responsibility to update statistics correctly. But I'm not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
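A sketch of the defensive-copy idea; the helper name is invented, and Binary.fromByteArray reflects the parquet-mr API of that era, so treat the exact call as an assumption:
{code}
// Illustrative: freeze the reused buffer before handing it to Parquet, so the
// Binary kept by the column statistics no longer aliases the mutable array.
import parquet.io.api.Binary

def toStableBinary(reused: Array[Byte]): Binary =
  Binary.fromByteArray(java.util.Arrays.copyOf(reused, reused.length))
{code}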
[jira] [Commented] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491509#comment-14491509 ] Apache Spark commented on SPARK-6872: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/5481 external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6872: --- Assignee: (was: Apache Spark) external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6872: --- Assignee: Apache Spark external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6431. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5454 [https://github.com/apache/spark/pull/5454] Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto Fix For: 1.4.0 When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the kafka topic does not have any messages yet, I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5107) A trick log info for the start of Receiver
[ https://issues.apache.org/jira/browse/SPARK-5107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491626#comment-14491626 ] Sree Vaddi commented on SPARK-5107: --- [~srowen] This may be closed. I would do it myself, but I do not have edit permissions. A trick log info for the start of Receiver -- Key: SPARK-5107 URL: https://issues.apache.org/jira/browse/SPARK-5107 Project: Spark Issue Type: Improvement Components: Streaming Reporter: uncleGen Priority: Trivial A Receiver registers itself whenever it begins to start, but it is tricky that the same information is logged twice. In particular, at preStart() it also registers itself, so it looks as if the receiver has started twice. Just like: !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/3.JPG! We could log the information more clearly, for example the number of attempts to start. Of course, this affects neither performance nor usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491625#comment-14491625 ] Sree Vaddi commented on SPARK-5364: --- [~srowen] This may be closed. I would do it myself, but I do not have edit permissions. HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5364. Resolution: Duplicate Fix Version/s: (was: 1.3.1) 1.3.0 HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial Fix For: 1.3.0 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491558#comment-14491558 ] Cheng Lian commented on SPARK-6859: --- For 1.3 and prior versions, this issue isn't that serious, since strings are immutable. But in 1.4 we are adding a mutable UTF8String ([PR #5350|https://github.com/apache/spark/pull/5350]). Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD\[Row\], and each row is GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is reused among rows but has different content each time. When I convert it to a dataFrame and save it as a Parquet file, the file's row group statistics (max/min) of the Binary column are wrong. \\ \\ Here is the reason: in Parquet, BinaryStatistic just keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array\[Byte\] passed from the row: max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array\[Byte\]. Therefore, each time Parquet updates the row group's statistics, max/min always refer to the same Array\[Byte\], which has new content each time. When Parquet decides to save them into the file, the last row's content is saved as both max and min. \\ \\ It seems to be a Parquet bug, because it's Parquet's responsibility to update statistics correctly. But I'm not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
[ https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491567#comment-14491567 ] Sean Owen commented on SPARK-6873: -- CC [~lian cheng] [~marmbrus] as I bet this would be fairly easy to diagnose for someone close to the query planner / catalyst bits. Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements -- Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and actually it still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, would probably be an easy fix and good to fix. Specifically, here are four tests and the bit that fails the comparison, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms? {code} - show_tblproperties *** FAILED *** Results do not match for show_tblproperties: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == !tmptruebar bar value !barbar value tmp true (HiveComparisonTest.scala:391) {code} {code} - show_create_table_serde *** FAILED *** Results do not match for show_create_table_serde: ... WITH SERDEPROPERTIES ( WITH SERDEPROPERTIES ( ! 'serialization.format'='$', 'field.delim'=',', ! 'field.delim'=',') 'serialization.format'='$') {code} {code} - udf_std *** FAILED *** Results do not match for udf_std: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == std(x) - Returns the standard deviation of a set of numbers std(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stddev Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391) {code} {code} - udf_stddev *** FAILED *** Results do not match for udf_stddev: ... !== HIVE - 2 row(s) ==== CATALYST - 2 row(s) == stddev(x) - Returns the standard deviation of a set of numbers stddev(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stdSynonyms: std, stddev_pop (HiveComparisonTest.scala:391) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6431: - Assignee: Cody Koeninger Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto Assignee: Cody Koeninger Fix For: 1.4.0 When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the kafka topic does not have any messages yet, I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
[ https://issues.apache.org/jira/browse/SPARK-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6874: --- Assignee: (was: Apache Spark) Add support for SQL:2003 array type declaration syntax -- Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax: BIGINT ARRAY BIGINT ARRAY[100] BIGINT ARRAY[100] ARRAY[200] It would be great to support the standard syntax here. Some additional details that this addition should have IMO: - Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100] - Ignore the maximum capacity (ARRAY[N]) but allow it to be specified. This seems to be what others (i.e. PostgreSQL) are doing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
[ https://issues.apache.org/jira/browse/SPARK-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491573#comment-14491573 ] Apache Spark commented on SPARK-6874: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/5483 Add support for SQL:2003 array type declaration syntax -- Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax:
{code}
BIGINT ARRAY
BIGINT ARRAY[100]
BIGINT ARRAY[100] ARRAY[200]
{code}
It would be great to have support for the standard syntax here. Some additional details that this addition should have, IMO:
- Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100]
- Ignore the maximum capacity (ARRAY[N]) but allow it to be specified; this seems to be what others (e.g. PostgreSQL) are doing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
[ https://issues.apache.org/jira/browse/SPARK-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6874: --- Assignee: Apache Spark Add support for SQL:2003 array type declaration syntax -- Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Apache Spark Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax:
{code}
BIGINT ARRAY
BIGINT ARRAY[100]
BIGINT ARRAY[100] ARRAY[200]
{code}
It would be great to have support for the standard syntax here. Some additional details that this addition should have, IMO:
- Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100]
- Ignore the maximum capacity (ARRAY[N]) but allow it to be specified; this seems to be what others (e.g. PostgreSQL) are doing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491548#comment-14491548 ] Cheng Lian commented on SPARK-6859: --- [~yijieshen] Thanks for reporting! And yes, please also open a JIRA ticket for Parquet and link it with this one so that it's easier to track. [~marmbrus] I guess we should disable pushing down filters involving binary type before this bug is fixed in Parquet. Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD[Row], and each row is GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused among rows but has different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row group statistics (max/min) for the Binary column are wrong. Here is the reason: in Parquet, BinaryStatistics just keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed from the row:
{code}
max: Binary --(references)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]
{code}
Therefore, each time Parquet updates the row group's statistics, max and min always refer to the same Array[Byte], which has new content each time. When Parquet saves the statistics into the file, the last row's content is saved as both max and min. It seems to be a Parquet bug, since it's Parquet's responsibility to update statistics correctly, but I'm not quite sure. Should I report it as a bug in the Parquet JIRA?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
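A minimal way to reproduce the reuse pattern described above (spark-shell style for Spark 1.3; the schema, values, and output path are made-up examples, and `sc`/`sqlContext` are assumed to exist):
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("payload", BinaryType)))

// One shared buffer per partition: every Row references the same Array[Byte],
// whose contents are overwritten in place for each row.
val rows = sc.parallelize(1 to 3, 1).mapPartitions { iter =>
  val buf = new Array[Byte](4)
  iter.map { i =>
    java.util.Arrays.fill(buf, i.toByte)
    Row(i, buf)
  }
}

sqlContext.createDataFrame(rows, schema).saveAsParquetFile("/tmp/reuse-demo")
// Because Parquet's Binary statistics hold references into buf, the row
// group's min and max for "payload" both end up equal to the last row's bytes.
{code}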
[jira] [Created] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
Santiago M. Mola created SPARK-6874: --- Summary: Add support for SQL:2003 array type declaration syntax Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax:
{code}
BIGINT ARRAY
BIGINT ARRAY[100]
BIGINT ARRAY[100] ARRAY[200]
{code}
It would be great to have support for the standard syntax here. Some additional details that this addition should have, IMO:
- Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100]
- Ignore the maximum capacity (ARRAY[N]) but allow it to be specified; this seems to be what others (e.g. PostgreSQL) are doing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
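To illustrate the request, here is a rough sketch of how the proposed declarations would sit next to Spark's existing style (spark-shell style with a HiveContext; `hiveContext` and the table/column names are assumptions, not part of the ticket):
{code}
// Existing Spark/Hive-style declaration:
hiveContext.sql("CREATE TABLE t1 (xs ARRAY<BIGINT>)")

// Proposed SQL:2003-style equivalents:
hiveContext.sql("CREATE TABLE t2 (xs BIGINT ARRAY)")       // unbounded array
hiveContext.sql("CREATE TABLE t3 (xs BIGINT ARRAY[100])")  // capacity parsed but ignored

// Mixed forms such as "ARRAY<INT> ARRAY[100]" would be rejected by the parser.
{code}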
[jira] [Resolved] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5364. Resolution: Fixed Fix Version/s: 1.3.1 HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial Fix For: 1.3.1 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
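The "non output clause" here means a TRANSFORM call with no trailing AS (...) column list. A sketch of both forms (spark-shell style; `hiveContext` and Hive's usual test table `src` are assumptions):
{code}
// The form this fix enables: TRANSFORM without an output clause, so the
// output columns take default names and types.
hiveContext.sql("SELECT TRANSFORM (key + 1, value) USING '/bin/cat' FROM src")

// The form with an explicit output clause, which already worked:
hiveContext.sql(
  "SELECT TRANSFORM (key, value) USING '/bin/cat' AS (k STRING, v STRING) FROM src")
{code}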
[jira] [Reopened] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-5364: HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial Fix For: 1.3.0 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4801) Add CTE capability to HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4801. - Resolution: Duplicate Fix Version/s: 1.4.0 Add CTE capability to HiveContext - Key: SPARK-4801 URL: https://issues.apache.org/jira/browse/SPARK-4801 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacob Davis Fix For: 1.4.0 This is a request to add CTE functionality to HiveContext. Common Table Expressions were added in Hive 0.13.0 with HIVE-1180. Using CTE-style syntax within HiveContext currently results in the following "Caused by" message:
{code}
Caused by: scala.MatchError: TOK_CTE (of class org.apache.hadoop.hive.ql.parse.ASTNode)
  at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
  at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500)
  at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
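An example of the CTE syntax that hits the MatchError above (spark-shell style; `hiveContext` and Hive's usual test table `src` are assumptions):
{code}
// WITH ... AS (...) defines a Common Table Expression that the following
// SELECT refers to by name.
hiveContext.sql(
  """WITH q1 AS (SELECT key FROM src WHERE key < 10)
    |SELECT key FROM q1""".stripMargin)
{code}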
[jira] [Closed] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-5364. - Assignee: Liang-Chi Hsieh HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Liang-Chi Hsieh Priority: Trivial Fix For: 1.3.0 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4760. - Resolution: Fixed Fix Version/s: 1.3.0 The native Parquet support (which is used for both Spark SQL and Hive DDL by default) automatically computes sizes starting with Spark 1.3, so running ANALYZE is no longer needed for auto broadcast joins. Please reopen if you see any issues with this new feature. ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get an estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED:
{code}
old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
{code}
I've also tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf, and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
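For reference, the statement in question (spark-shell style; `hiveContext` and the table name are assumptions). With noscan, Hive collects only file-level statistics such as numFiles and totalSize; without it, the data is scanned and numRows/rawDataSize are filled in as well:
{code}
// Fast, file-level statistics only (totalSize, numFiles):
hiveContext.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS noscan")

// Full scan: additionally computes numRows and rawDataSize.
hiveContext.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS")
{code}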
[jira] [Updated] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1412: Summary: Disable partial aggregation automatically when reduction factor is low (was: [SQL] Disable partial aggregation automatically when reduction factor is low) Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor Once we have seen a large enough number of rows during partial aggregation without observing any reduction, the aggregate operator should just turn off partial aggregation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
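A hypothetical sketch of that heuristic (not Spark's actual aggregate operator; all names and thresholds are illustrative): aggregate the first N input rows into the hash map, and if the number of distinct groups is close to the number of rows seen, stop aggregating and pass the remaining rows through to the shuffle unchanged.
{code}
import scala.collection.mutable

def partialAggregate[K, V](
    rows: Iterator[(K, V)],
    merge: (V, V) => V,
    sampleSize: Int = 1000,
    minReduction: Double = 0.5): Iterator[(K, V)] = {
  val groups = mutable.HashMap.empty[K, V]
  var seen = 0
  // Aggregate a sample of the input first.
  while (rows.hasNext && seen < sampleSize) {
    val (k, v) = rows.next(); seen += 1
    groups(k) = groups.get(k).map(merge(_, v)).getOrElse(v)
  }
  // Fraction of rows eliminated by pre-aggregation so far.
  val reduction = 1.0 - groups.size.toDouble / math.max(seen, 1)
  if (reduction >= minReduction) {
    // Pre-aggregation is paying off: keep merging the rest into the map.
    rows.foreach { case (k, v) =>
      groups(k) = groups.get(k).map(merge(_, v)).getOrElse(v)
    }
    groups.iterator
  } else {
    // Barely any reduction observed: emit what we have, then pass the
    // remaining rows through unaggregated.
    groups.iterator ++ rows
  }
}
{code}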
[jira] [Updated] (SPARK-1412) [SQL] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1412: Assignee: (was: Michael Armbrust) [SQL] Disable partial aggregation automatically when reduction factor is low Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor Once we have seen a large enough number of rows during partial aggregation without observing any reduction, the aggregate operator should just turn off partial aggregation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org