[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954584#comment-15954584 ] Liang-Chi Hsieh commented on SPARK-20193: - Actually I am not sure what {{struct()}} represents. If you want a null for this struct, you can write: {code} spark.range(3).select(col("id"), lit(null).cast(new StructType())) {code} > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu > Labels: struct > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 39 elided > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20207) Add ability to exclude current row in WindowSpec
Mathew Wicks created SPARK-20207: Summary: Add ability to exclude current row in WindowSpec Key: SPARK-20207 URL: https://issues.apache.org/jira/browse/SPARK-20207 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Mathew Wicks Priority: Minor It would be useful if we could implement a way to exclude the current row in WindowSpec. (We can currently only select ranges of rows/time.) Currently, users have to resort to ridiculous measures to exclude the current row from windowing aggregations, as seen here: http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839
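The Stack Overflow workaround linked above boils down to aggregating over the full window and then subtracting the current row's own contribution. A minimal sketch of that logic in plain Scala collections (the names `Sale` and `sumOfOthers` are illustrative, not a Spark API):

```scala
// Each "partition" plays the role of one window partition.
case class Sale(region: String, amount: Double)

// Aggregate over the whole partition, then subtract the current row's value
// to get a per-row aggregate that "excludes the current row".
def sumOfOthers(partition: Seq[Sale]): Seq[(Sale, Double)] = {
  val total = partition.map(_.amount).sum           // full-window aggregate
  partition.map(s => (s, total - s.amount))         // minus the current row
}

val sales = Seq(Sale("us", 10.0), Sale("us", 20.0), Sale("us", 30.0))
val result = sumOfOthers(sales).map(_._2)           // Seq(50.0, 40.0, 30.0)
```

In Spark the same trick is done with a window aggregate minus the current row's column; a first-class "exclude current row" frame specifier would make that subtraction unnecessary.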
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954559#comment-15954559 ] Liang-Chi Hsieh commented on SPARK-20144: - I don't think the API guarantees anything about data ordering. The difference between 1.6.3 and 2.0.2 is just due to a change of internal implementation. I checked the current FileSourceScanExec; it still reorders the partition files. When you save sorted data into Parquet, only the data within an individual Parquet file maintains its ordering. We shouldn't expect a particular ordering of the whole data read back if the API doesn't explicitly guarantee it. > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > that when we read Parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > Parquet file was produced with. > This is because FileSourceStrategy.scala combines the Parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so I'm not sure if this is an issue with > 2.1.
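Since the read API makes no ordering guarantee, the robust pattern is to persist an explicit ordering key alongside the data and sort on it after reading. A minimal sketch of the idea in plain Scala, where the shuffle stands in for a reader that recombines partition files in arbitrary order (identifiers here are illustrative):

```scala
import scala.util.Random

// Carry an explicit ordering key (here, a zipWithIndex position) in the data.
val original: Seq[(Long, String)] =
  Seq("a", "b", "c", "d").zipWithIndex.map { case (v, i) => (i.toLong, v) }

// Simulate a reader that returns rows in arbitrary, recombined order.
val readBack = Random.shuffle(original)

// Recover the intended order from the key rather than relying on file order.
val restored = readBack.sortBy(_._1).map(_._2)
```

In Spark terms this corresponds to writing an index column with the data and calling `orderBy` on it after `spark.read.parquet`, instead of assuming the rows come back in write order.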
[jira] [Updated] (SPARK-20079) Re-registration of AM hangs Spark cluster in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-20079: Description: The ExecutorAllocationManager.reset method is called when the AM re-registers, which sets the ExecutorAllocationManager.initializing field to true. While this field is true, the Driver does not start a new executor from the AM request. The following two cases set the field back to false: 1. An executor is idle for some time. 2. There are new stages to be submitted. After a stage has been submitted, if the AM is killed and restarted, neither case occurs: 1. When the AM is killed, YARN kills all running containers, so all executors are lost and no executor is ever idle. 2. With no surviving executors, the current stage can never complete, so the DAG scheduler never submits a new stage. Reproduction steps: 1. Start cluster {noformat} echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2 {noformat} 2. Kill the AM process when a stage is scheduled. was: 1. Start cluster echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2 2. Kill the AM process when a stage is scheduled. > Re-registration of AM hangs Spark cluster in yarn-client mode > - > > Key: SPARK-20079 > URL: https://issues.apache.org/jira/browse/SPARK-20079 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Guoqiang Li > > The ExecutorAllocationManager.reset method is called when the AM re-registers, > which sets the ExecutorAllocationManager.initializing field to true. While this > field is true, the Driver does not start a new executor from the AM request. > The following two cases set the field back to false: > 1. An executor is idle for some time. > 2. There are new stages to be submitted. > After a stage has been submitted, if the AM is killed and restarted, neither > case occurs: > 1. When the AM is killed, YARN kills all running containers, so all executors > are lost and no executor is ever idle. > 2. With no surviving executors, the current stage can never complete, so the > DAG scheduler never submits a new stage. > Reproduction steps: > 1. Start cluster > {noformat} > echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | > ./bin/spark-shell --master yarn-client --executor-cores 1 --conf > spark.shuffle.service.enabled=true --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.maxExecutors=2 > {noformat} > 2. Kill the AM process when a stage is scheduled.
[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader
[ https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954533#comment-15954533 ] Daniel Erenrich commented on SPARK-11421: - Is this not basically a duplicate of the much older https://issues.apache.org/jira/browse/SPARK-5377 > Add the ability to add a jar to the current class loader > > > Key: SPARK-11421 > URL: https://issues.apache.org/jira/browse/SPARK-11421 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: holdenk >Priority: Minor > > addJar adds jars for future operations, but it could also add to the current > class loader; this would be really useful, most likely in Python & R, where > some included Python code may wish to add some jars.
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954527#comment-15954527 ] Dinesh Man Amatya commented on SPARK-20176: --- Thanks Kazuaki for the effort. I was able to resolve the issue by upgrading the spark and scala version as follows, scala.version : 2.11.5 scala.compat.version : 2.11 spark.version : 2.1.0 > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ 
InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */ boolean
[jira] [Updated] (SPARK-20206) spark.ui.killEnabled=false property doesn't reflect on tasks/stages
[ https://issues.apache.org/jira/browse/SPARK-20206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] srinivasan updated SPARK-20206: --- Priority: Minor (was: Major) > spark.ui.killEnabled=false property doesn't reflect on tasks/stages > -- > > Key: SPARK-20206 > URL: https://issues.apache.org/jira/browse/SPARK-20206 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: srinivasan >Priority: Minor > > The spark.ui.killEnabled=false property doesn't reflect on active tasks and > stages: the kill hyperlink is still enabled on active tasks and stages.
[jira] [Created] (SPARK-20206) spark.ui.killEnabled=false property doesn't reflect on tasks/stages
srinivasan created SPARK-20206: -- Summary: spark.ui.killEnabled=false property doesn't reflect on tasks/stages Key: SPARK-20206 URL: https://issues.apache.org/jira/browse/SPARK-20206 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: srinivasan The spark.ui.killEnabled=false property doesn't reflect on active tasks and stages: the kill hyperlink is still enabled on active tasks and stages.
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954447#comment-15954447 ] Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:47 AM: -- Actually, after re-thinking, it seems we would not need this for now if not many users request this. Now we can do a workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. was (Author: hyukjin.kwon): Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports sampling ratio option. 
> It would be great if CSV data source has this option too (or is this > supported already?).
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954447#comment-15954447 ] Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:40 AM: -- Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. was (Author: hyukjin.kwon): Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7) val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema spark.read.schema(sampledSchema).csv("/tmp/path") {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports sampling ratio option. 
> It would be great if CSV data source has this option too (or is this > supported already?).
[jira] [Resolved] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-14726. -- Resolution: Won't Fix Actually, after rethinking, it seems we would not need this for now if not many users request it. A workaround is as below: {code} val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7) val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema spark.read.schema(sampledSchema).csv("/tmp/path") {code} This also allows more dynamic options, e.g., sampling with or without replacement, filtering, or even just limit 100. I will keep an eye on similar issues and reopen this if it seems many users want it. Please reopen this if you strongly feel this should be supported as an option, or if anyone else feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports a sampling ratio option. > It would be great if CSV data source had this option too (or is this > supported already?).
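As a toy illustration of why the sampling workaround above helps (this is not Spark's actual inference code, and the simplified typing rules in `inferType` are my own assumption): schema inference only needs to scan enough rows to settle on a type, so inferring from a random sample and then applying the inferred schema to the full data avoids a full scan.

```scala
import scala.util.{Random, Try}

// Hypothetical, simplified type-inference rules over string values.
def inferType(sample: Seq[String]): String =
  if (sample.forall(v => Try(v.toLong).isSuccess)) "bigint"
  else if (sample.forall(v => Try(v.toDouble).isSuccess)) "double"
  else "string"

val fullColumn = (1 to 1000).map(_.toString)       // all values are integral
val sampled = Random.shuffle(fullColumn).take(100)  // infer from a sample only
val inferred = inferType(sampled)                   // "bigint"
```

The workaround in the comment does exactly this at the DataFrame level: `sample(false, 0.7)` bounds the inference scan, and the resulting schema is then applied to the full input.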
[jira] [Resolved] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception
[ https://issues.apache.org/jira/browse/SPARK-19186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19186. -- Resolution: Not A Problem ^ I agree with this. Also, to my knowledge, we can deal with the dialect as part of SPARK-17614, assuming the exception came from https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L60-L62 within Spark. I am resolving this per the issue described in this JIRA. Please reopen this if I misunderstood. > Hash symbol in middle of Sybase database table name causes Spark Exception > -- > > Key: SPARK-19186 > URL: https://issues.apache.org/jira/browse/SPARK-19186 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Adrian Schulewitz >Priority: Minor > > If I use a table name without a '#' symbol in the middle then no exception > occurs, but with one an exception is thrown. According to Sybase 15 > documentation a '#' is a legal character. > val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF" > val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via > SQL").setMaster("local[2]") > val sess = SparkSession > .builder() > .appName("MUREX DMart Simple SQL Reader") > .config(conf) > .getOrCreate() > import sess.implicits._ > val df = sess.read > .format("jdbc") > .option("url", > "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56") > .option("driver", "net.sourceforge.jtds.jdbc.Driver") > .option("dbtable", "CTP#ADR_TYPE_DBF") > .option("UDT_DEALCRD_REP", "mxdmart56") > .option("user", "INSTAL") > .option("password", "INSTALL") > .load() > df.createOrReplaceTempView("trades") > val resultsDF = sess.sql(testSql) > resultsDF.show() > 17/01/12 14:30:01 INFO SharedState: Warehouse path is > 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'. 
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades > 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM > CTP#ADR_TYPE_DBF > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', > 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', > 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', > 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', > 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', > 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', > 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', > 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', > 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', > 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', > 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 
'DFS', 'TRUNCATE', > 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', > 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', > 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', > 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', > 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, > BACKQUOTED_IDENTIFIER}(line 1, pos 17) > == SQL == > SELECT * FROM CTP#ADR_TYPE_DBF > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at >
[jira] [Resolved] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS
[ https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-10364. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15332 [https://github.com/apache/spark/pull/15332] > Support Parquet logical type TIMESTAMP_MILLIS > - > > Key: SPARK-10364 > URL: https://issues.apache.org/jira/browse/SPARK-10364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > Fix For: 2.2.0 > > > The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we > should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}, > but unfortunately parquet-mr hasn't supported it yet. > For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet > values and pad a 0 microsecond part onto the read values. > For the write path, currently we are writing timestamps as {{INT96}}, similar > to Impala and Hive. One alternative is that we can have a separate SQL > option to let users write Spark SQL timestamp values as > {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be > truncated.
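The read- and write-path conversions described in the issue are plain scale changes between milliseconds and Spark SQL's microsecond precision. A sketch with my own helper names (not Spark's internals; pre-epoch values would need floor semantics, which this sketch ignores):

```scala
// Read path: widen a TIMESTAMP_MILLIS value to microsecond precision by
// padding a zero microsecond part.
def millisToMicros(millis: Long): Long = millis * 1000L

// Write path (TIMESTAMP_MILLIS option): the sub-millisecond part of the
// microsecond timestamp is truncated.
def microsToMillis(micros: Long): Long = micros / 1000L
```

Round-tripping through the write path therefore loses the sub-millisecond remainder: `millisToMicros(microsToMillis(1234567L))` yields `1234000L`, which is the truncation the issue calls out.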
[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19408: Description: In SPARK-17075, we estimate cardinality of predicate expression "column (op) literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. In this jira, we want to estimate the filter factor of predicate expressions involving two columns of same table. For example, multiple tpc-h queries have this kind of predicate "WHERE l_commitdate < l_receiptdate". was: In SPARK-17075, we estimate cardinality of predicate expression "column (op) literal", where op is =, <, <=, >, or >=. In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. In this jira, we want to estimate the filter factor of predicate expressions involving two columns of same table. For example, multiple tpc-h queries have this kind of predicate "WHERE l_commitdate < l_receiptdate". > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > Fix For: 2.2.0 > > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. 
Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of the same table. For example, multiple tpc-h queries > have this kind of predicate "WHERE l_commitdate < l_receiptdate".
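For intuition only, here is a toy selectivity estimate for "colA < colB" built from per-column min/max statistics, assuming uniformly distributed, independent values. This is my own sketch, not the estimator proposed in this JIRA: disjoint ranges give an exact answer, and overlapping ranges fall back to a default guess, as simple estimators often do.

```scala
// Estimate the fraction of rows satisfying colA < colB from range statistics.
def selectivityLessThan(minA: Double, maxA: Double,
                        minB: Double, maxB: Double): Double =
  if (maxA < minB) 1.0        // every A value lies below every B value
  else if (minA >= maxB) 0.0  // no A value can lie below any B value
  else 0.5                    // ranges overlap: default estimate

// e.g. "WHERE l_commitdate < l_receiptdate" with overlapping date ranges
// would fall into the default-estimate branch.
```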
[jira] [Resolved] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19408. - Resolution: Fixed Assignee: Ron Hu Fix Version/s: 2.2.0 > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu >Assignee: Ron Hu > Fix For: 2.2.0 > > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to the same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of the same table. For example, multiple tpc-h queries > have this kind of predicate "WHERE l_commitdate < l_receiptdate".
[jira] [Resolved] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
[ https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20145. - Resolution: Fixed Assignee: sam elamin Fix Version/s: 2.2.0 > "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't > > > Key: SPARK-20145 > URL: https://issues.apache.org/jira/browse/SPARK-20145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski >Assignee: sam elamin > Fix For: 2.2.0 > > > Executed at clean tip of the master branch, with all default settings: > scala> spark.sql("SELECT * FROM range(1)") > res1: org.apache.spark.sql.DataFrame = [id: bigint] > scala> spark.sql("SELECT * FROM RANGE(1)") > org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a > table-valued function; line 1 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > ... > I believe it should be case insensitive?
[jira] [Comment Edited] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954387#comment-15954387 ] Mridul Muralidharan edited comment on SPARK-20205 at 4/4/17 12:15 AM: -- For the history server that will fail - good point. At least for custom listeners, users can work around it until the next release by using the current time (in their code, when the field submissionTime is None). Thanks for clarifying [~vanzin] ! was (Author: mridulm80): For the history server that will fail - good point. At least for custom listeners, users can work around it until the next release by using the current time. Thanks for clarifying [~vanzin] ! > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time.
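The workaround discussed above, having a custom listener substitute the current wall-clock time when submissionTime is unset, can be sketched in plain Python (StageInfo here is a hypothetical stand-in for the object a Scala SparkListener receives; in Spark the real field is stage.latestInfo.submissionTime, an Option[Long] in milliseconds):

```python
import time

# Hypothetical stand-in for the StageInfo a listener receives; in Spark
# submissionTime is an Option[Long] of epoch milliseconds.
class StageInfo:
    def __init__(self, submission_time=None):
        self.submission_time = submission_time

def effective_submission_time(info):
    # Workaround: if the scheduler posted the event before setting the
    # submission time, fall back to the listener's current wall clock.
    if info.submission_time is not None:
        return info.submission_time
    return int(time.time() * 1000)
```

This is only an approximation for live listeners; as noted above it does not help the history server, which replays events long after the fact.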
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954387#comment-15954387 ] Mridul Muralidharan commented on SPARK-20205: - For the history server that will fail - good point. At least for custom listeners, users can work around it until the next release by using the current time. Thanks for clarifying [~vanzin] ! > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time.
[jira] [Resolved] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18893. - Resolution: Fixed Fix Version/s: 2.2.0 > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > Fix For: 2.2.0 > > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that change a table using "alter table add columns" failed, even though the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support SQL "alter table .. add/replace columns"?
[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954375#comment-15954375 ] Wenchen Fan commented on SPARK-18893: - https://issues.apache.org/jira/browse/SPARK-19261 > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that change a table using "alter table add columns" failed, even though the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support SQL "alter table .. add/replace columns"?
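For reference, the DDL being requested has well-established semantics in other SQL engines: new columns are appended to the schema and existing rows read back NULL for them. The sketch below demonstrates this with Python's built-in sqlite3 module (which supports the single-column ADD COLUMN form) purely as an illustration; it is not Spark, whose support is tracked in SPARK-19261:

```python
import sqlite3

# Illustrating "alter table .. add columns .." semantics with sqlite3 as a
# stand-in engine (NOT Spark): the new column is appended to the schema and
# pre-existing rows return NULL for it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("ALTER TABLE t ADD COLUMN note TEXT")

# PRAGMA table_info rows are (cid, name, type, ...); collect column names.
cols = [row[1] for row in conn.execute("PRAGMA table_info(t)")]
existing_row_value = conn.execute("SELECT note FROM t").fetchone()
```

The same user-visible contract (appended column, NULLs for old rows) is what the Spark 2.2 fix provides for Hive tables.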
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954357#comment-15954357 ] Marcelo Vanzin commented on SPARK-20205: bq. I was referring to the case where we are persisting to event log or consuming events to externally persist them. I see. In that case I believe it will always be unset. For live listeners, current time is a good enough approximation, but for the history server, for example, that's not an option (since {{SparkListenerStageSubmitted}} does not have a {{time}} field). > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954348#comment-15954348 ] Mridul Muralidharan commented on SPARK-20205: - bq. I wouldn't say incorrect; at worst it's gonna be slightly inaccurate. I was referring to the case where we are persisting to event log or consuming events to externally persist them. In this context, will we always have unspecified submissionTime or is there case where submissionTime is pointing to some incorrect/spurious value (if this is always in the codepath after makeNewStageAttempt; then it should be fine). Essentially, is the workaround for existing spark versions to simply set submissionTime to current time if it is None for SparkListenerStageSubmitted sufficient ? Will it miss some corner case ? (value is set but is incorrect ?) > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. 
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954340#comment-15954340 ] Marcelo Vanzin commented on SPARK-20205: bq. This is nasty ! This means submissionTime will always be unset ? Well, it's a little more complicated than that. The UI code currently "self heals", because it just keeps a pointer to the {{StageInfo}} object which is modified by the scheduler later. So eventually the UI sees the value. But the event log, for example, might not have the submission time. bq. Btw, is it possible for submissionTime to be set - but to an incorrect value ? I wouldn't say incorrect; at worst it's gonna be slightly inaccurate. > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. 
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954333#comment-15954333 ] Mridul Muralidharan commented on SPARK-20205: - This is nasty ! This means submissionTime will always be unset ? Btw, is it possible for submissionTime to be set - but to an incorrect value ? > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954312#comment-15954312 ] Kamal Gurala commented on SPARK-4899: - Some performance related concerns https://github.com/apache/spark/pull/60#r16817226 > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954233#comment-15954233 ] Steve Loughran edited comment on SPARK-20153 at 4/3/17 10:13 PM: - This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in Spark: having one consistent model for configuring s3a bindings everywhere matters, as there are a lot more options than just credentials; the S3 endpoint is a critical one when trying to work with V4 auth endpoints. As a temporary workaround, one which will leak your secrets to logs, know that you can use s3a://key:secret@bucket, URL-encoding the secret, and so get access. Once you use this, consider all logs sensitive data. was (Author: ste...@apache.org): This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in spark as having one consistent model for configuring s3a bindings everywhere will be the only way to debug what's going on, especially given that for security reasons you can't log what's going on. As a temporary workaround, one which will leak your secrets to logs, know that you can go s3a://key:secret@bucket, URL encoding the secret.
> Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple Hive tables in my Spark application, where each Hive table is > 1- an external table with data sitting on S3 > 2- owned by a different AWS user, so I need to provide different AWS credentials. > I am familiar with setting the AWS credentials in the Hadoop configuration object, but that does not really help me because I can only set one pair of (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). > From my research, there is no easy or elegant way to do this in Spark. Why is that? > How do I address this use case?
[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954233#comment-15954233 ] Steve Loughran commented on SPARK-20153: This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in Spark, as having one consistent model for configuring s3a bindings everywhere will be the only way to debug what's going on, especially given that for security reasons you can't log what's going on. As a temporary workaround, one which will leak your secrets to logs, know that you can use s3a://key:secret@bucket, URL-encoding the secret. > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple Hive tables in my Spark application, where each Hive table is > 1- an external table with data sitting on S3 > 2- owned by a different AWS user, so I need to provide different AWS credentials. > I am familiar with setting the AWS credentials in the Hadoop configuration object, but that does not really help me because I can only set one pair of (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). > From my research, there is no easy or elegant way to do this in Spark. Why is that? > How do I address this use case?
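Both options in the comment above can be made concrete. Hadoop 2.8 per-bucket keys take the form fs.s3a.bucket.BUCKET.access.key / .secret.key; the inline-URL workaround requires URL-encoding the secret (and, as warned, leaks it to logs). A small sketch with made-up bucket names and credentials:

```python
from urllib.parse import quote

# Hadoop 2.8+ per-bucket configuration: each bucket carries its own
# credentials. Bucket names and key values here are entirely made up.
hadoop_conf = {
    "fs.s3a.bucket.sales.access.key": "AKIAEXAMPLE1",
    "fs.s3a.bucket.sales.secret.key": "secret/one+x",
    "fs.s3a.bucket.hr.access.key": "AKIAEXAMPLE2",
    "fs.s3a.bucket.hr.secret.key": "secret/two",
}

def inline_s3a_url(bucket, access_key, secret_key, path):
    # Temporary workaround only: embeds credentials in the URL, which WILL
    # leak into logs. The secret must be URL-encoded; '/' and '+' are the
    # usual offenders in AWS secret keys.
    return f"s3a://{access_key}:{quote(secret_key, safe='')}@{bucket}/{path}"
```

The per-bucket route is the sustainable one; the inline URL only buys time until Hadoop 2.8 is available.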
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954212#comment-15954212 ] Charles Allen commented on SPARK-4899: -- It was discussed on the mailing list with [~timchen] that checkpointing might just need a timeout setting available to the other schedulers. > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20064: Assignee: (was: Apache Spark) > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Priority: Minor > Labels: starter > > The version.py file should be updated for the new version. Note: this isn't critical, since for any releases made with make-distribution the version number is read from the XML, but if anyone builds from source and manually looks at the version number it would be good to have it match. This is a good starter issue, but something we should do quickly.
[jira] [Assigned] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20064: Assignee: Apache Spark > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > The version.py file should be updated for the new version. Note: this isn't critical, since for any releases made with make-distribution the version number is read from the XML, but if anyone builds from source and manually looks at the version number it would be good to have it match. This is a good starter issue, but something we should do quickly.
[jira] [Commented] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954186#comment-15954186 ] Apache Spark commented on SPARK-20064: -- User 'setjet' has created a pull request for this issue: https://github.com/apache/spark/pull/17523 > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Priority: Minor > Labels: starter > > The version.py file should be updated for the new version. Note: this isn't critical, since for any releases made with make-distribution the version number is read from the XML, but if anyone builds from source and manually looks at the version number it would be good to have it match. This is a good starter issue, but something we should do quickly.
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954170#comment-15954170 ] Charles Allen commented on SPARK-4899: -- {{org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver}} seems to allow checkpointing, which only {{org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler}} uses. Neither the fine grained nor coarse grained schedulers use it, is there a reason for that? > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
Marcelo Vanzin created SPARK-20205: -- Summary: DAGScheduler posts SparkListenerStageSubmitted before updating stage Key: SPARK-20205 URL: https://issues.apache.org/jira/browse/SPARK-20205 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Marcelo Vanzin Probably affects other versions, haven't checked. The code that submits the event to the bus is around line 991: {code} stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq) listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties)) {code} Later in the same method, the stage information is updated (around line 1057): {code} if (tasks.size > 0) { logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " + s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") taskScheduler.submitTasks(new TaskSet( tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties)) stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) {code} That means an event handler might get a stage submitted event with an unset submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
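The ordering problem described above can be reproduced in miniature, without Spark: a listener that copies the field at post time (like the event log writer) captures None, while a listener that keeps a reference to the mutable object later sees the value once the scheduler "heals" it. The bus and StageInfo below are toy stand-ins, not Spark classes:

```python
import time

# Toy stand-in for Spark's StageInfo; submission_time starts unset.
class StageInfo:
    def __init__(self):
        self.submission_time = None

captured_at_post = []   # simulates the event log: snapshots at post time
live_reference = []     # simulates the UI: keeps the mutable object

def post(listeners, info):
    # Toy listener bus: deliver the event synchronously to each listener.
    for listener in listeners:
        listener(info)

info = StageInfo()
post([lambda i: captured_at_post.append(i.submission_time),
      lambda i: live_reference.append(i)],
     info)                                       # event posted first...
info.submission_time = int(time.time() * 1000)   # ...field set afterwards
```

After this runs, the snapshot holds None while the live reference sees the later value, which matches the "self-healing UI vs. incomplete event log" behavior discussed in the comments.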
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954132#comment-15954132 ] Apache Spark commented on SPARK-18278: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/17522 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954093#comment-15954093 ] Kazuaki Ishizaki commented on SPARK-20176: -- Thanks. The code seem to work for the master. I am investigating which change fixes the issue. > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > 
/* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */ boolean isNull15 = false; > /* 097 */ double value15 = -1.0; > /* 098 */ if (!isNull16 && value16) { >
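The compile error quoted above comes from generated line 58, where a conditional expression mixes a primitive {{boolean}} branch with a boxed {{java.lang.Boolean}} branch. A minimal standalone sketch of that pattern (the class name is hypothetical; javac accepts this via an unboxing conversion, while the embedded Janino compiler used by Spark's codegen at the affected version reportedly rejected it with the "Incompatible expression types" message):

```java
// Sketch of the pattern at generated line 58:
//   final boolean value9 = isNull9 ? false : (Boolean) value10.isNullAt(0);
// One branch is a primitive boolean, the other a boxed java.lang.Boolean.
// javac inserts an unboxing conversion; stricter runtime compilers such as
// Janino may reject the mixed-type conditional instead.
public class TernaryBoxingDemo {
    public static void main(String[] args) {
        Boolean boxed = Boolean.TRUE; // stands in for value10.isNullAt(0)
        boolean isNull = false;       // stands in for isNull9
        // javac unboxes the Boolean branch to a primitive boolean here.
        final boolean value = isNull ? false : (Boolean) boxed;
        System.out.println(value);
    }
}
```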
[jira] [Comment Edited] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954093#comment-15954093 ] Kazuaki Ishizaki edited comment on SPARK-20176 at 4/3/17 8:13 PM: -- Thanks. The code seems to work on the master branch. I am investigating which change fixed the issue. was (Author: kiszk): Thanks. The code seem to work for the master. I am investigating which change fixes the issue. > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting the following error in a custom UDAF: > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public 
java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false :
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953968#comment-15953968 ] Wenchen Fan commented on SPARK-19659: - What's the smallest unit of fetching remote shuffle blocks? If the unit is a block, I think it's really hard to avoid OOM entirely: if the estimated block size is wrong, fetching this block may cause OOM and we can do nothing about it. (I guess that's why you added {{spark.reducer.maxBytesShuffleToMemory}} in your PR.) If the unit can be smaller, like a byte buffer, and we can fully track and control the shuffle fetch memory usage, I think we can then solve the OOM problem pretty well without introducing a new config for users. Is it possible to do this with some advanced Netty API? > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory (off-heap by default) on > shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can > be large in skew situations. If OOM happens during shuffle read, the job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM. However, this approach is not well suited to a > production environment, especially a data warehouse. > Using Spark SQL as the data engine in a warehouse, users hope to have a unified > parameter (e.g. memory) with less resource wasted (resource allocated but not > used). > It's not always easy to predict skew situations; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read rather than > kill the job because of OOM. This approach was mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
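The trade-off discussed above (keep fetched shuffle blocks in memory vs. divert them to disk) boils down to a buffered sink that holds data in memory until a byte threshold is crossed and then spills. A minimal single-process sketch of that idea; the class name and threshold semantics are illustrative only and mirror the role of a setting like the proposed {{spark.reducer.maxBytesShuffleToMemory}}, not Spark's actual implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative only: buffers a fetched block in memory while it stays under
// maxBytesInMemory, then spills everything written so far to a temp file and
// streams the rest to disk.
public class SpillableBlockSink implements AutoCloseable {
    private final long maxBytesInMemory;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk;   // non-null once we have spilled
    private File spillFile;
    private long bytesWritten;

    public SpillableBlockSink(long maxBytesInMemory) {
        this.maxBytesInMemory = maxBytesInMemory;
    }

    public void write(byte[] chunk) throws IOException {
        bytesWritten += chunk.length;
        if (disk == null && bytesWritten > maxBytesInMemory) {
            spillFile = File.createTempFile("shuffle-block", ".spill");
            disk = new FileOutputStream(spillFile);
            memory.writeTo(disk);   // move the buffered prefix to disk
            memory = null;          // free the in-memory buffer
        }
        if (disk != null) {
            disk.write(chunk);
        } else {
            memory.write(chunk);
        }
    }

    public boolean spilled() { return disk != null; }

    @Override
    public void close() throws IOException {
        if (disk != null) disk.close();
        if (spillFile != null) spillFile.delete();
    }

    public static void main(String[] args) throws IOException {
        try (SpillableBlockSink sink = new SpillableBlockSink(1024)) {
            sink.write(new byte[512]);
            System.out.println("spilled after 512B: " + sink.spilled());
            sink.write(new byte[1024]);
            System.out.println("spilled after 1536B: " + sink.spilled());
        }
    }
}
```

The open question in the comment, tracking at byte-buffer rather than block granularity, corresponds to making `write()` the unit of accounting instead of the whole block.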
[jira] [Commented] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953930#comment-15953930 ] Apache Spark commented on SPARK-20204: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17521 > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20204: Assignee: Apache Spark (was: Wenchen Fan) > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20204: Assignee: Wenchen Fan (was: Apache Spark) > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
Wenchen Fan created SPARK-20204: --- Summary: separate SQLConf into catalyst confs and sql confs Key: SPARK-20204 URL: https://issues.apache.org/jira/browse/SPARK-20204 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953820#comment-15953820 ] Bryan Cutler commented on SPARK-19979: -- From the discussion in the PR:
{noformat}
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)
val pipeline = new Pipeline()
val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)
val pipeline1_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline1)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val pipeline2_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline2)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .build()
val paramGrid = pipeline1_grid ++ pipeline2_grid
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
{noformat}
[~josephkb] [~mlnick] would this be good to add to the documentation? > [MLLIB] Multiple Estimators/Pipelines In CrossValidator > --- > > Key: SPARK-19979 > URL: https://issues.apache.org/jira/browse/SPARK-19979 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: David Leifker > > Update CrossValidator and TrainValidationSplit to be able to accept multiple > pipelines and grid parameters for testing different algorithms and/or being > able to better control tuning combinations. Maintains a backwards-compatible > API and reads legacy serialized objects. > The same could be done using an external iterative approach. 
Build different > pipelines, throwing each into a CrossValidator, then taking the best > model from each of those CrossValidators, and finally picking the best from > those. This is the initial approach I explored. It resulted in a lot of > boilerplate code that felt like it shouldn't need to exist if the API simply > allowed for arrays of estimators and their parameters. > A couple of advantages of this implementation come from keeping the > functional interface of the CrossValidator: > 1. The caching of the folds is better utilized. An external iterative > approach creates a new set of k folds for each CrossValidator fit, and the > folds are discarded after each CrossValidator run. In this implementation a > single set of k folds is created and cached for all of the pipelines. > 2. A potential advantage of this implementation is future > parallelization of the pipelines within the CrossValidator. It is of course > possible to handle the parallelization outside of the CrossValidator here > too; however, I believe there is already work in progress to parallelize the > grid parameters, and that could be extended to multiple pipelines. > Both of these behind-the-scenes optimizations are possible because > the CrossValidator is provided with the data and the complete set of > pipelines/estimators to evaluate up front, allowing one to abstract away the > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953793#comment-15953793 ] Apache Spark commented on SPARK-19712: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/17520 > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19712: Assignee: Apache Spark > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19712: Assignee: (was: Apache Spark) > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953728#comment-15953728 ] DB Tsai commented on SPARK-20047: - I changed the target to 2.3.0. Thanks. > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed in much of the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can do projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
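The "well solved" box-constrained case described in the issue amounts to clamping each coefficient back into its [lower, upper] interval after every gradient step. A toy 1-D illustration of projected gradient descent (illustrative numbers only, not Spark's LBFGS-B-based implementation):

```java
// Toy projected gradient descent for min_w (w - 3)^2 subject to 0 <= w <= 2.
// The unconstrained optimum is w = 3; projecting each step into the box
// makes the iterates converge to the boundary value w = 2 instead.
public class ProjectedGradientDemo {
    // Projection onto the box [lo, hi]: clamp the coordinate.
    static double clamp(double w, double lo, double hi) {
        return Math.max(lo, Math.min(hi, w));
    }

    public static void main(String[] args) {
        double w = 0.0, lr = 0.1, lo = 0.0, hi = 2.0;
        for (int i = 0; i < 200; i++) {
            double grad = 2 * (w - 3);          // d/dw of (w - 3)^2
            w = clamp(w - lr * grad, lo, hi);   // gradient step, then project
        }
        System.out.printf("w = %.4f%n", w);
    }
}
```

The same clamp, applied coordinate-wise, is exactly "zeroing the negative weights at each step" when the box is [0, +inf).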
[jira] [Updated] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20047: Affects Version/s: (was: 2.1.0) 2.2.0 Target Version/s: 2.3.0 (was: 2.2.0) > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed in much of the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can do projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953704#comment-15953704 ] Adrian Ionescu commented on SPARK-20193: cc [~hvanhovell] > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu > Labels: struct > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 
39 elided > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20194) Support partition pruning for InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-20194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20194. - Resolution: Fixed Assignee: Adrian Ionescu Fix Version/s: 2.2.0 > Support partition pruning for InMemoryCatalog > - > > Key: SPARK-20194 > URL: https://issues.apache.org/jira/browse/SPARK-20194 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Assignee: Adrian Ionescu > Fix For: 2.2.0 > > > {{listPartitionsByFilter()}} is not yet implemented for {{InMemoryCatalog}}: > {quote} > // TODO: Provide an implementation > throw new UnsupportedOperationException( > "listPartitionsByFilter is not implemented for InMemoryCatalog") > {quote} > Because of this, there is a hack in {{FindDataSourceTable}} that avoids > passing along the {{CatalogTable}} to the {{DataSource}} it creates when the > catalog implementation is not "hive", so that, when the latter is resolved, > an {{InMemoryFileIndex}} is created instead of a {{CatalogFileIndex}} which > the {{PruneFileSourcePartitions}} rule matches for. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arush Kharbanda updated SPARK-20199: Comment: was deleted (was: I will work on this issue.) > GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar >Priority: Minor > > Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: > gbmParams._col_sample_rate > Please provide the parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore
[ https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953647#comment-15953647 ] Jonathan Maron commented on SPARK-11783: I am running a spark job and, when instantiating a HiveContext, I see that the client creates a local derby-based metastore. Is this the intent for client processes? I don't understand the necessity for a client process to create a metastore instance rather than leverage the remote metastore server. > When deployed against remote Hive metastore, HiveContext.executionHive points > to wrong metastore > > > Key: SPARK-11783 > URL: https://issues.apache.org/jira/browse/SPARK-11783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.6.0 > > > When using remote metastore, execution Hive client somehow is initialized to > point to the actual remote metastore instead of the dummy local Derby > metastore. > To reproduce this issue: > # Configuring {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 > metastore. > # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}. > # Start metastore service using {{$HIVE_HOME/bin/hive --service metastore}} > # Start Thrift server with remote debugging options > # Attach the debugger to the Thrift server driver process, we can verify that > {{executionHive}} points to the remote metastore rather than the local > execution Derby metastore. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9272) Persist information of individual partitions when persisting partitioned data source tables to metastore
[ https://issues.apache.org/jira/browse/SPARK-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953582#comment-15953582 ] Daniel Tomes commented on SPARK-9272: - BUMP. This is an important issue. Let's get this resolved. > Persist information of individual partitions when persisting partitioned data > source tables to metastore > > > Key: SPARK-9272 > URL: https://issues.apache.org/jira/browse/SPARK-9272 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Currently, when a partitioned data source table is persisted to the Hive > metastore, we only persist its partition columns. Information about > individual partitions is not persisted. This forces us to do a partition > discovery before reading a persisted partitioned table, which hurts > performance. > To fix this issue, we may persist partition information into the metastore. > Specifically, the format should be compatible with Hive to ensure > interoperability. > One approach to collecting partition values and partition directory paths > for dynamically partitioned tables is to use accumulators to collect the > expected information during the write job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
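The accumulator idea in the last paragraph of the description: while writing a dynamically partitioned dataset, each task records the partition values it actually produced, and the driver persists that collected set to the metastore afterwards, so no post-hoc partition discovery is needed. A single-process sketch of the bookkeeping (a concurrent set plays the role of the accumulator; none of these names are Spark APIs):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Illustrative only: during a simulated "write job", each record's partition
// value is added to a shared set (the accumulator stand-in). After the job,
// the set holds exactly the partition directories to register.
public class PartitionRegistryDemo {
    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "2017-04-01/a", "2017-04-01/b", "2017-04-02/c", "2017-04-03/d");
        Set<String> seenPartitions = new ConcurrentSkipListSet<>();

        // Simulated write tasks: derive the partition value from each record.
        records.parallelStream().forEach(r ->
            seenPartitions.add("dt=" + r.substring(0, r.indexOf('/'))));

        // Driver side: this is the set that would be persisted to the metastore.
        System.out.println(new TreeSet<>(seenPartitions));
    }
}
```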
[jira] [Commented] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953551#comment-15953551 ] Nick Pentreath commented on SPARK-20047: Is this really targeted for 2.2.0? > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed extensively in the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can apply projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > the intercepts.
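The primal scheme the description mentions can be sketched concretely: take a gradient step, then project the coefficient back into the box. Below is a toy pure-Python illustration on a 1-D least-squares problem (not the MLlib implementation, which the ticket proposes to build on LBFGS-B); note also the ticket's point that under standardization (x divided by its standard deviation σ) a bound l on the original coefficient becomes l·σ on the scaled one.

```python
def clip(value, lo, hi):
    """Project a scalar back into the box [lo, hi]."""
    return max(lo, min(hi, value))

def box_constrained_fit(xs, ys, lower, upper, lr=0.05, steps=2000):
    """1-D least squares for y ~ w * x, with w constrained to [lower, upper].

    Projected gradient descent in the primal: after every gradient step,
    the coefficient is clipped back into the box.
    """
    w, n = 0.0, len(xs)
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w = clip(w - lr * grad, lower, upper)
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]        # true coefficient is 2
print(box_constrained_fit(xs, ys, -10.0, 10.0))  # box inactive: converges near 2
print(box_constrained_fit(xs, ys, 0.0, 1.0))     # box active: pinned at the upper bound
```

When the box excludes the unconstrained optimum, the solution sits on the boundary, which is exactly the behavior a user setting per-coefficient bounds would expect.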
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953489#comment-15953489 ] Sean Owen commented on SPARK-20202: --- Alrighty, you can leave the status for now, but generally committers set Blocker. I'm not entirely clear this blocks a release, at least not yet. You're absolutely right, but the Hive fork, with binaries and source, is part of this project. At least, that's the idea. For example, it is notionally voted on and released with each Spark release, but the binary/source of this fork project isn't separately, explicitly voted on and released. I think that should occur, for the avoidance of doubt, so that this is a blessed artifact of the Spark project. Would this answer your process and policy concerns about the release? It's not pretty, but I think that's within the law. Of course, it's no answer in the long term. The goal is to not have to use the fork at all. If Hive packaging changes are already in place to make it unnecessary, great (is that all there is to it, everyone?). I don't know if that presents a solution for earlier versions of Hive. This fork may persist in existing branches, but it has to at least be released and used in a proper way. This may need fixes right now. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Blocker > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions.
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated SPARK-20202: -- Priority: Blocker (was: Critical) It is against Apache policy to release binaries that aren't part of your project. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Blocker > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions.
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953386#comment-15953386 ] Hyukjin Kwon commented on SPARK-19809: -- Shouldn't it contain a footer and schema information, or a magic number at least? I am not sure we can say a 0-byte file is an ORC file. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2 >Reporter: Michał Dawid > > When reading from a Hive ORC table, if there are some 0-byte files we get a > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953341#comment-15953341 ] Michał Dawid commented on SPARK-19809: -- Those empty files have been created while processing with Pig scripts. {code}-rw-rw-rw- 3 etl hdfs 14103 2017-04-03 01:26 part-v001-o000-r-0_a_2 -rw-rw-rw- 3 etl hdfs 0 2017-04-03 01:26 part-v001-o000-r-0_a_3 -rw-rw-rw- 3 etl hdfs 10125 2017-04-03 01:27 part-v001-o000-r-0_a_4 {code} > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2 >Reporter: Michał Dawid
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953328#comment-15953328 ] Cyril de Vogelaere commented on SPARK-20203: Oh, I thought we were talking about the performance implication of adding an if that would be tested often. For the issue you just pointed out, I agree it would be a major negative consequence of that change. Sorry, I didn't understand that that was what you were talking about. Well, then I suppose we should resolve this thread as "Won't Fix", unless you think the potential user-friendliness outweighs that major drawback. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953319#comment-15953319 ] Sean Owen commented on SPARK-20203: --- How can this not have performance implications? You generate more frequent patterns, potentially a lot more. You can see this even in the comments and error messages about collecting too many elements to the driver. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953318#comment-15953318 ] Cyril de Vogelaere commented on SPARK-20203: I'm not splitting it; I deleted the other thread. I did agree that adding the zero special value might have a tiny negative effect on performance without adding new functionality, so I closed it, following that line of thought. This post is just about changing the default value, which, you agreed, can be discussed. That's a new context of discussion, so I created a new thread. This should make more sense, no? > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953315#comment-15953315 ] Owen O'Malley commented on SPARK-20202: --- I should also say here that the Hive community is willing to help. We are in the process of rolling releases so if Spark needs a change, we can work together to get this done. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions.
[jira] [Comment Edited] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953299#comment-15953299 ] Cyril de Vogelaere edited comment on SPARK-20203 at 4/3/17 11:18 AM: - This cannot have performance implication, we are not changing anything but the default value. It does change the number of solution we are searching for. So of course it will take longer since the search space is bigger. But on a dataset where it already found everything, it should still do so. And not be slower at all. Now, it would just find everything by default. Which, I agree, should be debated. To know whether that's really what we want the default behavior of the program to be. was (Author: syrux): This cannot have performance implication, we are not changing anything but the default value. It does change the number of solution we are searching for. So of course it will take longer since the search space is bigger. But on a dataset where it already found everything, it should still do so. Now, it would just find everything by default. Which, I agree, should be debated. To know whether that's really what we want the default behavior of the program to be. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. 
> I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953299#comment-15953299 ] Cyril de Vogelaere commented on SPARK-20203: This cannot have performance implication, we are not changing anything but the default value. It does change the number of solution we are searching for. So of course it will take longer since the search space is bigger. But on a dataset where it already found everything, it should still do so. Now, it would just find everything by default. Which, I agree, should be debated. To know whether that's really what we want the default behavior of the program to be. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953298#comment-15953298 ] Owen O'Malley commented on SPARK-20202: --- As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953298#comment-15953298 ] Owen O'Malley edited comment on SPARK-20202 at 4/3/17 11:16 AM: As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it needs to formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. was (Author: owen.omalley): As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. 
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953297#comment-15953297 ] Cyril de Vogelaere commented on SPARK-20203: SPARK-20180 was about adding a special value (0) to find all patterns no matter their length, and making it the default value. You pointed out that it might lower performance without adding more functionality, so I closed that thread. This one is just about changing the default value, with no other changes in the code. You said it needed discussion, since it was a change in default behavior. But the amount of comments on the last thread would discourage discussion, so I felt a new thread would be more appropriate. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Closed] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere closed SPARK-20180. -- Resolution: Won't Fix > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in a dataset without looking at this > parameter.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953289#comment-15953289 ] Sean Owen commented on SPARK-20203: --- This is again not addressing the point, that doing so has performance implications. Or could. That has to be established. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
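The performance concern raised above can be made concrete: the number of frequent patterns (and the work to enumerate them) grows as the length cap is raised, so Int.MaxValue as a default removes the only brake on that growth. A toy brute-force sketch in plain Python, only to illustrate the effect of the cap; this is not the PrefixSpan algorithm itself:

```python
from itertools import combinations

def frequent_patterns(sequences, min_support, max_pattern_length):
    """Brute-force frequent-subsequence mining with a length cap.

    A toy stand-in for PrefixSpan, used only to show how the result set
    grows when max_pattern_length is raised.
    """
    # Candidate patterns: every subsequence of every input sequence,
    # truncated at the length cap.
    candidates = set()
    for seq in sequences:
        for n in range(1, min(max_pattern_length, len(seq)) + 1):
            candidates.update(combinations(seq, n))

    def is_subsequence(pattern, seq):
        it = iter(seq)
        return all(item in it for item in pattern)  # consumes `it` in order

    return sorted(p for p in candidates
                  if sum(is_subsequence(p, s) for s in sequences) >= min_support)

db = [(1, 2, 3), (1, 2, 3), (1, 3, 2)]
print(len(frequent_patterns(db, 3, 1)))  # with the cap at 1
print(len(frequent_patterns(db, 3, 3)))  # raising the cap yields more patterns
```

Even on this three-sequence toy database the frequent-pattern count grows as the cap is lifted; on real data the growth can be combinatorial, which is the cost of collecting "all solutions" by default.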
[jira] [Updated] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cyril de Vogelaere updated SPARK-20203:
---------------------------------------
    Description:
I think changing the default value to Int.MaxValue would be more user friendly. At least for new users. Personally, when I run an algorithm, I expect it to find all solution by default. And a limited number of them, when I set the parameters to do so.
The current implementation limit the length of solution patterns to 10. Thus preventing all solution to be printed when running slightly large datasets.
I feel like that should be changed, but since this would change the default behavior of PrefixSpan. I think asking for the communities opinion should come first. So, what do you think ?

  was:
I think changing the default value to Int.MaxValue would be more user friendly. At least for new user. Personally, when I run an algorithm, I expect it to find all solution by default. And a limited number of them, when I set the parameters so.
The current implementation limit the length of solution patterns to 10. Thus preventing all solution to be printed when running slightly large datasets.
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953282#comment-15953282 ]

Cyril de Vogelaere commented on SPARK-20180:
--------------------------------------------

Fine, I thought a TODO left in the code would reflect the wish of the community, at least a little. I will close this thread and open a new one on changing the default value to Int.MaxValue, since I personally think it would be friendlier to new users.

Link to new thread: https://issues.apache.org/jira/browse/SPARK-20203

Tomorrow, I will create a new thread with another improvement I want to add to Spark. I need to run a performance test on just that change first, to prove it will be useful. I hope you will follow it too.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953280#comment-15953280 ]

Sean Owen commented on SPARK-20203:
-----------------------------------

I don't understand; isn't this the same as SPARK-20180?
[jira] [Created] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
Cyril de Vogelaere created SPARK-20203:
---------------------------------------

             Summary: Change default maxPatternLength value to Int.MaxValue in PrefixSpan
                 Key: SPARK-20203
                 URL: https://issues.apache.org/jira/browse/SPARK-20203
             Project: Spark
          Issue Type: Wish
          Components: MLlib
    Affects Versions: 2.1.0
            Reporter: Cyril de Vogelaere
            Priority: Trivial

I think changing the default value to Int.MaxValue would be more user friendly, at least for new users. Personally, when I run an algorithm, I expect it to find all solutions by default, and a limited number of them when I set the parameters to do so.
The current implementation limits the length of solution patterns to 10, preventing all solutions from being printed when running slightly large datasets.
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-20202:
------------------------------
         Priority: Critical  (was: Blocker)
    Fix Version/s:     (was: 2.1.1)
                       (was: 1.6.4)
                       (was: 2.0.3)

I see wide agreement on that. One question I have: is including Hive this way merely really-not-nice-to-have, or actually not allowed? I think the question is whether sources are available, right? Because releases can't have binary-only parts. I plead ignorance; I have never myself paid much attention to this integration. If it's not allowed, then something has to change for releases beyond 2.1.1, and this can be targeted as a Blocker accordingly.

Does this depend on refactoring or changes in Hive? IIRC the problem was hive-exec being an uber-jar, but it's been a long time since I read any of that discussion.

> Remove references to org.spark-project.hive
> -------------------------------------------
>
>                 Key: SPARK-20202
>                 URL: https://issues.apache.org/jira/browse/SPARK-20202
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, SQL
>    Affects Versions: 1.6.4, 2.0.3, 2.1.1
>            Reporter: Owen O'Malley
>            Priority: Critical
>
> Spark can't continue to depend on their fork of Hive and must move to standard Hive versions.
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953240#comment-15953240 ]

Sean Owen commented on SPARK-20180:
-----------------------------------

Surely the impact is more than an 'if' statement. If you contemplate much larger spans, that's going to take longer to compute and return, right? I think we're not at all in agreement there, especially as you're seeing the test (?) run forever. Yes, I know there's a TODO (BTW, you can see who wrote it with 'blame'), but that doesn't mean I agree with it. It also doesn't say it should be a default.

Keep in mind how much time it takes to discuss these changes relative to the value. We need to converge rapidly to decisions. The question here is performance impact on non-trivial examples. So far I just don't see a compelling reason to change a default. The functionality you want is already available.
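The cap being debated above can be illustrated without Spark. The following is a minimal, self-contained sketch in plain Scala — not Spark's PrefixSpan implementation; the object name, `frequentPatterns`, and the "0 means unlimited" convention are illustrative assumptions taken from this ticket. It counts frequent sequential patterns up to a configurable `maxPatternLength`:

```scala
// Naive frequent-sequential-pattern counter, for illustrating the
// maxPatternLength semantics only (exponential candidate generation;
// the real PrefixSpan is far more efficient).
object PrefixSpanSketch {
  type Seqn = List[Int]

  // True if `pattern` occurs as a (not necessarily contiguous) subsequence of `s`.
  def isSubseq(pattern: Seqn, s: Seqn): Boolean = pattern match {
    case Nil => true
    case p :: ps =>
      s.dropWhile(_ != p) match {
        case Nil       => false
        case _ :: rest => isSubseq(ps, rest)
      }
  }

  // maxPatternLength == 0 is treated as "unlimited", the special value
  // proposed in this ticket; any positive value caps the pattern length.
  def frequentPatterns(db: List[Seqn], minCount: Int, maxPatternLength: Int): Map[Seqn, Int] = {
    require(minCount >= 1, "minCount must be positive so the search terminates")
    val unlimited = maxPatternLength == 0
    val items = db.flatten.distinct
    var level: List[Seqn] = items.map(List(_)) // candidates of the current length
    var result = Map.empty[Seqn, Int]
    var len = 1
    while (level.nonEmpty && (unlimited || len <= maxPatternLength)) {
      val counted = level.map(p => p -> db.count(s => isSubseq(p, s))).filter(_._2 >= minCount)
      result ++= counted
      // Extend only frequent patterns (support is anti-monotone), so the
      // unlimited case still terminates once no candidate reaches minCount.
      level = counted.map(_._1).flatMap(p => items.map(i => p :+ i))
      len += 1
    }
    result
  }
}
```

In Spark itself, the behavior under discussion is already reachable today by setting the existing knob explicitly, e.g. `setMaxPatternLength(Int.MaxValue)`; the ticket only argues about the default.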
[jira] [Created] (SPARK-20202) Remove references to org.spark-project.hive
Owen O'Malley created SPARK-20202:
----------------------------------

             Summary: Remove references to org.spark-project.hive
                 Key: SPARK-20202
                 URL: https://issues.apache.org/jira/browse/SPARK-20202
             Project: Spark
          Issue Type: Bug
          Components: Build, SQL
    Affects Versions: 1.6.4, 2.0.3, 2.1.1
            Reporter: Owen O'Malley
            Priority: Blocker
             Fix For: 1.6.4, 2.0.3, 2.1.1

Spark can't continue to depend on their fork of Hive and must move to standard Hive versions.
[jira] [Resolved] (SPARK-19752) OrcGetSplits fails with 0 size files
[ https://issues.apache.org/jira/browse/SPARK-19752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19752.
----------------------------------
    Resolution: Duplicate

It sounds like a duplicate of SPARK-19809. Please reopen it if I misunderstood.

> OrcGetSplits fails with 0 size files
> ------------------------------------
>
>                 Key: SPARK-19752
>                 URL: https://issues.apache.org/jira/browse/SPARK-19752
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.0
>            Reporter: Nick Orka
>
> There is a possibility that during some SQL queries a partition may have a 0 size (empty) file. Next time I try to read from the file with a SQL query, I get this error:
> 17/02/27 10:33:11 INFO PerfLogger: start=1488191591570 end=1488191591599 duration=29 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl
> 17/02/27 10:33:11 ERROR ApplicationMaster: User class threw exception: java.lang.reflect.InvocationTargetException
> java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror1.jinvokeraw(JavaMirrors.scala:373)
>         at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaMethodMirror.jinvoke(JavaMirrors.scala:339)
>         at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror.apply(JavaMirrors.scala:355)
>         at com.sessionm.Datapipeline$.main(Datapipeline.scala:200)
>         at com.sessionm.Datapipeline.main(Datapipeline.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
> Caused by: java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
>         at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
>         at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
>         at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
>         at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
>         at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>         at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>         at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
>         at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
>         at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:169)
>         at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
>         at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
>         at
[jira] [Resolved] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19809.
----------------------------------
    Resolution: Invalid

I don't think a 0-byte ORC file can be valid; it should at least have the footer. Moreover, Spark's ORC datasource currently does not write out empty files (see https://issues.apache.org/jira/browse/SPARK-15474). Please reopen this if I misunderstood. It would be great if there were some steps to reproduce, to verify this issue. I am resolving this.

> NullPointerException on empty ORC file
> --------------------------------------
>
>                 Key: SPARK-19809
>                 URL: https://issues.apache.org/jira/browse/SPARK-19809
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.6.3, 2.0.2
>            Reporter: Michał Dawid
>
> When reading from a Hive ORC table, if there are some 0 byte files we get a NullPointerException:
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>         at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>         at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>         at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>         at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>         at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>         at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>         at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>         at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>         at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>         at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>
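Until the underlying split-generation issue is addressed, one common workaround for the zero-byte-file failures in the two ORC tickets above is to hand the reader an explicit list of non-empty files instead of a whole directory. A hedged, self-contained sketch (plain JVM file APIs; `NonEmptyFiles` is an illustrative helper name, not part of Spark or Hive):

```scala
import java.nio.file.{Files, Path, Paths}

// Workaround sketch, not a fix in Spark itself: collect only the non-empty
// regular files under a directory, so zero-byte files never reach
// OrcInputFormat's split generation.
object NonEmptyFiles {
  def list(dir: String): List[Path] = {
    val stream = Files.list(Paths.get(dir)) // must be closed when done
    try {
      val it = stream.iterator()
      var acc = List.empty[Path]
      while (it.hasNext) {
        val p = it.next()
        if (Files.isRegularFile(p) && Files.size(p) > 0) acc = p :: acc
      }
      acc.reverse
    } finally stream.close()
  }
}
```

The surviving paths can then be passed to the reader individually (e.g. joined into an explicit path list) rather than reading the partition directory as a whole. On HDFS the same filtering would use the Hadoop `FileSystem` API instead of `java.nio.file`.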
[jira] [Comment Edited] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953201#comment-15953201 ]

Cyril de Vogelaere edited comment on SPARK-20180 at 4/3/17 9:57 AM:
--------------------------------------------------------------------

{quote}Why not let the default be Int.MaxValue?{quote}
=> I'm also ok with a default of Int.MaxValue, if the special value zero is really something you are against.

{quote}If that's what this is about, update the title to reflect it.{quote}
=> I will gladly do that, if you think the current title is misleading.

{quote}This is a behavior change by default, so we should think carefully about it.{quote}
=> Yes, I agree.

{quote}What are the downsides – why would someone have ever made it 10? Presumably, performance.{quote}
=> The changed code consists simply of an additional condition in an if. If you want to see a graph, I have one that tests the differences in performance, but on my implementation optimised for single-item patterns, so it wouldn't be relevant. If you are worried about a performance drop, I can do additional tests on the two lines I changed. If you want me to use some particular dataset, I will also gladly oblige; just say the word, and you will have them by tomorrow.
So it would be less about what impact it has on performance, since it would be negligible (again, I'm ready to prove that if you want me to), and more about whether that feature seems needed or not. Which, I agree, is debatable.
Also, whichever senior implemented it that way left this comment in the original PrefixSpan code:
{code}// TODO: support unbounded pattern length when maxPatternLength = 0{code}
which is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, since this thread would have established that it isn't needed.

{quote}You mention tests don't end and haven't established it's not due to your change.{quote}
=> I'm establishing that right now, as I said. Also, they are ending, but they are really, really slow.

{quote}I don't think we can proceed with this in this state, right?{quote}
=> I will leave the decision to you.


  was (Author: syrux):
=> Why not let the default be Int.MaxValue? I'm also ok with a default Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consist simply in an additionnal condition in an if. If you want to see a graph, I have one that test the differences in performances, but on my implementation optimised for single-item pattern. So it wouldn't be relevant, if you are worried of performance drop, I can do additional tests, on the two lines I changed. If you want me to use some particular dataset, I will also gladly oblige. Just say the word, and you will have them by tommorow. So it would be less about what impact it has on the performance, since it would be negligeable (again, i'm ready to prove that if you want me to), but about whether that feature seems needed or not. Which I agree, is debatable. Also, whichever senior implemented it that way, left this comment : // TODO: support unbounded pattern length when maxPatternLength = 0 Which you can find in the original code, and is the reason I created this Jira's thread first. Among the list of improvement I want to propose. You can find that line in the PrefixSpan code if you don't believe me. If theses change are rejected, then when I have the occasion, I will remove that line. So it would establish it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. Also, they are ending, but they are really really slow. I don't think we can proceed with this in this state, right? => I will leave the decision to you
[jira] [Assigned] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-19641:
-----------------------------------
    Assignee: Hyukjin Kwon

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19641
>                 URL: https://issues.apache.org/jira/browse/SPARK-19641
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nathan Howell
>            Assignee: Hyukjin Kwon
>             Fix For: 2.2.0
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no columns. This occurs when one document contains a valid JSON value (such as a string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached when merging a {{StringType}} and a {{StructType}}, the resulting type will be a {{StringType}}, which is then discarded because a {{StructType}} is expected.
[jira] [Resolved] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19641.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17492
[https://github.com/apache/spark/pull/17492]
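The failure mode fixed in SPARK-19641 can be sketched without Spark. The following is a simplified model in plain Scala — `SimpleType`, `RootMerge`, and friends are illustrative assumptions, not Spark's actual `JsonInferSchema` code. The default merge case widens a struct/string conflict to a string root, which a caller expecting a struct then discards, yielding the empty (zero-column) schema described in the ticket:

```scala
// Simplified model of JSON root-type reconciliation; it mirrors the
// pre-fix behavior described in the ticket, not Spark's real code.
sealed trait SimpleType
case object SimpleString extends SimpleType
final case class SimpleStruct(fields: Set[String]) extends SimpleType

object RootMerge {
  // Default case: incompatible root types widen to a string type.
  def merge(a: SimpleType, b: SimpleType): SimpleType = (a, b) match {
    case (SimpleStruct(f1), SimpleStruct(f2)) => SimpleStruct(f1 ++ f2)
    case _                                    => SimpleString
  }

  // The inferred schema keeps only struct roots, so a string result is
  // dropped entirely: one scalar document wipes out all inferred columns.
  def inferColumns(docs: Seq[SimpleType]): Set[String] =
    docs.reduce(merge) match {
      case SimpleStruct(fields) => fields
      case _                    => Set.empty
    }
}
```

In this model, mixing one scalar document into a collection of objects collapses the inferred column set to nothing, which is the symptom the ticket reports for {{DROPMALFORMED}} mode.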
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953201#comment-15953201 ] Cyril de Vogelaere commented on SPARK-20180: Why not let the default be Int.MaxValue? => I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it. => Yes, I agree. What are the downsides -- why would someone have ever made it 10? presumably, performance. => The changed code consists simply of an additional condition in an if statement. If you want to see a graph, I have one that tests the difference in performance, but it was run on my implementation optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed; if you want me to use a particular dataset, I will gladly oblige. Just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), than about whether the feature seems needed or not, which, I agree, is debatable. Also, whoever originally implemented it this way left the comment // TODO: support unbounded pattern length when maxPatternLength = 0 in the PrefixSpan code, which is the reason I created this JIRA in the first place, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, to establish that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now, as I said.
I don't think we can proceed with this in this state, right? => I will leave the decision to you > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
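The sentinel-value convention debated in this thread (0 meaning "unlimited") can be implemented with a single extra condition, by mapping the sentinel to Int.MaxValue internally so downstream length checks need no special case. This is an illustrative Java sketch, not the actual PrefixSpan code; the method name is hypothetical.

```java
public class MaxPatternLength {
    // Hypothetical helper: interpret the proposed sentinel 0 as "unlimited"
    // by mapping it to Integer.MAX_VALUE; explicit bounds pass through.
    static int effectiveMaxPatternLength(int maxPatternLength) {
        if (maxPatternLength < 0) {
            throw new IllegalArgumentException("maxPatternLength must be >= 0");
        }
        return maxPatternLength == 0 ? Integer.MAX_VALUE : maxPatternLength;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxPatternLength(10)); // explicit bound is kept
        System.out.println(effectiveMaxPatternLength(0));  // sentinel: unlimited
    }
}
```

The alternative raised by Sean Owen, simply defaulting the parameter to Int.MaxValue, avoids the sentinel entirely at the cost of a less discoverable "unlimited" setting.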
[jira] [Assigned] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19969: -- Assignee: yuhao yang > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: yuhao yang > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19969. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17324 [https://github.com/apache/spark/pull/17324] > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API
[ https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953199#comment-15953199 ] Hyukjin Kwon commented on SPARK-20090: -- [~josephkb], gentle ping. > Add StructType.fieldNames to Python API > --- > > Key: SPARK-20090 > URL: https://issues.apache.org/jira/browse/SPARK-20090 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Priority: Trivial > > The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would > be nice if the Python {{StructType}} did as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20108) Spark query is getting failed with exception
[ https://issues.apache.org/jira/browse/SPARK-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953196#comment-15953196 ] Hyukjin Kwon commented on SPARK-20108: -- It will help other guys like me to track down the problem and solve this. > Spark query is getting failed with exception > > > Key: SPARK-20108 > URL: https://issues.apache.org/jira/browse/SPARK-20108 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: ZS EDGE > > In our project we have implemented a logic where we programatically generate > spark queries. These queries are executed as a sub query and below is the > sample query-- > sqlContext.sql("INSERT INTO TABLE > test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT > 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 > > 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' > FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN > (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))") > While executing the above code in the pyspark it is throwing below exception-- > FAILS>> > .spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > [Stage 32:=> (10 + 5) / > 26]17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 > failed 4 times; aborting job > 17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at
[jira] [Commented] (SPARK-20108) Spark query is getting failed with exception
[ https://issues.apache.org/jira/browse/SPARK-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953195#comment-15953195 ] Hyukjin Kwon commented on SPARK-20108: -- It seems almost impossible to reproduce to me. Do you mind if I ask a self-reproducer? > Spark query is getting failed with exception > > > Key: SPARK-20108 > URL: https://issues.apache.org/jira/browse/SPARK-20108 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: ZS EDGE > > In our project we have implemented a logic where we programatically generate > spark queries. These queries are executed as a sub query and below is the > sample query-- > sqlContext.sql("INSERT INTO TABLE > test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT > 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 > > 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' > FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN > (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))") > While executing the above code in the pyspark it is throwing below exception-- > FAILS>> > .spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > [Stage 32:=> (10 + 5) / > 26]17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 > failed 4 times; aborting job > 17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at
[jira] [Updated] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20180: --- Summary: Add a special value for unlimited max pattern length in Prefix span, and set it as default. (was: Unlimited max pattern length in Prefix span) > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20185) csv decompressed incorrectly with extension other than 'gz'
[ https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953183#comment-15953183 ] Hyukjin Kwon edited comment on SPARK-20185 at 4/3/17 9:28 AM: -- {{codec}} or {{compression}} is an option for writing out as documented. It seems the workaround is not so difficult and the Hadoop's behaviour looks sensible to me as well. was (Author: hyukjin.kwon): {{codec}} or {{compression}} is an option for writing out as documented. It seems the workaround is not so difficult and the behaviour looks reasonable to me as well. > csv decompressed incorrectly with extention other than 'gz' > --- > > Key: SPARK-20185 > URL: https://issues.apache.org/jira/browse/SPARK-20185 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Ran Mingxuan >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > With code below: > val start_time = System.currentTimeMillis() > val gzFile = spark.read > .format("com.databricks.spark.csv") > .option("header", "false") > .option("inferSchema", "false") > .option("codec", "gzip") > .load("/foo/someCsvFile.gz.bak") > gzFile.repartition(1).write.mode("overwrite").parquet("/foo/") > got error even if I indicated the codec: > WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable > 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to > the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. 
Visit > https://s.apache.org/sbnn-error > 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage > 12.0 (TID 977) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) > Have to add extension to GzipCodec to make my code run. > import org.apache.hadoop.io.compress.GzipCodec > class BakGzipCodec extends GzipCodec { > override def getDefaultExtension(): String = ".gz.bak" > } > I suppose the file loader should get file codec depending on option first, > and then to extension. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20185) csv decompressed incorrectly with extension other than 'gz'
[ https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953183#comment-15953183 ] Hyukjin Kwon commented on SPARK-20185: -- {{codec}} or {{compression}} is an option for writing out as documented. It seems the workaround is not so difficult and the behaviour looks reasonable to me as well. > csv decompressed incorrectly with extention other than 'gz' > --- > > Key: SPARK-20185 > URL: https://issues.apache.org/jira/browse/SPARK-20185 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Ran Mingxuan >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > With code below: > val start_time = System.currentTimeMillis() > val gzFile = spark.read > .format("com.databricks.spark.csv") > .option("header", "false") > .option("inferSchema", "false") > .option("codec", "gzip") > .load("/foo/someCsvFile.gz.bak") > gzFile.repartition(1).write.mode("overwrite").parquet("/foo/") > got error even if I indicated the codec: > WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable > 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to > the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. 
Visit > https://s.apache.org/sbnn-error > 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage > 12.0 (TID 977) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) > Have to add extension to GzipCodec to make my code run. > import org.apache.hadoop.io.compress.GzipCodec > class BakGzipCodec extends GzipCodec { > override def getDefaultExtension(): String = ".gz.bak" > } > I suppose the file loader should get file codec depending on option first, > and then to extension. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
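The root cause in SPARK-20185 is that codec selection for reading is driven by the file suffix, not by the {{codec}} option (which only applies to writing). A minimal sketch of suffix-based lookup, mirroring the way Hadoop's CompressionCodecFactory resolves a codec from the extension; the class, map contents, and method name here are illustrative, not Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodecLookup {
    // Known suffix -> codec table (illustrative subset).
    static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".gz", "GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "BZip2Codec");
    }

    // Resolve a codec purely from the file name's trailing suffix.
    static String resolveCodec(String path) {
        for (Map.Entry<String, String> e : CODEC_BY_SUFFIX.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no match: the file is read as-is, i.e. never decompressed
    }

    public static void main(String[] args) {
        System.out.println(resolveCodec("someCsvFile.gz"));     // GzipCodec
        System.out.println(resolveCodec("someCsvFile.gz.bak")); // null
    }
}
```

This is why the reporter's `.gz.bak` file is read as plain text, and why registering a custom codec whose default extension is `.gz.bak` works around it.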
[jira] [Resolved] (SPARK-9002) KryoSerializer initialization does not include 'Array[Int]'
[ https://issues.apache.org/jira/browse/SPARK-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9002. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17482 [https://github.com/apache/spark/pull/17482] > KryoSerializer initialization does not include 'Array[Int]' > --- > > Key: SPARK-9002 > URL: https://issues.apache.org/jira/browse/SPARK-9002 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: MacBook Pro, OS X 10.10.4, Spark 1.4.0, master=local[*], > IntelliJ IDEA. >Reporter: Randy Kerber >Priority: Minor > Labels: easyfix, newbie > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > The object KryoSerializer (inside KryoRegistrator.scala) contains a list of > classes that are automatically registered with Kryo. That list includes: > Array\[Byte], Array\[Long], and Array\[Short]. Array\[Int] is missing from > that list. Can't think of any good reason it shouldn't also be included. > Note: This is first time creating an issue or contributing code to an apache > project. Apologies if I'm not following the process correct. Appreciate any > guidance or assistance. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953166#comment-15953166 ] Sean Owen commented on SPARK-20180: --- Why not let the default be Int.MaxValue? if that's what this is about, update the title to reflect it. This is a behavior change by default, so we should think carefully about it. What are the downsides -- why would someone have ever made it 10? presumably, performance. I don't see you've benchmarked the impact of making this unlimited by default. You mention tests don't end and haven't established it's not due to your change. I don't think we can proceed with this in this state, right? > Unlimited max pattern length in Prefix span > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20166) Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in CSV/JSON time related options
[ https://issues.apache.org/jira/browse/SPARK-20166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20166: - Assignee: Hyukjin Kwon Priority: Minor (was: Trivial) > Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in > CSV/JSON time related options > > > Key: SPARK-20166 > URL: https://issues.apache.org/jira/browse/SPARK-20166 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > We can use {{XXX}} format instead of {{ZZ}}. {{ZZ}} seems a > {{FastDateFormat}} specific Please see > https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone > and > https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html > {{ZZ}} supports "ISO 8601 extended format time zones" but it seems > {{FastDateFormat}} specific option. > It seems we better replace {{ZZ}} to {{XXX}} because they look use the same > strategy - > https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930. > > I also checked the codes and manually debugged it for sure. It seems both > cases use the same pattern {code}( Z|(?:[+-]\\d{2}(?::)\\d{2})) {code}. > Note that this is a fix about documentation not the behaviour change because > {{ZZ}} seems invalid date format in {{SimpleDateFormat}} as documented in > {{DataFrameReader}}: > {quote} >* `timestampFormat` (default `-MM-dd'T'HH:mm:ss.SSSZZ`): sets the > string that >* indicates a timestamp format. Custom date formats follow the formats at >* `java.text.SimpleDateFormat`. This applies to timestamp type. 
> {quote} > {code} > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > {code} > {code} > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20166) Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in CSV/JSON time related options
[ https://issues.apache.org/jira/browse/SPARK-20166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20166. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17489 [https://github.com/apache/spark/pull/17489] > Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in > CSV/JSON time related options > > > Key: SPARK-20166 > URL: https://issues.apache.org/jira/browse/SPARK-20166 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > We can use {{XXX}} format instead of {{ZZ}}. {{ZZ}} seems a > {{FastDateFormat}} specific Please see > https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone > and > https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html > {{ZZ}} supports "ISO 8601 extended format time zones" but it seems > {{FastDateFormat}} specific option. > It seems we better replace {{ZZ}} to {{XXX}} because they look use the same > strategy - > https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930. > > I also checked the codes and manually debugged it for sure. It seems both > cases use the same pattern {code}( Z|(?:[+-]\\d{2}(?::)\\d{2})) {code}. > Note that this is a fix about documentation not the behaviour change because > {{ZZ}} seems invalid date format in {{SimpleDateFormat}} as documented in > {{DataFrameReader}}: > {quote} >* `timestampFormat` (default `-MM-dd'T'HH:mm:ss.SSSZZ`): sets the > string that >* indicates a timestamp format. Custom date formats follow the formats at >* `java.text.SimpleDateFormat`. This applies to timestamp type. 
> {quote} > {code} > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > {code} > {code} > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
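The contrast shown in the scala-shell transcripts above can be reproduced with plain JDK classes: {{SimpleDateFormat}} understands {{XXX}} (ISO 8601 time zone) but rejects the ISO-style offset under {{ZZ}}, which only behaves that way in commons-lang's {{FastDateFormat}}. A small self-contained check:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class IsoTimezonePatterns {
    // Returns true when SimpleDateFormat can parse the timestamp with the pattern.
    static boolean canParse(String pattern, String timestamp) {
        try {
            new SimpleDateFormat(pattern).parse(timestamp);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ts = "2017-03-21T00:00:00.000-11:00";
        // XXX accepts the ISO 8601 "-11:00" offset...
        System.out.println(canParse("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", ts)); // true
        // ...while ZZ in SimpleDateFormat does not (it is FastDateFormat-specific).
        System.out.println(canParse("yyyy-MM-dd'T'HH:mm:ss.SSSZZ", ts));  // false
    }
}
```

This is why the fix is documentation-safe: {{XXX}} parses the same inputs under both formatters, whereas {{ZZ}} only works in {{FastDateFormat}}.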
[jira] [Commented] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953160#comment-15953160 ] Cyril de Vogelaere commented on SPARK-20180:
Can you not just set a very large max, like Int.MaxValue or similar? => Yes, I said that in the fourth paragraph of my last comment. A careful user could always set it to Int.MaxValue and never have problems in practice. Still, it doesn't change the fact that I advocate for that special value (0) as the default, since it would be nice if, on a first run and no matter the dataset, all solution patterns were found, even those longer than 10.
It's not normal for tests to run more than a couple hours. You need to see why. Is your test of unlimited max pattern stuck? => It's not my test per se, it's the dev/run-tests suite which we are asked to run before creating a pull request. I tested with my few changes and it ran for a day and a half. I'm re-running it now on the current state of the library, without my changes, and it doesn't seem faster ... for now at least ... So I'm pretty sure I didn't break anything there; for now the errors seem the same too, but I haven't taken a deep look at them.
> Unlimited max pattern length in Prefix span
> ---
>
> Key: SPARK-20180
> URL: https://issues.apache.org/jira/browse/SPARK-20180
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.0
> Reporter: Cyril de Vogelaere
> Priority: Minor
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output.
> The current default maxPatternLength value is 10.
> This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all patterns in his dataset without looking at this parameter.
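The proposal boils down to a sentinel convention: treat 0 as "no limit". A hypothetical helper (the name {{effectiveMaxPatternLength}} is ours, not Spark's) sketches how such a parameter could be normalized internally, which also covers the "careful user sets Int.MaxValue" workaround discussed above:

```java
public class MaxPatternLength {
    // Hypothetical normalization: the proposed sentinel 0 maps to "unlimited"
    // (Integer.MAX_VALUE); any positive limit passes through unchanged.
    static int effectiveMaxPatternLength(int maxPatternLength) {
        if (maxPatternLength < 0) {
            throw new IllegalArgumentException("maxPatternLength must be >= 0");
        }
        return maxPatternLength == 0 ? Integer.MAX_VALUE : maxPatternLength;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxPatternLength(0));  // 2147483647
        System.out.println(effectiveMaxPatternLength(10)); // 10
    }
}
```

With this convention the public setter keeps its signature; only the interpretation of 0 changes, so existing callers passing positive limits are unaffected.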
[jira] [Commented] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953153#comment-15953153 ] Apache Spark commented on SPARK-15352: -- User 'lins05' has created a pull request for this issue: https://github.com/apache/spark/pull/17519
> Topology aware block replication
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
> Issue Type: New Feature
> Components: Block Manager, Mesos, Spark Core, YARN
> Reporter: Shubham Chopra
> Assignee: Shubham Chopra
>
> With cached RDDs, Spark can be used for online analytics where it is used to respond to online queries. But loss of RDD partitions due to node/executor failures can cause huge delays in such use cases as the data would have to be regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently resilient to node failures when multiple executors are started on the same node. Block replication currently chooses a peer at random, and this peer could also exist on the same host.
> This effort would add topology aware replication to Spark that can be enabled with pluggable strategies. For ease of development/review, this is being broken down into three major work-efforts:
> 1. Making peer selection for replication pluggable
> 2. Providing pluggable implementations for providing topology and topology aware replication
> 3. Pro-active replenishment of lost blocks
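The "peer on the same host" problem behind work-effort 1 can be illustrated without any Spark internals. This sketch is not Spark's actual replication-policy API; the {{Peer}} type and {{prioritize}} method are illustrative. A topology-aware strategy prefers peers on other hosts, so a single host failure cannot take out every replica:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopologyAwarePeers {
    // Illustrative peer record: the host an executor runs on, and its id.
    static class Peer {
        final String host;
        final String executorId;
        Peer(String host, String executorId) { this.host = host; this.executorId = executorId; }
    }

    // Put off-host peers first; fall back to same-host peers only when no
    // off-host peer exists (random selection could pick a same-host peer).
    static List<Peer> prioritize(String localHost, List<Peer> peers) {
        List<Peer> offHost = new ArrayList<>();
        List<Peer> sameHost = new ArrayList<>();
        for (Peer p : peers) {
            (p.host.equals(localHost) ? sameHost : offHost).add(p);
        }
        offHost.addAll(sameHost);
        return offHost;
    }

    public static void main(String[] args) {
        List<Peer> peers = Arrays.asList(
            new Peer("host-a", "exec-1"),
            new Peer("host-b", "exec-2"),
            new Peer("host-a", "exec-3"));
        // Replicating from host-a: the host-b peer is preferred.
        System.out.println(prioritize("host-a", peers).get(0).executorId); // exec-2
    }
}
```

Making this ordering pluggable, as the issue proposes, would let a rack-aware or zone-aware strategy slot in the same way.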
[jira] [Assigned] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15352:
Assignee: Shubham Chopra (was: Apache Spark)
[jira] [Assigned] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15352:
Assignee: Apache Spark (was: Shubham Chopra)
[jira] [Assigned] (SPARK-19985) Some ML Models error when copy or do not set parent
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19985: -- Assignee: Bryan Cutler
> Some ML Models error when copy or do not set parent
> ---
>
> Key: SPARK-19985
> URL: https://issues.apache.org/jira/browse/SPARK-19985
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Fix For: 2.2.0
>
> Some ML Models fail when copied, due to not having a default constructor yet implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set the parent when the model is copied. These models were missing the usual check for this in the test suites.
> Models with issues are:
> * RFormulaModel
> * MultilayerPerceptronClassificationModel
> * BucketedRandomProjectionLSHModel
> * MinHashLSH
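The {{defaultCopy}} failure mode is reflection at work: a default copy implementation reflectively instantiates the model class, which fails when the expected constructor is missing. A Spark-free sketch of that mechanism (the {{defaultCopy}} stand-in and the model classes here are illustrative, not Spark's actual code):

```java
import java.lang.reflect.Constructor;

public class DefaultCopyDemo {
    // A "model" exposing the constructor the reflective copy expects.
    static class GoodModel {
        final String uid;
        GoodModel(String uid) { this.uid = uid; }
    }

    // A "model" lacking that constructor: reflective copy cannot rebuild it.
    static class BadModel {
        final String uid;
        final double[] weights;
        BadModel(String uid, double[] weights) { this.uid = uid; this.weights = weights; }
    }

    // Illustrative stand-in for a reflection-based default copy: look up a
    // single-String (uid) constructor and instantiate a fresh object from it.
    static <T> T defaultCopy(Class<T> cls, String uid) throws ReflectiveOperationException {
        Constructor<T> ctor = cls.getDeclaredConstructor(String.class);
        return ctor.newInstance(uid);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(defaultCopy(GoodModel.class, "m1").uid); // m1
        try {
            defaultCopy(BadModel.class, "m2");
        } catch (NoSuchMethodException e) {
            System.out.println("no matching constructor"); // this branch is taken
        }
    }
}
```

This is why a model that routes {{copy}} through a reflective default needs either the matching constructor or a hand-written {{copy}}; the second class of bugs in the issue (parent not set on the copy) is a separate, non-reflective omission.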
[jira] [Resolved] (SPARK-19985) Some ML Models error when copy or do not set parent
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19985.
Resolution: Fixed
Fix Version/s: 2.2.0
Issue resolved by pull request 17326 [https://github.com/apache/spark/pull/17326]