[jira] [Comment Edited] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch
[ https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646800#comment-15646800 ] Genmao Yu edited comment on SPARK-18187 at 11/8/16 7:53 AM: [~marmbrus] +1 to your +"I think the configuration should only be used when deciding if we should perform a new compaction. The identification of a compaction vs a delta should be done based on the file itself."+ h4. How to set "compactInterval"? {{compactInterval}} can be set by the user the first time. If the user later changes it, we should validate the new value and then use it. If {{isDeletingExpiredLog=false}}, we can recover the original {{compactInterval}} by computing the interval between batch ids carrying the {{.compact}} suffix, and then check it against the user setting; if {{isDeletingExpiredLog=true}}, we can just use the user setting, because there are no expired metadata logs. was (Author: unclegen): [~marmbrus] +1 to your +"I think the configuration should only be used when deciding if we should perform a new compaction. The identification of a compaction vs a delta should be done based on the file itself."+ h4. How to set "compactInterval"? {{compactInterval}} can be set by the user the first time. If the user later changes it, we should validate the new value and then use it. If {{isDeletingExpiredLog=false}}, we can recover the original {{compactInterval}} by computing the interval between batch ids carrying the {{.compact}} suffix, and then check it against the user setting; if {{isDeletingExpiredLog=true}}, we can just use the user setting, because there are no expired metadata logs. > CompactibleFileStreamLog should not rely on "compactInterval" to detect a > compaction batch > -- > > Key: SPARK-18187 > URL: https://issues.apache.org/jira/browse/SPARK-18187 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Shixiong Zhu >Priority: Critical > > Right now CompactibleFileStreamLog uses compactInterval to check if a batch > is a compaction batch.
However, since this conf is controlled by the user, > they may just change it, and CompactibleFileStreamLog will just silently > return the wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
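The file-based detection endorsed above can be sketched in plain Scala. This is an illustrative stand-in, not the actual CompactibleFileStreamLog code; the file layout (deltas named `<batchId>`, compactions named `<batchId>.compact`) mirrors what the comment describes:

```scala
// Hypothetical sketch: classify a batch by its log file name, not by the
// (user-changeable) compactInterval configuration.
object CompactionDetection {
  val CompactSuffix = ".compact"

  // A compaction batch is identified by the file itself.
  def isCompactionFile(fileName: String): Boolean =
    fileName.endsWith(CompactSuffix)

  // When expired logs are kept (isDeletingExpiredLog=false), the original
  // interval can be recovered from consecutive compaction batch ids.
  def deriveInterval(compactedBatchIds: Seq[Long]): Option[Long] =
    compactedBatchIds.sorted.sliding(2).collectFirst {
      case Seq(a, b) => b - a
    }
}
```

A recovered interval can then be checked against the user's current setting before that setting is trusted.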
[jira] [Commented] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch
[ https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646800#comment-15646800 ] Genmao Yu commented on SPARK-18187: --- [~marmbrus] +1 to your +"I think the configuration should only be used when deciding if we should perform a new compaction. The identification of a compaction vs a delta should be done based on the file itself."+ h4. How to set "compactInterval"? {{compactInterval}} can be set by the user the first time. If the user later changes it, we should validate the new value and then use it. If {{isDeletingExpiredLog=false}}, we can recover the original {{compactInterval}} by computing the interval between batch ids carrying the {{.compact}} suffix, and then check it against the user setting; if {{isDeletingExpiredLog=true}}, we can just use the user setting, because there are no expired metadata logs. > CompactibleFileStreamLog should not rely on "compactInterval" to detect a > compaction batch > -- > > Key: SPARK-18187 > URL: https://issues.apache.org/jira/browse/SPARK-18187 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Shixiong Zhu >Priority: Critical > > Right now CompactibleFileStreamLog uses compactInterval to check if a batch > is a compaction batch. However, since this conf is controlled by the user, > they may just change it, and CompactibleFileStreamLog will just silently > return the wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s
[ https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646795#comment-15646795 ] Jason Pan commented on SPARK-18353: --- In org.apache.spark.deploy.Client there is this line: conf.set("spark.rpc.askTimeout", "10") Should we remove this line? When using REST, in org.apache.spark.deploy.rest.RestSubmissionClient, there is no such line. > spark.rpc.askTimeout default value is not 120s > -- > > Key: SPARK-18353 > URL: https://issues.apache.org/jira/browse/SPARK-18353 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.1 > Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 > 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Jason Pan >Priority: Critical > > In http://spark.apache.org/docs/latest/configuration.html > spark.rpc.askTimeout 120s Duration for an RPC ask operation to wait > before timing out > The default value is 120s as documented. > However, when I run "spark-submit": > the cmd is: > Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" > "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" > "-Xmx1024M" "-Dspark.eventLog.enabled=true" > "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" > "-Dspark.app.name=org.apache.spark.examples.SparkPi" > "-Dspark.submit.deployMode=cluster" > "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar" > "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" > "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" > "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" > "org.apache.spark.deploy.worker.DriverWrapper" > "spark://Worker@9.111.159.127:7103" > "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar" > "org.apache.spark.examples.SparkPi" "1000" > -Dspark.rpc.askTimeout=10 > The value is 10, which is not the same as documented.
> Note: when I submit to the REST URL, this issue does not occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
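A minimal, self-contained sketch of what the comment describes (the Conf class here is an illustrative stand-in, not Spark's SparkConf): a hard-coded set of spark.rpc.askTimeout in the deploy client shadows the documented 120s default, because a default only applies when the key is unset.

```scala
// Illustrative stand-in for a configuration object with fallback defaults.
class Conf {
  private val settings = scala.collection.mutable.Map[String, String]()
  def set(key: String, value: String): Conf = { settings(key) = value; this }
  def get(key: String, default: String): String =
    settings.getOrElse(key, default)
}

val conf = new Conf
// What the legacy deploy client path effectively does:
conf.set("spark.rpc.askTimeout", "10")
// The documented default is only a fallback for an unset key:
val effective = conf.get("spark.rpc.askTimeout", "120s")  // "10", not "120s"
```

The REST submission path never sets the key, so it falls through to the documented default, which matches the "no issue via REST" observation.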
[jira] [Updated] (SPARK-18353) spark.rpc.askTimeout default value is not 120s
[ https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Pan updated SPARK-18353: -- Description: In http://spark.apache.org/docs/latest/configuration.html spark.rpc.askTimeout 120s Duration for an RPC ask operation to wait before timing out The default value is 120s as documented. However, when I run "spark-submit", the cmd is: Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.eventLog.enabled=true" "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.submit.deployMode=cluster" "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar" "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@9.111.159.127:7103" "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar" "org.apache.spark.examples.SparkPi" "1000" -Dspark.rpc.askTimeout=10 The value is 10, which is not the same as documented. Note: when I submit to the REST URL, this issue does not occur. was: for the doc: > spark.rpc.askTimeout default value is not 120s > -- > > Key: SPARK-18353 > URL: https://issues.apache.org/jira/browse/SPARK-18353 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.1 > Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 > 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Jason Pan >Priority: Critical > > In http://spark.apache.org/docs/latest/configuration.html > spark.rpc.askTimeout 120s Duration for an RPC ask operation to wait > before timing out > The default value is 120s as documented.
> However, when I run "spark-submit": > the cmd is: > Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" > "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" > "-Xmx1024M" "-Dspark.eventLog.enabled=true" > "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" > "-Dspark.app.name=org.apache.spark.examples.SparkPi" > "-Dspark.submit.deployMode=cluster" > "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar" > "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" > "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" > "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" > "org.apache.spark.deploy.worker.DriverWrapper" > "spark://Worker@9.111.159.127:7103" > "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar" > "org.apache.spark.examples.SparkPi" "1000" > -Dspark.rpc.askTimeout=10 > The value is 10, which is not the same as documented. > Note: when I submit to the REST URL, this issue does not occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18353) spark.rpc.askTimeout default value is not 120s
Jason Pan created SPARK-18353: - Summary: spark.rpc.askTimeout default value is not 120s Key: SPARK-18353 URL: https://issues.apache.org/jira/browse/SPARK-18353 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1, 1.6.1 Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux Reporter: Jason Pan Priority: Critical for the doc: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16496) Add wholetext as option for reading text in SQL.
[ https://issues.apache.org/jira/browse/SPARK-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16496: Target Version/s: 2.2.0 > Add wholetext as option for reading text in SQL. > > > Key: SPARK-16496 > URL: https://issues.apache.org/jira/browse/SPARK-16496 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prashant Sharma > > In many text analysis problems, it is often undesirable for the rows to > be split by "\n". There exists a wholeText reader in the RDD API, and this JIRA > just adds the same support to the Dataset API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-17969: - > I think it's user unfriendly to process standard json file with DataFrame > -- > > Key: SPARK-17969 > URL: https://issues.apache.org/jira/browse/SPARK-17969 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.1 >Reporter: Jianfei Wang >Priority: Minor > > Currently, with the DataFrame API, we can't load a standard JSON file directly; > maybe we can provide an overloaded method to handle this. The logic is as > below: > ``` > val df = spark.sparkContext.wholeTextFiles("data/test.json") > val json_rdd = df.map( x => x.toString.replaceAll("\\s+","")).map{ x => > val index = x.indexOf(',') > x.substring(index + 1, x.length - 1) > } > val json_df = spark.read.json(json_rdd) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10840) SparkSQL doesn't work well with JSON
[ https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-10840. --- Resolution: Duplicate > SparkSQL doesn't work well with JSON > > > Key: SPARK-10840 > URL: https://issues.apache.org/jira/browse/SPARK-10840 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jordan Sarraf >Priority: Minor > Labels: JSON, Scala, SparkSQL > > Well-formed JSON doesn't work with the 1.5.1 version while using > sqlContext.read.json(""): > { > "employees": { > "employee": [ > { > "name": "Mia", > "surname": "Radison", > "mobile": "7295913821", > "email": "miaradi...@sparky.com" > }, > { > "name": "Thor", > "surname": "Kovaskz", > "mobile": "8829177193", > "email": "tkova...@sparky.com" > }, > { > "name": "Bindy", > "surname": "Kvuls", > "mobile": "5033828845", > "email": "bind...@sparky.com" > } > ] > } > } > For the above, the following error is obtained: > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2) > scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2) > Whereas this works fine, because each record is on a single line: > [ > {"name": "Mia","surname": "Radison","mobile": "7295913821","email": > "miaradi...@sparky.com"}, > {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": > "tkova...@sparky.com"}, > {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": > "bind...@sparky.com"} > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-17969. --- Resolution: Duplicate > I think it's user unfriendly to process standard json file with DataFrame > -- > > Key: SPARK-17969 > URL: https://issues.apache.org/jira/browse/SPARK-17969 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.1 >Reporter: Jianfei Wang >Priority: Minor > > Currently, with the DataFrame API, we can't load a standard JSON file directly; > maybe we can provide an overloaded method to handle this. The logic is as > below: > ``` > val df = spark.sparkContext.wholeTextFiles("data/test.json") > val json_rdd = df.map( x => x.toString.replaceAll("\\s+","")).map{ x => > val index = x.indexOf(',') > x.substring(index + 1, x.length - 1) > } > val json_df = spark.read.json(json_rdd) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
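The quoted workaround stringifies each (path, content) pair from wholeTextFiles and slices it at the first comma, which breaks if the path itself contains a comma, and its whitespace stripping also glues together or mangles values inside JSON strings. A safer sketch (plain Scala, standing in for the RDD[(String, String)] elements) takes the content field directly:

```scala
// Take the file content from the (path, content) pair directly instead of
// slicing the tuple's toString. Collapsing runs of whitespace to a single
// space keeps the record on one line without gluing tokens together.
def contentOf(fileAndContent: (String, String)): String =
  fileAndContent._2.replaceAll("\\s+", " ").trim
```

Whether the whitespace collapsing is needed at all depends on the JSON reader; the point is to avoid parsing the tuple's string form.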
[jira] [Reopened] (SPARK-7366) Support multi-line JSON objects
[ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-7366: > Support multi-line JSON objects > --- > > Key: SPARK-7366 > URL: https://issues.apache.org/jira/browse/SPARK-7366 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Joe Halliwell >Priority: Minor > > h2. Background: why the existing formats aren't enough > The present object-per-line format for ingesting JSON data has a couple of > deficiencies: > 1. It's not itself JSON > 2. It's often harder for humans to read > The object-per-file format addresses these, but at a cost of producing many > files which can be unwieldy. > Since it is feasible to read and write large JSON files via streaming (and > many systems do) it seems reasonable to support them directly as an input > format. > h2. Suggested approach: use a depth hint > The key challenge is to find record boundaries without parsing the file from > the start i.e. given an offset, locate a nearby boundary. In the general case > this is impossible: you can't be sure you've identified the start of a > top-level record without tracing back to the start of the file. > However, if we know something more of the structure of the file i.e. maximum > object depth it seems plausible that we can do better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
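The record-boundary problem described above can be made concrete with a small Scala sketch (illustrative only): a string-aware brace-depth counter finds top-level record starts, but only because the scan begins at the start of the text; that full rescan is exactly the cost a depth hint would avoid.

```scala
// Find the start of the next top-level JSON object at or after `offset`
// by tracking brace depth (string-aware). Illustrative sketch; a real
// splitter would also need to handle top-level arrays and scalars.
def nextRecordStart(text: String, offset: Int): Option[Int] = {
  var depth = 0
  var inString = false
  var escaped = false
  var i = 0
  while (i < text.length) {
    val c = text(i)
    if (escaped) escaped = false
    else if (inString) c match {
      case '\\' => escaped = true
      case '"'  => inString = false
      case _    =>
    }
    else c match {
      case '"' => inString = true
      case '{' =>
        if (depth == 0 && i >= offset) return Some(i)
        depth += 1
      case '}' => depth -= 1
      case _   =>
    }
    i += 1
  }
  None
}
```

Starting the depth counter at an arbitrary offset instead of 0 is what requires the extra structural knowledge (e.g. maximum object depth) the suggestion describes.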
[jira] [Updated] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18352: Summary: Parse normal, multi-line JSON files (not just JSON Lines) (was: Parse normal JSON files (not just JSON Lines)) > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON Lines, i.e. each > record occupies a single line and records are separated by newlines. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7366) Support multi-line JSON objects
[ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-7366. -- Resolution: Duplicate > Support multi-line JSON objects > --- > > Key: SPARK-7366 > URL: https://issues.apache.org/jira/browse/SPARK-7366 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Joe Halliwell >Priority: Minor > > h2. Background: why the existing formats aren't enough > The present object-per-line format for ingesting JSON data has a couple of > deficiencies: > 1. It's not itself JSON > 2. It's often harder for humans to read > The object-per-file format addresses these, but at a cost of producing many > files which can be unwieldy. > Since it is feasible to read and write large JSON files via streaming (and > many systems do) it seems reasonable to support them directly as an input > format. > h2. Suggested approach: use a depth hint > The key challenge is to find record boundaries without parsing the file from > the start i.e. given an offset, locate a nearby boundary. In the general case > this is impossible: you can't be sure you've identified the start of a > top-level record without tracing back to the start of the file. > However, if we know something more of the structure of the file i.e. maximum > object depth it seems plausible that we can do better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7366) Support multi-line JSON objects
[ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-7366. -- Resolution: Fixed I'm closing this in favor of https://issues.apache.org/jira/browse/SPARK-18352 In reality, it's unlikely that each file is so enormous that we must split it. If we don't do file splits, then finding record boundaries is not really an issue, and a single Spark task can stream through an entire file to do the parsing. > Support multi-line JSON objects > --- > > Key: SPARK-7366 > URL: https://issues.apache.org/jira/browse/SPARK-7366 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Joe Halliwell >Priority: Minor > > h2. Background: why the existing formats aren't enough > The present object-per-line format for ingesting JSON data has a couple of > deficiencies: > 1. It's not itself JSON > 2. It's often harder for humans to read > The object-per-file format addresses these, but at a cost of producing many > files which can be unwieldy. > Since it is feasible to read and write large JSON files via streaming (and > many systems do) it seems reasonable to support them directly as an input > format. > h2. Suggested approach: use a depth hint > The key challenge is to find record boundaries without parsing the file from > the start i.e. given an offset, locate a nearby boundary. In the general case > this is impossible: you can't be sure you've identified the start of a > top-level record without tracing back to the start of the file. > However, if we know something more of the structure of the file i.e. maximum > object depth it seems plausible that we can do better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18352) Parse normal JSON files (not just JSON Lines)
Reynold Xin created SPARK-18352: --- Summary: Parse normal JSON files (not just JSON Lines) Key: SPARK-18352 URL: https://issues.apache.org/jira/browse/SPARK-18352 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Spark currently can only parse JSON files that are JSON Lines, i.e. each record occupies a single line and records are separated by newlines. In reality, a lot of users want to use Spark to parse actual JSON files, and are surprised to learn that it doesn't do that. We can introduce a new mode (wholeJsonFile?) in which we don't split the files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18351) from_json and to_json for parsing JSON for string columns
[ https://issues.apache.org/jira/browse/SPARK-18351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18351. - Resolution: Fixed Fix Version/s: 2.1.0 > from_json and to_json for parsing JSON for string columns > - > > Key: SPARK-18351 > URL: https://issues.apache.org/jira/browse/SPARK-18351 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18295) Match up to_json to from_json in null safety
[ https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18295: Issue Type: Sub-task (was: Bug) Parent: SPARK-18351 > Match up to_json to from_json in null safety > > > Key: SPARK-18295 > URL: https://issues.apache.org/jira/browse/SPARK-18295 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > {code} > scala> val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: struct<_1: int>] > scala> df.show() > +----+ > | a| > +----+ > | [1]| > |null| > +----+ > scala> df.select(to_json($"a")).show() > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193) > at > org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) > at > org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48) > at > org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18295) Match up to_json to from_json in null safety
[ https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18295: Assignee: Hyukjin Kwon > Match up to_json to from_json in null safety > > > Key: SPARK-18295 > URL: https://issues.apache.org/jira/browse/SPARK-18295 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > {code} > scala> val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: struct<_1: int>] > scala> df.show() > +----+ > | a| > +----+ > | [1]| > |null| > +----+ > scala> df.select(to_json($"a")).show() > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193) > at > org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) > at > org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48) > at > org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
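The null-safety being fixed here can be illustrated with a tiny pure-Scala stand-in (hypothetical; Spark's actual to_json operates on Catalyst rows via a JacksonGenerator): the expression should map a null input through as null instead of dereferencing it.

```scala
// Null-safe conversion: None maps to None instead of throwing an NPE,
// mirroring the null handling from_json already provides.
def nullSafeToJson(value: Option[Map[String, Int]]): Option[String] =
  value.map(_.map { case (k, v) => s""""$k": $v""" }.mkString("{", ", ", "}"))
```

With this shape, the null row in the example above would simply produce a null output cell rather than crash the projection.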
[jira] [Updated] (SPARK-17764) to_json function for parsing Structs to json Strings
[ https://issues.apache.org/jira/browse/SPARK-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17764: Issue Type: Sub-task (was: Improvement) Parent: SPARK-18351 > to_json function for parsing Structs to json Strings > > > Key: SPARK-17764 > URL: https://issues.apache.org/jira/browse/SPARK-17764 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > After SPARK-17699, Spark now supports {{from_json}}. It would be nice to > have {{to_json}} too, in particular for writing out dataframes via > data sources that do not support nested structured types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18351) from_json and to_json for parsing JSON for string columns
Reynold Xin created SPARK-18351: --- Summary: from_json and to_json for parsing JSON for string columns Key: SPARK-18351 URL: https://issues.apache.org/jira/browse/SPARK-18351 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18260: Issue Type: Sub-task (was: Bug) Parent: SPARK-18351 > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Blocker > Fix For: 2.1.0 > > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was 
because the column that I called `from_json` on didn't exist for all > of my rows. Either from_json should be null-safe, or it should fail with a > better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17699) from_json function for parsing json Strings into Structs
[ https://issues.apache.org/jira/browse/SPARK-17699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17699: Issue Type: Sub-task (was: New Feature) Parent: SPARK-18351 > from_json function for parsing json Strings into Structs > > > Key: SPARK-17699 > URL: https://issues.apache.org/jira/browse/SPARK-17699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > Fix For: 2.1.0 > > > Today, we have good support for reading standalone JSON data. However, > sometimes (especially when reading from streaming sources such as Kafka) the > JSON is embedded in an envelope that has other information we'd like to > preserve. It would be nice if we could also parse JSON string columns, while > preserving the original JSON schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646704#comment-15646704 ] Reynold Xin commented on SPARK-18350: - I'm guessing the easiest way to do this is to change all the expressions that can be impacted by timezones to add an explicit timezone argument, and the analyzer automatically places the timezone argument in those expressions. cc [~hvanhovell] [~cloud_fan] [~smilegator] [~vssrinath] for input. > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution.
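The machine-timezone problem this ticket describes, and the proposed fix of threading an explicit timezone argument through timezone-sensitive expressions, can be sketched in plain Python. The hour_expr function and the "session config" here are hypothetical stand-ins, not Spark code:

```python
from datetime import datetime, timezone, timedelta

def make_tz(offset_hours):
    # Fixed-offset timezone, standing in for a session-configured zone.
    return timezone(timedelta(hours=offset_hours))

def hour_expr(ts_utc, tz):
    # A timezone-sensitive expression like hour(ts) takes tz explicitly;
    # in the proposal, the analyzer injects this argument from the
    # session setting instead of relying on the machine's local zone.
    return ts_utc.astimezone(tz).hour

ts = datetime(2016, 11, 8, 23, 30, tzinfo=timezone.utc)
session_tz = make_tz(9)  # e.g. a user session pinned to UTC+9
print(hour_expr(ts, session_tz))   # 8 (08:30 local, next day)
print(hour_expr(ts, make_tz(0)))   # 23
```

Two users with different session timezones then get different answers from the same expression, without the cluster machines' timezone ever mattering.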
[jira] [Created] (SPARK-18350) Support session local timezone
Reynold Xin created SPARK-18350: --- Summary: Support session local timezone Key: SPARK-18350 URL: https://issues.apache.org/jira/browse/SPARK-18350 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin As of Spark 2.1, Spark SQL assumes the machine timezone for datetime manipulation, which is bad if users are not in the same timezones as the machines, or if different users have different timezones. We should introduce a session local timezone setting that is used for execution.
[jira] [Commented] (SPARK-16545) Structured Streaming : foreachSink creates the Physical Plan multiple times per TriggerInterval
[ https://issues.apache.org/jira/browse/SPARK-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646665#comment-15646665 ] Mario Briggs commented on SPARK-16545: -- [~lwlin] I agree with the PR discussion. I am not terribly sure what the value of the 'Resolution' state should be when closing... 'Later', e.g., to indicate this is being fixed elsewhere, etc. > Structured Streaming : foreachSink creates the Physical Plan multiple times > per TriggerInterval > > > Key: SPARK-16545 > URL: https://issues.apache.org/jira/browse/SPARK-16545 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Mario Briggs >
[jira] [Created] (SPARK-18349) Update R API documentation on ml model summary
Felix Cheung created SPARK-18349: Summary: Update R API documentation on ml model summary Key: SPARK-18349 URL: https://issues.apache.org/jira/browse/SPARK-18349 Project: Spark Issue Type: Bug Components: ML, SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung It has been discovered that there is a fair bit of inconsistency in the documentation of summary functions, e.g. {code} #' @return \code{summary} returns a summary object of the fitted model, a list of components #' including formula, number of features, list of features, feature importances, number of #' trees, and tree weights setMethod("summary", signature(object = "GBTRegressionModel") {code} For instance, what should be listed for the return value? Should it be a name or a phrase, or should it be a list of items; and should there be a longer description of what they mean, or a reference link to the Scala doc? We will need to review this for all model summary implementations in mllib.R
[jira] [Created] (SPARK-18348) Improve tree ensemble model summary
Felix Cheung created SPARK-18348: Summary: Improve tree ensemble model summary Key: SPARK-18348 URL: https://issues.apache.org/jira/browse/SPARK-18348 Project: Spark Issue Type: Bug Components: ML, SparkR Affects Versions: 2.0.0, 2.1.0 Reporter: Felix Cheung During work on R APIs for tree ensemble models (e.g. Random Forest, GBT) it was discovered and discussed that - we don't have a good summary on nodes or trees for their observations, loss, probability and so on - we don't have a shared API with nicely formatted output We believe this could be a shared API that benefits multiple language bindings, including R, when available. For example, here is what R {code}rpart{code} shows for model summary: {code} Call: rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class") n= 81 CP nsplit rel error xerror xstd 1 0.17647059 0 1.000 1.000 0.2155872 2 0.01960784 1 0.8235294 0.9411765 0.2107780 3 0.0100 4 0.7647059 1.0588235 0.2200975 Variable importance Start Age Number 64 24 12 Node number 1: 81 observations, complexity param=0.1764706 predicted class=absent expected loss=0.2098765 P(node) = 1 class counts: 64 17 probabilities: 0.790 0.210 left son=2 (62 obs) right son=3 (19 obs) Primary splits: Start < 8.5 to the right, improve=6.762330, (0 missing) Number < 5.5 to the left, improve=2.866795, (0 missing) Age < 39.5 to the left, improve=2.250212, (0 missing) Surrogate splits: Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split) Node number 2: 62 observations, complexity param=0.01960784 predicted class=absent expected loss=0.09677419 P(node) = 0.7654321 class counts: 56 6 probabilities: 0.903 0.097 left son=4 (29 obs) right son=5 (33 obs) Primary splits: Start < 14.5 to the right, improve=1.0205280, (0 missing) Age < 55 to the left, improve=0.6848635, (0 missing) Number < 4.5 to the left, improve=0.2975332, (0 missing) Surrogate splits: Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split) Age < 16 to the left, agree=0.597, adj=0.138, (0 split) 
Node number 3: 19 observations predicted class=present expected loss=0.4210526 P(node) = 0.2345679 class counts: 8 11 probabilities: 0.421 0.579 Node number 4: 29 observations predicted class=absent expected loss=0 P(node) = 0.3580247 class counts: 29 0 probabilities: 1.000 0.000 Node number 5: 33 observations, complexity param=0.01960784 predicted class=absent expected loss=0.1818182 P(node) = 0.4074074 class counts: 27 6 probabilities: 0.818 0.182 left son=10 (12 obs) right son=11 (21 obs) Primary splits: Age < 55 to the left, improve=1.2467530, (0 missing) Start < 12.5 to the right, improve=0.2887701, (0 missing) Number < 3.5 to the right, improve=0.1753247, (0 missing) Surrogate splits: Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split) Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split) Node number 10: 12 observations predicted class=absent expected loss=0 P(node) = 0.1481481 class counts: 12 0 probabilities: 1.000 0.000 Node number 11: 21 observations, complexity param=0.01960784 predicted class=absent expected loss=0.2857143 P(node) = 0.2592593 class counts: 15 6 probabilities: 0.714 0.286 left son=22 (14 obs) right son=23 (7 obs) Primary splits: Age < 111 to the right, improve=1.71428600, (0 missing) Start < 12.5 to the right, improve=0.79365080, (0 missing) Number < 3.5 to the right, improve=0.07142857, (0 missing) Node number 22: 14 observations predicted class=absent expected loss=0.1428571 P(node) = 0.1728395 class counts: 12 2 probabilities: 0.857 0.143 Node number 23: 7 observations predicted class=present expected loss=0.4285714 P(node) = 0.08641975 class counts: 3 4 probabilities: 0.429 0.571 {code}
[jira] [Updated] (SPARK-18347) Infra for R - need qpdf on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-18347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18347: - Description: As a part of working on building the R package (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building the package and vignettes requires a tool called qpdf (for compressing PDFs) In R, it is looking for qpdf as such: Sys.which(Sys.getenv("R_QPDF", "qpdf")) i.e. which qpdf or whatever the export R_QPDF is pointing to. Otherwise it raises a warning as such: * checking for unstated dependencies in examples ... OK WARNING ‘qpdf’ is needed for checks on size reduction of PDFs cc [~shaneknapp] was: As a part of working on building the R package (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building the package and vignettes requires a tool called qpdf (for compressing PDFs) In R, it is looking for qpdf as such: Sys.which(Sys.getenv("R_QPDF", "qpdf")) i.e. which qpdf or whatever the export R_QPDF is pointing to. cc @shaneknapp > Infra for R - need qpdf on Jenkins > -- > > Key: SPARK-18347 > URL: https://issues.apache.org/jira/browse/SPARK-18347 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > As a part of working on building the R package > (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building > the package and vignettes requires a tool called qpdf (for compressing PDFs) > In R, it is looking for qpdf as such: > Sys.which(Sys.getenv("R_QPDF", "qpdf")) > i.e. which qpdf or whatever the export R_QPDF is pointing to. > Otherwise it raises a warning as such: > * checking for unstated dependencies in examples ... OK > WARNING > ‘qpdf’ is needed for checks on size reduction of PDFs > cc > [~shaneknapp]
[jira] [Updated] (SPARK-18347) Infra for R - need qpdf on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-18347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18347: - Description: As a part of working on building the R package (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building the package and vignettes requires a tool called qpdf (for compressing PDFs) In R, it is looking for qpdf as such: {code}Sys.which(Sys.getenv("R_QPDF", "qpdf")){code} i.e. which qpdf or whatever the export R_QPDF is pointing to. Otherwise it raises a warning as such: {code} * checking for unstated dependencies in examples ... OK WARNING ‘qpdf’ is needed for checks on size reduction of PDFs {code} cc [~shaneknapp] was: As a part of working on building the R package (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building the package and vignettes requires a tool called qpdf (for compressing PDFs) In R, it is looking for qpdf as such: Sys.which(Sys.getenv("R_QPDF", "qpdf")) i.e. which qpdf or whatever the export R_QPDF is pointing to. Otherwise it raises a warning as such: * checking for unstated dependencies in examples ... OK WARNING ‘qpdf’ is needed for checks on size reduction of PDFs cc [~shaneknapp] > Infra for R - need qpdf on Jenkins > -- > > Key: SPARK-18347 > URL: https://issues.apache.org/jira/browse/SPARK-18347 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > As a part of working on building the R package > (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building > the package and vignettes requires a tool called qpdf (for compressing PDFs) > In R, it is looking for qpdf as such: > {code}Sys.which(Sys.getenv("R_QPDF", "qpdf")){code} > i.e. which qpdf or whatever the export R_QPDF is pointing to. > Otherwise it raises a warning as such: > {code} > * checking for unstated dependencies in examples ... 
OK > WARNING > ‘qpdf’ is needed for checks on size reduction of PDFs > {code} > cc > [~shaneknapp]
[jira] [Created] (SPARK-18347) Infra for R - need qpdf on Jenkins
Felix Cheung created SPARK-18347: Summary: Infra for R - need qpdf on Jenkins Key: SPARK-18347 URL: https://issues.apache.org/jira/browse/SPARK-18347 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung As a part of working on building the R package (https://issues.apache.org/jira/browse/SPARK-18264) we discovered that building the package and vignettes requires a tool called qpdf (for compressing PDFs) In R, it is looking for qpdf as such: Sys.which(Sys.getenv("R_QPDF", "qpdf")) i.e. which qpdf or whatever the export R_QPDF is pointing to. cc @shaneknapp
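For reference, R's {{Sys.which(Sys.getenv("R_QPDF", "qpdf"))}} is an env-var-with-fallback lookup followed by a PATH search. A rough Python equivalent of the same resolution logic (the helper names here are ours, not part of R or Spark):

```python
import shutil

def tool_name(env, var="R_QPDF", default="qpdf"):
    # Same lookup R performs with Sys.getenv("R_QPDF", "qpdf"):
    # honor the environment variable if set, else fall back to "qpdf".
    return env.get(var, default)

def find_qpdf(env):
    # Sys.which equivalent: resolve the candidate name against PATH.
    # Returns None when the tool is not installed, which is the
    # condition the Jenkins nodes need to avoid.
    return shutil.which(tool_name(env))
```

So installing qpdf anywhere on PATH, or exporting R_QPDF with an absolute path, would both satisfy the check.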
[jira] [Updated] (SPARK-18264) Build and package R vignettes
[ https://issues.apache.org/jira/browse/SPARK-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18264: - Summary: Build and package R vignettes (was: Package R vignettes) > Build and package R vignettes > - > > Key: SPARK-18264 > URL: https://issues.apache.org/jira/browse/SPARK-18264 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-package-vignettes
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646598#comment-15646598 ] Felix Cheung commented on SPARK-18332: -- The R vignettes are an R-specific document, separate from the Spark programming guide. > SparkR 2.1 QA: Programming guide update and migration guide > --- > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide. Updates > will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > Note: New features are handled in [SPARK-18330].
[jira] [Comment Edited] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646598#comment-15646598 ] Felix Cheung edited comment on SPARK-18332 at 11/8/16 5:46 AM: --- The R vignettes are an R-specific document, separate from the Spark programming guide. Perhaps that should be included in this task for future release QA clones. was (Author: felixcheung): The R vignettes are an R-specific document, separate from the Spark programming guide. > SparkR 2.1 QA: Programming guide update and migration guide > --- > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide. Updates > will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > Note: New features are handled in [SPARK-18330].
[jira] [Assigned] (SPARK-18345) Structured Streaming quick examples fails with default configuration
[ https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18345: Assignee: Apache Spark > Structured Streaming quick examples fails with default configuration > > > Key: SPARK-18345 > URL: https://issues.apache.org/jira/browse/SPARK-18345 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tsuyoshi Ozawa >Assignee: Apache Spark > > StructuredNetworkWordCount results in failure because it needs HDFS > configuration. It should use local filesystem instead of using HDFS by > default. > {quote} > Exception in thread "main" java.net.ConnectException: Call From > ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) > at org.apache.hadoop.ipc.Client.call(Client.java:1351) > at org.apache.hadoop.ipc.Client.call(Client.java:1300) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {quote} > .
[jira] [Commented] (SPARK-18345) Structured Streaming quick examples fails with default configuration
[ https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646594#comment-15646594 ] Apache Spark commented on SPARK-18345: -- User 'oza' has created a pull request for this issue: https://github.com/apache/spark/pull/15806 > Structured Streaming quick examples fails with default configuration > > > Key: SPARK-18345 > URL: https://issues.apache.org/jira/browse/SPARK-18345 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tsuyoshi Ozawa > > StructuredNetworkWordCount results in failure because it needs HDFS > configuration. It should use local filesystem instead of using HDFS by > default. > {quote} > Exception in thread "main" java.net.ConnectException: Call From > ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) > at org.apache.hadoop.ipc.Client.call(Client.java:1351) > at org.apache.hadoop.ipc.Client.call(Client.java:1300) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {quote} > .
[jira] [Assigned] (SPARK-18345) Structured Streaming quick examples fails with default configuration
[ https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18345: Assignee: (was: Apache Spark) > Structured Streaming quick examples fails with default configuration > > > Key: SPARK-18345 > URL: https://issues.apache.org/jira/browse/SPARK-18345 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tsuyoshi Ozawa > > StructuredNetworkWordCount results in failure because it needs HDFS > configuration. It should use local filesystem instead of using HDFS by > default. > {quote} > Exception in thread "main" java.net.ConnectException: Call From > ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) > at org.apache.hadoop.ipc.Client.call(Client.java:1351) > at org.apache.hadoop.ipc.Client.call(Client.java:1300) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {quote} > .
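The ConnectException above arises because the example's bare checkpoint path resolves against the configured default filesystem. A small Python sketch of that resolution (the fs.defaultFS value shown is hypothetical) illustrates why an explicit file:// URI keeps the quick-start on the local filesystem:

```python
from urllib.parse import urlparse

def resolve_checkpoint_scheme(path, default_fs="hdfs://localhost:9000"):
    """Return the filesystem scheme a checkpoint path resolves to.

    A bare path inherits the default filesystem (a hypothetical
    fs.defaultFS pointing at HDFS is shown), which is why the example
    fails without a running namenode; an explicit file:// URI stays
    on the local filesystem.
    """
    scheme = urlparse(path).scheme
    if scheme:
        return scheme
    return urlparse(default_fs).scheme

print(resolve_checkpoint_scheme("/tmp/checkpoint"))         # hdfs
print(resolve_checkpoint_scheme("file:///tmp/checkpoint"))  # file
```

The proposed fix amounts to making the example (or the default) land in the second case rather than the first.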
[jira] [Issue Comment Deleted] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18332: - Comment: was deleted (was: link https://issues.apache.org/jira/browse/SPARK-18279 https://issues.apache.org/jira/browse/SPARK-18266 ) > SparkR 2.1 QA: Programming guide update and migration guide > --- > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide. Updates > will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > Note: New features are handled in [SPARK-18330].
[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646592#comment-15646592 ] Felix Cheung commented on SPARK-18332: -- link https://issues.apache.org/jira/browse/SPARK-18279 https://issues.apache.org/jira/browse/SPARK-18266 > SparkR 2.1 QA: Programming guide update and migration guide > --- > > Key: SPARK-18332 > URL: https://issues.apache.org/jira/browse/SPARK-18332 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Before the release, we need to update the SparkR Programming Guide. Updates > will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > Note: New features are handled in [SPARK-18330].
[jira] [Commented] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column
[ https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646584#comment-15646584 ] Kapil Singh commented on SPARK-16892: - It's not for flattening Rows. It's for flattening columns. The columns themselves can be of array of array or array of map types. How would you flatten them to obtain columns of array and map types respectively? Also this is for DataFrame expressions/functions. > flatten function to get flat array (or map) column from array of array (or > array of map) column > --- > > Key: SPARK-16892 > URL: https://issues.apache.org/jira/browse/SPARK-16892 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kapil Singh > > flatten(input) > Converts input of array of array type into flat array type by inserting > elements of all element arrays into single array. Example: > input: [[1, 2, 3], [4, 5], [-1, -2, 0]] > output: [1, 2, 3, 4, 5, -1, -2, 0] > Converts input of array of map type into flat map type by inserting key-value > pairs of all element maps into single map. Example: > input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")] > output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
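The semantics Kapil describes can be sketched with plain Scala collections. This is a hedged illustration only: {{FlattenSketch}}, {{flattenArrays}}, and {{flattenMaps}} are made-up names, not existing Spark SQL functions; at the time of this issue the same behaviour would have to be wrapped in a UDF.

```scala
// Semantics sketch of the proposed flatten, using plain Scala collections
// rather than Spark SQL. The object and method names are illustrative only.
object FlattenSketch {
  // [[1, 2, 3], [4, 5], [-1, -2, 0]] => [1, 2, 3, 4, 5, -1, -2, 0]
  def flattenArrays[T](input: Seq[Seq[T]]): Seq[T] = input.flatten

  // [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")]
  //   => (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four")
  // When the same key appears in several maps, the later value wins.
  def flattenMaps[K, V](input: Seq[Map[K, V]]): Map[K, V] =
    input.foldLeft(Map.empty[K, V])(_ ++ _)
}
```

Note the design question the issue leaves open: for the map case, some tie-breaking rule is needed for duplicate keys; the sketch above takes the last occurrence.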
[jira] [Commented] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec
[ https://issues.apache.org/jira/browse/SPARK-18346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646578#comment-15646578 ] Apache Spark commented on SPARK-18346: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15805 > TRUNCATE TABLE should fail if no partition is matched for the given > non-partial partition spec > -- > > Key: SPARK-18346 > URL: https://issues.apache.org/jira/browse/SPARK-18346 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec
[ https://issues.apache.org/jira/browse/SPARK-18346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18346: Assignee: Wenchen Fan (was: Apache Spark) > TRUNCATE TABLE should fail if no partition is matched for the given > non-partial partition spec > -- > > Key: SPARK-18346 > URL: https://issues.apache.org/jira/browse/SPARK-18346 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec
[ https://issues.apache.org/jira/browse/SPARK-18346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18346: Assignee: Apache Spark (was: Wenchen Fan) > TRUNCATE TABLE should fail if no partition is matched for the given > non-partial partition spec > -- > > Key: SPARK-18346 > URL: https://issues.apache.org/jira/browse/SPARK-18346 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16894) take function for returning the first n elements of array column
[ https://issues.apache.org/jira/browse/SPARK-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646576#comment-15646576 ] Kapil Singh commented on SPARK-16894: - This is not about selecting the first n elements/columns from a Row. It's about selecting the first n elements of an array-type column. So for every record/Row the input column has some m elements, but the result column has only the first n of them. This operation is similar to Scala collections' take operation. The scope of the operation is cell values, not Rows. > take function for returning the first n elements of array column > > > Key: SPARK-16894 > URL: https://issues.apache.org/jira/browse/SPARK-16894 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kapil Singh > > take(inputArray, n) > Returns array containing first n elements of inputArray -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18346) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec
Wenchen Fan created SPARK-18346: --- Summary: TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec Key: SPARK-18346 URL: https://issues.apache.org/jira/browse/SPARK-18346 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16894) take function for returning the first n elements of array column
[ https://issues.apache.org/jira/browse/SPARK-16894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646563#comment-15646563 ] Kapil Singh commented on SPARK-16894: - The use case is similar to Scala collections' take method. For example, one of the input columns is an array containing a product category hierarchy, e.g. [apparel, men, t-shirt, printed, ...], and I'm only interested in the first n (say 3) categories. I want a function/expression on DataFrame so that I can get an output column containing only the first 3 categories, e.g. [apparel, men, t-shirt] > take function for returning the first n elements of array column > > > Key: SPARK-16894 > URL: https://issues.apache.org/jira/browse/SPARK-16894 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kapil Singh > > take(inputArray, n) > Returns array containing first n elements of inputArray -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
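The requested behaviour can be sketched with plain Scala collections. This is a hedged illustration only: {{takeFirst}} is an illustrative name, not an existing Spark SQL function; at the time of this issue the equivalent column operation would have to be a UDF.

```scala
// Semantics sketch of the proposed take(inputArray, n), using plain Scala
// collections; TakeSketch/takeFirst are illustrative names only.
object TakeSketch {
  // Returns the first n elements of inputArray (or the whole array if it
  // has fewer than n elements), matching Scala collections' take.
  def takeFirst[T](inputArray: Seq[T], n: Int): Seq[T] = inputArray.take(n)
}
```

For the category-hierarchy use case above, {{takeFirst(Seq("apparel", "men", "t-shirt", "printed"), 3)}} yields {{Seq("apparel", "men", "t-shirt")}}.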
[jira] [Commented] (SPARK-18345) Structured Streaming quick examples fails with default configuration
[ https://issues.apache.org/jira/browse/SPARK-18345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646551#comment-15646551 ] Tsuyoshi Ozawa commented on SPARK-18345: I would like to tackle this problem. I fixed it locally, so will send PR soon. > Structured Streaming quick examples fails with default configuration > > > Key: SPARK-18345 > URL: https://issues.apache.org/jira/browse/SPARK-18345 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tsuyoshi Ozawa > > StructuredNetworkWordCount results in failure because it needs HDFS > configuration. It should use local filesystem instead of using HDFS by > default. > {quote} > Exception in thread "main" java.net.ConnectException: Call From > ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) > at org.apache.hadoop.ipc.Client.call(Client.java:1351) > at org.apache.hadoop.ipc.Client.call(Client.java:1300) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71) > at > org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {quote} > . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
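Until the example itself is changed, the dependency on a running HDFS can be avoided by pinning the streaming checkpoint to the local filesystem. The sketch below is a hedged workaround, not the actual fix: the path is illustrative, and {{wordCounts}} stands for the streaming DataFrame built earlier in the quick-start example.

```scala
// Hedged workaround sketch: point the checkpoint at the local filesystem so
// starting the query does not contact an HDFS NameNode at localhost:9000.
// `wordCounts` is the streaming DataFrame from the quick-start example;
// the checkpoint path below is illustrative.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "file:///tmp/structured-wordcount-checkpoint")
  .start()

query.awaitTermination()
```

The same location can also be supplied globally via {{spark.sql.streaming.checkpointLocation}} instead of per query.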
[jira] [Created] (SPARK-18345) Structured Streaming quick examples fails with default configuration
Tsuyoshi Ozawa created SPARK-18345: -- Summary: Structured Streaming quick examples fails with default configuration Key: SPARK-18345 URL: https://issues.apache.org/jira/browse/SPARK-18345 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.1 Reporter: Tsuyoshi Ozawa StructuredNetworkWordCount results in failure because it needs HDFS configuration. It should use local filesystem instead of using HDFS by default. {quote} Exception in thread "main" java.net.ConnectException: Call From ozamac-2.local/192.168.33.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) at org.apache.hadoop.ipc.Client.call(Client.java:1351) at org.apache.hadoop.ipc.Client.call(Client.java:1300) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397) at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:225) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:260) at org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount$.main(StructuredNetworkWordCount.scala:71) at org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount.main(StructuredNetworkWordCount.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18344) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec
Wenchen Fan created SPARK-18344: --- Summary: TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec Key: SPARK-18344 URL: https://issues.apache.org/jira/browse/SPARK-18344 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477 ] Song Jun edited comment on SPARK-18055 at 11/8/16 5:04 AM: --- [~davies] I can't reproduce it on the master branch or on Databricks (Spark 2.0.1-db1 (Scala 2.11)). My code:
{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}
val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS
ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}
There is no exception; has this been fixed, or is my code wrong? I also tested with the test-jar_2.11-1.0.jar in spark-shell ({{spark-shell --jars test-jar_2.11-1.0.jar}}) and there was no exception.

was (Author: windpiger): [~davies] I can't reproduce it on the master branch or on Databricks (Spark 2.0.1-db1 (Scala 2.11)). My code:
{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}
val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS
ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}
There is no exception; has this been fixed, or is my code wrong? Or must I test it with the customized jar?
> Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
Luke Miner created SPARK-18343: -- Summary: FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write Key: SPARK-18343 URL: https://issues.apache.org/jira/browse/SPARK-18343 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1 Environment: Spark 2.0.1 Hadoop 2.7.1 Mesos 1.0.1 Ubuntu 14.04 Reporter: Luke Miner I have a driver program where I read data in from Cassandra using Spark, perform some operations, and then write out to JSON on S3. The program runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1. However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and spark-cassandra-connector 2.0.0-M3, the program completes in the sense that all the expected files are written to S3, but the program never terminates. I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. In both cases I use the default output committer. From the thread dump (included below) it seems like it could be waiting on `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`. Code snippet: {code} // get MongoDB oplog operations val operations = sc.cassandraTable[JsonOperation](keyspace, namespace) .where("ts >= ? 
AND ts < ?", minTimestamp, maxTimestamp) // replay oplog operations into documents val documents = operations .spanBy(op => op.id) .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) } .filter { case (id, result) => result.isInstanceOf[Document] } .map { case (id, document) => MergedDocument(id = id, document = document .asInstanceOf[Document]) } // write documents to json on s3 documents .map(document => document.toJson) .coalesce(partitions) .saveAsTextFile(path, classOf[GzipCodec]) sc.stop() {code} Thread dump on the driver: {code} 60 context-cleaner-periodic-gc TIMED_WAITING 46 dag-scheduler-event-loopWAITING 4389DestroyJavaVM RUNNABLE 12 dispatcher-event-loop-0 WAITING 13 dispatcher-event-loop-1 WAITING 14 dispatcher-event-loop-2 WAITING 15 dispatcher-event-loop-3 WAITING 47 driver-revive-threadTIMED_WAITING 3 Finalizer WAITING 82 ForkJoinPool-1-worker-17WAITING 43 heartbeat-receiver-event-loop-threadTIMED_WAITING 93 java-sdk-http-connection-reaper TIMED_WAITING 4387java-sdk-progress-listener-callback-thread WAITING 25 map-output-dispatcher-0 WAITING 26 map-output-dispatcher-1 WAITING 27 map-output-dispatcher-2 WAITING 28 map-output-dispatcher-3 WAITING 29 map-output-dispatcher-4 WAITING 30 map-output-dispatcher-5 WAITING 31 map-output-dispatcher-6 WAITING 32 map-output-dispatcher-7 WAITING 48 MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE 44 netty-rpc-env-timeout TIMED_WAITING 92 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner WAITING 62 pool-19-thread-1TIMED_WAITING 2 Reference Handler WAITING 61 Scheduler-1112394071TIMED_WAITING 20 shuffle-server-0RUNNABLE 55 shuffle-server-0RUNNABLE 21 shuffle-server-1RUNNABLE 56 shuffle-server-1RUNNABLE 22 shuffle-server-2RUNNABLE 57 shuffle-server-2RUNNABLE 23 shuffle-server-3RUNNABLE 58 shuffle-server-3RUNNABLE 4 Signal Dispatcher RUNNABLE 59 Spark Context Cleaner TIMED_WAITING 9 SparkListenerBusWAITING 35 SparkUI-35-selector-ServerConnectorManager@651d3734/0 RUNNABLE 36 
SparkUI-36-acceptor-0@467924cb-ServerConnector@3b5eaf92{HTTP/1.1}{0.0.0.0:4040} RUNNABLE 37 SparkUI-37-selector-ServerConnectorManager@651d3734/1 RUNNABLE 38 SparkUI-38 TIMED_WAITING 39 SparkUI-39 TIMED_WAITING 40 SparkUI-40 TIMED_WAITING 41 SparkUI-41 RUNNABLE 42 SparkUI-42 TIMED_WAITING 438 task-result-getter-0WAITING 450 task-result-getter-1WAITING 489 task-result-getter-2WAITING 492 task-result-getter-3WAITING 75 threadDeathWatcher-2-1 TIMED_WAITING 45 Timer-0 WAITING {code} Thread dump on the executors. It's the same on all of them: {code} 24 dispatcher-event-loop-0 WAITING 25 dispatcher-event-loop-1 WAITING 26 dispatcher-event-loop-2 RUNNABLE 27 dispatcher-event-loop-3 WAITING 39 driver-heartbeater TIMED_WAITING 3 Finalizer WAITING 58 java-sdk-http-connection-reaper TIMED_WAITING 75 java-sdk-progress-listener-callback-thread WAITING 1 mainTIMED_WAITING 33 netty-rpc-env-timeout TIMED_WAITING 55 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner WAITING 59 pool-17-thread-1TIMED_WAITING 2 Reference Handler WAITING 28
[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477 ] Song Jun edited comment on SPARK-18055 at 11/8/16 4:51 AM: --- [~davies] I can't reproduce it on the master branch or on Databricks (Spark 2.0.1-db1 (Scala 2.11)). My code:
{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}
val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS
ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}
There is no exception; has this been fixed, or is my code wrong? Or must I test it with the customized jar?

was (Author: windpiger): [~davies] I can't reproduce it on the master branch or on Databricks (Spark 2.0.1-db1 (Scala 2.11)). My code:
{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}
val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS
ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}
There is no exception; has this been fixed, or is my code wrong? 
> Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail
[ https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646484#comment-15646484 ] Apache Spark commented on SPARK-18342: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15804 > HDFSBackedStateStore can fail to rename files causing snapshotting and > recovery to fail > --- > > Key: SPARK-18342 > URL: https://issues.apache.org/jira/browse/SPARK-18342 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz >Priority: Critical > > The HDFSBackedStateStore renames temporary files to delta files as it commits > new versions. It however doesn't check whether the rename succeeded. If the > rename fails, then recovery will not be possible. It should fail during the > rename stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
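The fix idea can be sketched without Hadoop at hand: Hadoop's FileSystem.rename reports failure through its boolean return value rather than by throwing, so the commit path has to check that value explicitly. The helper below is an illustrative sketch, not the actual HDFSBackedStateStore code; {{renameSucceeded}} stands in for the boolean that FileSystem.rename would return.

```scala
import java.io.IOException

// Illustrative sketch of the missing check: fail the commit as soon as the
// rename of the temp file to the delta file is reported unsuccessful.
// `renameSucceeded` stands in for the boolean result of FileSystem.rename.
object RenameCheckSketch {
  def commitRename(renameSucceeded: Boolean, src: String, dst: String): Unit = {
    if (!renameSucceeded) {
      throw new IOException(
        s"Failed to rename temp file $src to $dst while committing the state version")
    }
  }
}
```

Failing here, at commit time, surfaces the problem immediately instead of leaving a missing delta file to break recovery later.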
[jira] [Assigned] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail
[ https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18342: Assignee: (was: Apache Spark) > HDFSBackedStateStore can fail to rename files causing snapshotting and > recovery to fail > --- > > Key: SPARK-18342 > URL: https://issues.apache.org/jira/browse/SPARK-18342 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz >Priority: Critical > > The HDFSBackedStateStore renames temporary files to delta files as it commits > new versions. It however doesn't check whether the rename succeeded. If the > rename fails, then recovery will not be possible. It should fail during the > rename stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail
[ https://issues.apache.org/jira/browse/SPARK-18342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18342: Assignee: Apache Spark > HDFSBackedStateStore can fail to rename files causing snapshotting and > recovery to fail > --- > > Key: SPARK-18342 > URL: https://issues.apache.org/jira/browse/SPARK-18342 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz >Assignee: Apache Spark >Priority: Critical > > The HDFSBackedStateStore renames temporary files to delta files as it commits > new versions. It however doesn't check whether the rename succeeded. If the > rename fails, then recovery will not be possible. It should fail during the > rename stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477 ] Song Jun commented on SPARK-18055: -- [~davies] I can't reproduce it on the master branch or on Databricks (Spark 2.0.1-db1 (Scala 2.11)). My code:
{code}
import scala.collection.mutable.ArrayBuffer

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) {
    arr += (i + j).toString
  }
  val mydata = new MyData(i.toString, arr)
  myarr += mydata
}
val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS
ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)
{code}
There is no exception; has this been fixed, or is my code wrong? > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > 
org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
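The shape of the reproduction above can be checked without a Spark cluster; this is a plain-Scala sketch of the same data and flatMap (the Dataset-specific classloader failure itself only manifests with a custom jar on a real SparkSession):

```scala
// Plain-Scala mirror of Song Jun's reproduction: build 11 records of
// MyData(id, arr) and flatMap over the arr field. With Spark, the same
// shape is rdd.toDS followed by ds.flatMap(_.arr).
case class MyData(id: String, arr: Seq[String])

val myarr = (20 to 30).map { i =>
  MyData(i.toString, (1 to 10).map(j => (i + j).toString))
}
val flattened = myarr.flatMap(_.arr)
println(flattened.size) // 11 records x 10 elements = 110
println(flattened.head) // "21" (i = 20, j = 1)
```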
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646453#comment-15646453 ] Harish commented on SPARK-17463: I was able to figure out the issue, its not related to this bug. > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
[jira] [Comment Edited] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column
[ https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646443#comment-15646443 ] Jayadevan M edited comment on SPARK-16892 at 11/8/16 4:21 AM: -- I hope you can use flatMap for this scenario. I don't think any new function is required for this. For example var array=Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0)); var rdd = sc.parallelize(array); rdd.flatMap(x=>x).collect(); was (Author: jayadevan.m): I hope you can use flatMap for this scenario. For example var array=Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0)); var rdd = sc.parallelize(array); rdd.flatMap(x=>x).collect(); > flatten function to get flat array (or map) column from array of array (or > array of map) column > --- > > Key: SPARK-16892 > URL: https://issues.apache.org/jira/browse/SPARK-16892 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kapil Singh > > flatten(input) > Converts input of array of array type into flat array type by inserting > elements of all element arrays into single array. Example: > input: [[1, 2, 3], [4, 5], [-1, -2, 0]] > output: [1, 2, 3, 4, 5, -1, -2, 0] > Converts input of array of map type into flat map type by inserting key-value > pairs of all element maps into single map. Example: > input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")] > output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column
[ https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646443#comment-15646443 ] Jayadevan M commented on SPARK-16892: - I hope you can use flatMap for this scenario. For example var array=Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0)); var rdd = sc.parallelize(array); rdd.flatMap(x=>x).collect(); > flatten function to get flat array (or map) column from array of array (or > array of map) column > --- > > Key: SPARK-16892 > URL: https://issues.apache.org/jira/browse/SPARK-16892 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kapil Singh > > flatten(input) > Converts input of array of array type into flat array type by inserting > elements of all element arrays into single array. Example: > input: [[1, 2, 3], [4, 5], [-1, -2, 0]] > output: [1, 2, 3, 4, 5, -1, -2, 0] > Converts input of array of map type into flat map type by inserting key-value > pairs of all element maps into single map. Example: > input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")] > output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
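The flatMap suggestion above has a direct collections analogue; a minimal sketch of the requested flatten semantics on plain Scala collections (this illustrates the expected output, not the proposed SQL function itself):

```scala
// Array-of-array case from the issue description
val nested = Array(Array(1, 2, 3), Array(4, 5), Array(-1, -2, 0))
val flat = nested.flatten
println(flat.mkString(", ")) // 1, 2, 3, 4, 5, -1, -2, 0

// Array-of-map case: merge the key-value pairs of all element maps
// into a single map
val maps = Seq(Map(1 -> "one", 2 -> "two"), Map(0 -> "zero"), Map(4 -> "four"))
val merged = maps.reduce(_ ++ _)
println(merged.size) // 4
```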
[jira] [Created] (SPARK-18342) HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail
Burak Yavuz created SPARK-18342: --- Summary: HDFSBackedStateStore can fail to rename files causing snapshotting and recovery to fail Key: SPARK-18342 URL: https://issues.apache.org/jira/browse/SPARK-18342 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.1 Reporter: Burak Yavuz Priority: Critical The HDFSBackedStateStore renames temporary files to delta files as it commits new versions. It however doesn't check whether the rename succeeded. If the rename fails, then recovery will not be possible. It should fail during the rename stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
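A minimal sketch of the proposed fix, using `java.io.File` as a stand-in for the Hadoop FileSystem rename (the method name `commitVersion` and the file layout are illustrative, not the actual HDFSBackedStateStore code):

```scala
import java.io.File
import java.nio.file.Files

// Fail fast when the temp-to-delta rename does not succeed, instead of
// silently continuing with a missing delta file.
def commitVersion(tempFile: File, deltaFile: File): Unit =
  if (!tempFile.renameTo(deltaFile))
    throw new IllegalStateException(
      s"Failed to rename $tempFile to $deltaFile; aborting commit")

val dir = Files.createTempDirectory("statestore").toFile
val tmp = new File(dir, "temp-1.delta")
Files.write(tmp.toPath, "delta-bytes".getBytes)
commitVersion(tmp, new File(dir, "1.delta"))
println(new File(dir, "1.delta").exists()) // true
```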
[jira] [Created] (SPARK-18341) Eliminate use of SingularMatrixException in WeightedLeastSquares logic
Joseph K. Bradley created SPARK-18341: - Summary: Eliminate use of SingularMatrixException in WeightedLeastSquares logic Key: SPARK-18341 URL: https://issues.apache.org/jira/browse/SPARK-18341 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor WeightedLeastSquares uses an Exception to implement fallback logic for which solver to use: [https://github.com/apache/spark/blob/6f3697136aa68dc39d3ce42f43a7af554d2a3bf9/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala#L258] We should use an error code instead of an exception. * Note the error code should be internal, not a public API. * We may be able to eliminate the SingularMatrixException class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
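One way to express the proposed change, sketched with hypothetical names (`SolverResult`, `Singular`, and `solveNormal` are not the actual WeightedLeastSquares internals):

```scala
// Internal result type replaces throwing SingularMatrixException
sealed trait SolverResult
final case class Solved(coefficients: Array[Double]) extends SolverResult
case object Singular extends SolverResult

// Stand-in for the Cholesky solve on the normal equations
def solveNormal(wellConditioned: Boolean): SolverResult =
  if (wellConditioned) Solved(Array(1.0, 2.0)) else Singular

// Fallback becomes a pattern match rather than a try/catch
def fit(wellConditioned: Boolean): Array[Double] =
  solveNormal(wellConditioned) match {
    case Solved(c) => c
    case Singular  => Array(0.0) // switch to the quasi-Newton path here
  }

println(fit(true).mkString(","))  // 1.0,2.0
println(fit(false).mkString(",")) // 0.0
```

The result type stays `private[ml]` in spirit: it is an internal code path marker, not a public API.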
[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646333#comment-15646333 ] Song Jun commented on SPARK-18298: -- [~WangTao] I tested it, and the UI should show the user's local time, as 1.6.x did, so I think this is a bug; I have posted a pull request. > HistoryServer use GMT time all time > --- > > Key: SPARK-18298 > URL: https://issues.apache.org/jira/browse/SPARK-18298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1 > Environment: suse 11.3 with CST time >Reporter: Tao Wang > > When I started HistoryServer for reading event logs, the timestamps read > will be parsed using the local timezone like "CST" (confirmed via debug). > But the time related columns like "Started"/"Completed"/"Last Updated" in the > History Server UI use "GMT" time, which is 8 hours earlier than "CST". > {quote} > App IDApp NameStarted Completed DurationSpark > User Last UpdatedEvent Log > local-1478225166651 Spark shell 2016-11-04 02:06:06 2016-11-07 > 01:33:30 71.5 h root2016-11-07 01:33:30 > {quote} > I've checked the REST api and found the result like: > {color:red} > [ { > "id" : "local-1478225166651", > "name" : "Spark shell", > "attempts" : [ { > "startTime" : "2016-11-04T02:06:06.020GMT", > "endTime" : "2016-11-07T01:33:30.265GMT", > "lastUpdated" : "2016-11-07T01:33:30.000GMT", > "duration" : 257244245, > "sparkUser" : "root", > "completed" : true, > "lastUpdatedEpoch" : 147848241, > "endTimeEpoch" : 1478482410265, > "startTimeEpoch" : 1478225166020 > } ] > }, { > "id" : "local-1478224925869", > "name" : "Spark Pi", > "attempts" : [ { > "startTime" : "2016-11-04T02:02:02.133GMT", > "endTime" : "2016-11-04T02:02:07.468GMT", > "lastUpdated" : "2016-11-04T02:02:07.000GMT", > "duration" : 5335, > "sparkUser" : "root", > "completed" : true, > ... > {color} > So maybe the change happened in transferring between server and browser? I > have no idea where to go from this point. 
> Hope guys can offer some help, or just fix it if it's easy? :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646270#comment-15646270 ] Tao Wang commented on SPARK-18298: -- [~ajbozarth] Thanks for your attention. I think it's the issue 1, in which the time shown with timezone GMT all time whatever the server timezone is. For example, if the timestamp stored is `1478573043680`, we expected the time shown in HistoryServer looks like: "2016-11-08 10:44:03"(which is using CST time same as the server timezone), but not "2016-11-08 02:44:03"(in GMT time, ignoring timezone the server uses). and in my opnion, the time should be shown as the server timezone as people who run the server will use local timezone than the GMT one. > HistoryServer use GMT time all time > --- > > Key: SPARK-18298 > URL: https://issues.apache.org/jira/browse/SPARK-18298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1 > Environment: suse 11.3 with CST time >Reporter: Tao Wang > > When I started HistoryServer for reading event logs, the timestamp readed > will be parsed using local timezone like "CST"(confirmed via debug). > But the time related columns like "Started"/"Completed"/"Last Updated" in > History Server UI using "GMT" time, which is 8 hours earlier than "CST". 
> {quote} > App IDApp NameStarted Completed DurationSpark > User Last UpdatedEvent Log > local-1478225166651 Spark shell 2016-11-04 02:06:06 2016-11-07 > 01:33:30 71.5 h root2016-11-07 01:33:30 > {quote} > I've checked the REST api and found the result like: > {color:red} > [ { > "id" : "local-1478225166651", > "name" : "Spark shell", > "attempts" : [ { > "startTime" : "2016-11-04T02:06:06.020GMT", > "endTime" : "2016-11-07T01:33:30.265GMT", > "lastUpdated" : "2016-11-07T01:33:30.000GMT", > "duration" : 257244245, > "sparkUser" : "root", > "completed" : true, > "lastUpdatedEpoch" : 147848241, > "endTimeEpoch" : 1478482410265, > "startTimeEpoch" : 1478225166020 > } ] > }, { > "id" : "local-1478224925869", > "name" : "Spark Pi", > "attempts" : [ { > "startTime" : "2016-11-04T02:02:02.133GMT", > "endTime" : "2016-11-04T02:02:07.468GMT", > "lastUpdated" : "2016-11-04T02:02:07.000GMT", > "duration" : 5335, > "sparkUser" : "root", > "completed" : true, > ... > {color} > So maybe the change happened in transferring between server and browser? I > have no idea where to go from this point. > Hope guys can offer some help, or just fix it if it's easy? :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
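The example timestamp in the comment above can be checked directly with java.time (Asia/Shanghai is used here as a stand-in for the server's CST zone):

```scala
import java.time.Instant
import java.time.ZoneId
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val instant = Instant.ofEpochMilli(1478573043680L)

// Rendered in the server's zone (CST, UTC+8) vs. GMT: the 8-hour gap
// described in the report.
val inCst = instant.atZone(ZoneId.of("Asia/Shanghai")).format(fmt)
val inGmt = instant.atZone(ZoneId.of("GMT")).format(fmt)
println(inCst) // 2016-11-08 10:44:03
println(inGmt) // 2016-11-08 02:44:03
```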
[jira] [Resolved] (SPARK-16575) partition calculation mismatch with sc.binaryFiles
[ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16575. - Resolution: Fixed Assignee: Tarun Kumar Fix Version/s: 2.1.0 > partition calculation mismatch with sc.binaryFiles > -- > > Key: SPARK-16575 > URL: https://issues.apache.org/jira/browse/SPARK-16575 > Project: Spark > Issue Type: Bug > Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell >Affects Versions: 1.6.1, 1.6.2 >Reporter: Suhas >Assignee: Tarun Kumar >Priority: Critical > Fix For: 2.1.0 > > > sc.binaryFiles is always creating an RDD with number of partitions as 2. > Steps to reproduce: (Tested this bug on databricks community edition) > 1. Try to create an RDD using sc.binaryFiles. In this example, airlines > folder has 1922 files. > Ex: {noformat}val binaryRDD = > sc.binaryFiles("/databricks-datasets/airlines/*"){noformat} > 2. check the number of partitions of the above RDD > - binaryRDD.partitions.size = 2. (expected value is more than 2) > 3. If the RDD is created using sc.textFile, then the number of partitions are > 1921. > 4. Using the same sc.binaryFiles will create 1921 partitions in Spark 1.5.1 > version. > For explanation with screenshot, please look at the link below, > http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18263) Configuring spark.kryo.registrator programmatically doesn't take effect
[ https://issues.apache.org/jira/browse/SPARK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646224#comment-15646224 ] inred commented on SPARK-18263: --- config at builder take effect val spark = SparkSession .builder .config("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator") .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > Configuring spark.kryo.registrator programmatically doesn't take effect > --- > > Key: SPARK-18263 > URL: https://issues.apache.org/jira/browse/SPARK-18263 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.6 > scala-2.11.8 >Reporter: inred > > it run ok with spark-shell --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf > spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \ > but in IDE > val spark = SparkSession.builder.master("local[*]").appName("Anno > BDG").getOrCreate() > spark.conf.set("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > spark.conf.set("spark.kryo.registrator", > "org.bdgenomics.adam.serialization.ADAMKryoRegistrator") > it reports the following error: > java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, > "oldPosition": null, "end": 61758727, "mapq": 25, "readName": > "NB501244AR:119:HJY3WBGXY:2:2:6137:19359", "sequence": > "TACTGAGACTTATCAGAATTTCAGGCTAAAGCAACC", "qual": > "AAEA", "cigar": "40M", "oldCigar": null, > "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, > "properPair": false, "readMapped": true, "mateMapped": false, > "failedVendorQualityChecks": false, "duplicateRead": false, > "readNegativeStrand": false, "mateNegativeStrand": false, "primaryAlignment": 
> true, "secondaryAlignment": false, "supplementaryAlignment": false, > "mismatchingPositions": "40", "origQual": null, "attributes": > "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": > null, "recordGroupSample": null, "mateAlignmentStart": null, > "mateAlignmentEnd": null, "mateContigName": null, "inferredInsertSize": null}) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 9) > had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr1", "start": 10001, > "oldPosition": null, "end": 10041, "mapq": 0, "readName": > "NB501244AR:119:HJY3WBGXY:3:11508:7857:8792", "sequence": > "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC", "qual": > "///E6EEEAEEE/AEAAA/A", "cigar": "40M", "oldCigar": null, > "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, > 
"properPair": false, "readMapped": true, "mateMapped": false, > "failedVendorQualityChecks": false, "duplicateRead": false, > "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": > true, "secondaryAlignment": false, "supplementaryAlignment": false, > "mismatchingPositions": "40", "origQual": null, "attributes": > "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:594", "recordGroupName": null, > "recordGroupSample": null, "mateAlignmentStart": null, "mateAlignmentEnd": > null, "mateContigName": null, "inferredInsertSize": null}); not retrying > 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 4.0 in stage 2.0 (TID 13) > had a not serializable
[jira] [Resolved] (SPARK-18263) Configuring spark.kryo.registrator programmatically doesn't take effect
[ https://issues.apache.org/jira/browse/SPARK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] inred resolved SPARK-18263. --- Resolution: Fixed > Configuring spark.kryo.registrator programmatically doesn't take effect > --- > > Key: SPARK-18263 > URL: https://issues.apache.org/jira/browse/SPARK-18263 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.6 > scala-2.11.8 >Reporter: inred > > it run ok with spark-shell --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf > spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \ > but in IDE > val spark = SparkSession.builder.master("local[*]").appName("Anno > BDG").getOrCreate() > spark.conf.set("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > spark.conf.set("spark.kryo.registrator", > "org.bdgenomics.adam.serialization.ADAMKryoRegistrator") > it reports the following error: > java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, > "oldPosition": null, "end": 61758727, "mapq": 25, "readName": > "NB501244AR:119:HJY3WBGXY:2:2:6137:19359", "sequence": > "TACTGAGACTTATCAGAATTTCAGGCTAAAGCAACC", "qual": > "AAEA", "cigar": "40M", "oldCigar": null, > "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, > "properPair": false, "readMapped": true, "mateMapped": false, > "failedVendorQualityChecks": false, "duplicateRead": false, > "readNegativeStrand": false, "mateNegativeStrand": false, "primaryAlignment": > true, "secondaryAlignment": false, "supplementaryAlignment": false, > "mismatchingPositions": "40", "origQual": null, "attributes": > "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": > null, "recordGroupSample": 
null, "mateAlignmentStart": null, > "mateAlignmentEnd": null, "mateContigName": null, "inferredInsertSize": null}) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 9) > had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr1", "start": 10001, > "oldPosition": null, "end": 10041, "mapq": 0, "readName": > "NB501244AR:119:HJY3WBGXY:3:11508:7857:8792", "sequence": > "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC", "qual": > "///E6EEEAEEE/AEAAA/A", "cigar": "40M", "oldCigar": null, > "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, > "properPair": false, "readMapped": true, "mateMapped": false, > "failedVendorQualityChecks": false, "duplicateRead": false, > "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": > true, "secondaryAlignment": false, 
"supplementaryAlignment": false, > "mismatchingPositions": "40", "origQual": null, "attributes": > "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:594", "recordGroupName": null, > "recordGroupSample": null, "mateAlignmentStart": null, "mateAlignmentEnd": > null, "mateContigName": null, "inferredInsertSize": null}); not retrying > 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 4.0 in stage 2.0 (TID 13) > had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, > "oldPosition": null, "end": 61758727,
[jira] [Closed] (SPARK-18263) Configuring spark.kryo.registrator programmatically doesn't take effect
[ https://issues.apache.org/jira/browse/SPARK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] inred closed SPARK-18263. - > Configuring spark.kryo.registrator programmatically doesn't take effect > --- > > Key: SPARK-18263 > URL: https://issues.apache.org/jira/browse/SPARK-18263 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: spark-2.0.1-bin-hadoop2.6 > scala-2.11.8 >Reporter: inred > > it run ok with spark-shell --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf > spark.kryo.registrator=org.bdgenomics.adam.serialization.ADAMKryoRegistrator \ > but in IDE > val spark = SparkSession.builder.master("local[*]").appName("Anno > BDG").getOrCreate() > spark.conf.set("spark.serializer", > "org.apache.spark.serializer.KryoSerializer") > spark.conf.set("spark.kryo.registrator", > "org.bdgenomics.adam.serialization.ADAMKryoRegistrator") > it reports the following error: > java.io.NotSerializableException: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, > "oldPosition": null, "end": 61758727, "mapq": 25, "readName": > "NB501244AR:119:HJY3WBGXY:2:2:6137:19359", "sequence": > "TACTGAGACTTATCAGAATTTCAGGCTAAAGCAACC", "qual": > "AAEA", "cigar": "40M", "oldCigar": null, > "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, > "properPair": false, "readMapped": true, "mateMapped": false, > "failedVendorQualityChecks": false, "duplicateRead": false, > "readNegativeStrand": false, "mateNegativeStrand": false, "primaryAlignment": > true, "secondaryAlignment": false, "supplementaryAlignment": false, > "mismatchingPositions": "40", "origQual": null, "attributes": > "XT:A:U\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX1:i:0\tX0:i:1", "recordGroupName": > null, "recordGroupSample": null, 
"mateAlignmentStart": null, > "mateAlignmentEnd": null, "mateContigName": null, "inferredInsertSize": null}) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > at > org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:135) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:150) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 9) > had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr1", "start": 10001, > "oldPosition": null, "end": 10041, "mapq": 0, "readName": > "NB501244AR:119:HJY3WBGXY:3:11508:7857:8792", "sequence": > "AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC", "qual": > "///E6EEEAEEE/AEAAA/A", "cigar": "40M", "oldCigar": null, > "basesTrimmedFromStart": 0, "basesTrimmedFromEnd": 0, "readPaired": false, > "properPair": false, "readMapped": true, "mateMapped": false, > "failedVendorQualityChecks": false, "duplicateRead": false, > "readNegativeStrand": true, "mateNegativeStrand": false, "primaryAlignment": > true, "secondaryAlignment": false, 
"supplementaryAlignment": false, > "mismatchingPositions": "40", "origQual": null, "attributes": > "XT:A:R\tXO:i:0\tXM:i:0\tNM:i:0\tXG:i:0\tX0:i:594", "recordGroupName": null, > "recordGroupSample": null, "mateAlignmentStart": null, "mateAlignmentEnd": > null, "mateContigName": null, "inferredInsertSize": null}); not retrying > 2016-11-04 10:30:56 ERROR TaskSetManager:70 - Task 4.0 in stage 2.0 (TID 13) > had a not serializable result: org.bdgenomics.formats.avro.AlignmentRecord > Serialization stack: > object not serializable (class: org.bdgenomics.formats.avro.AlignmentRecord, > value: {"readInFragment": 0, "contigName": "chr10", "start": 61758687, > "oldPosition": null, "end": 61758727, "mapq": 25, "readName": >
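A likely explanation for the report above: serializer settings are read once when the SparkContext starts, so `spark.conf.set("spark.kryo.registrator", ...)` after `getOrCreate()` cannot take effect. The toy Python model below (invented `Session` class, not Spark itself) illustrates a setting captured at construction time:

```python
# Toy model (invented names, not Spark itself): a service that reads its
# serializer setting once, at construction time.

class Session:
    def __init__(self, conf):
        # Captured when the session starts, mirroring how SparkContext
        # reads spark.serializer at startup.
        self._serializer = conf.get("spark.serializer", "java")
        self.conf = conf

    def serializer(self):
        return self._serializer

# Setting the option *before* construction takes effect...
s1 = Session({"spark.serializer": "kryo"})
assert s1.serializer() == "kryo"

# ...but mutating the config *after* construction (the analogue of calling
# spark.conf.set(...) after getOrCreate()) does not.
s2 = Session({})
s2.conf["spark.serializer"] = "kryo"
assert s2.serializer() == "java"
```

In Spark itself, the working equivalents are the `--conf` flags from the `spark-shell` invocation quoted above, or supplying the options via `SparkSession.builder.config(...)` before calling `getOrCreate()`.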
[jira] [Updated] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming
[ https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18339: - Target Version/s: 2.2.0 > Don't push down current_timestamp for filters in StructuredStreaming > > > Key: SPARK-18339 > URL: https://issues.apache.org/jira/browse/SPARK-18339 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz > Labels: correctness > > For the following workflow: > 1. I have a column called time which is at minute level precision in a > Streaming DataFrame > 2. I want to perform groupBy time, count > 3. Then I want my MemorySink to only have the last 30 minutes of counts and I > perform this by > {code} > .where('time >= current_timestamp().cast("long") - 30 * 60) > {code} > what happens is that the `filter` gets pushed down before the aggregation, > and the filter happens on the source data for the aggregation instead of the > result of the aggregation (where I actually want to filter). > I guess the main issue here is that `current_timestamp` is non-deterministic > in the streaming context and shouldn't be pushed down the filter. > Does this require us to store the `current_timestamp` for each trigger of the > streaming job, that is something to discuss. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
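The behavioral difference described above can be simulated outside Spark. The pure-Python toy model below (invented names, not Spark's optimizer) runs a stateful count over two triggers and compares filtering the aggregation result against filtering the source rows:

```python
from collections import Counter

def recent(t, now, window=30):
    # keep times within the last `window` minutes of `now` (toy HH:MM arithmetic)
    to_min = lambda s: int(s[:2]) * 60 + int(s[3:])
    return to_min(now) - to_min(t) <= window

def run(batches, nows, push_down):
    state = Counter()                       # stateful streaming aggregation
    out = {}
    for rows, now in zip(batches, nows):
        if push_down:                       # filter applied to *source* rows
            rows = [t for t in rows if recent(t, now)]
        state.update(rows)                  # groupBy(time).count() across triggers
        out = dict(state)
        if not push_down:                   # filter applied to the *result*
            out = {t: c for t, c in out.items() if recent(t, now)}
    return out

batches = [["10:00", "10:05"], ["11:00"]]   # two triggers of input
nows = ["10:10", "11:05"]                   # current_timestamp at each trigger

# Intended semantics: the sink only keeps the last 30 minutes of counts.
assert run(batches, nows, push_down=False) == {"11:00": 1}

# With the filter pushed below the aggregation, rows admitted at the first
# trigger stay in the result forever.
assert run(batches, nows, push_down=True) == {"10:00": 1, "10:05": 1, "11:00": 1}
```

Because `current_timestamp` advances between triggers, the pushed-down filter admits each row once at ingest and never re-evaluates it against the aggregated result, which is exactly the stale-counts symptom described in the issue.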
[jira] [Resolved] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs
[ https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18217. - Resolution: Fixed Fix Version/s: 2.1.0 > Disallow creating permanent views based on temporary views or UDFs > -- > > Key: SPARK-18217 > URL: https://issues.apache.org/jira/browse/SPARK-18217 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Xiao Li > Fix For: 2.1.0 > > > See the discussion in the parent ticket SPARK-18209. It doesn't really make > sense to create permanent views based on temporary views or UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16609) Single function for parsing timestamps/dates
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16609: Target Version/s: 2.2.0 (was: 2.1.0) > Single function for parsing timestamps/dates > > > Key: SPARK-16609 > URL: https://issues.apache.org/jira/browse/SPARK-16609 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Reynold Xin > > Today, if you want to parse a date or timestamp, you have to use the unix > time function and then cast to a timestamp. It's a little odd there isn't a > single function that does both. I propose we add > {code} > to_date(<string>, <format>) / to_timestamp(<string>, <format>) > {code} > For reference, in other systems there are: > MS SQL: {{convert(<type>, <expression>)}}. See: > https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx > Netezza: {{to_timestamp(<string>, <format>)}}. See: > https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html > Teradata has special casting functionality: {{cast(<string> as timestamp > format '<format>')}} > MySQL: {{STR_TO_DATE(<str>, <format>)}}. This returns a datetime when you > define both date and time parts. See: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
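The one-step parsing being proposed can be sketched with Python's standard library (a hypothetical analogue to illustrate the idea, not the actual Spark API):

```python
from datetime import datetime

def to_timestamp(s, fmt):
    # Hypothetical analogue of the proposed to_timestamp(<string>, <format>):
    # parse in one step instead of unix_timestamp(...) followed by a cast.
    return datetime.strptime(s, fmt)

assert to_timestamp("2016-11-07 01:33:30", "%Y-%m-%d %H:%M:%S") == datetime(2016, 11, 7, 1, 33, 30)
```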
[jira] [Assigned] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18298: Assignee: (was: Apache Spark)
> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
> Reporter: Tao Wang
>
> When I started the HistoryServer to read event logs, the timestamps read are parsed using the local timezone, e.g. "CST" (confirmed via debug).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in the History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App ID | App Name | Started | Completed | Duration | Spark User | Last Updated | Event Log
> local-1478225166651 | Spark shell | 2016-11-04 02:06:06 | 2016-11-07 01:33:30 | 71.5 h | root | 2016-11-07 01:33:30 |
> {quote}
> I've checked the REST API and found results like:
> {color:red}
> [ {
> "id" : "local-1478225166651",
> "name" : "Spark shell",
> "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",
> "endTime" : "2016-11-07T01:33:30.265GMT",
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
> } ]
> }, {
> "id" : "local-1478224925869",
> "name" : "Spark Pi",
> "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the conversion happens in transfer between the server and the browser? I have no idea where to go from this point.
> Hope guys can offer some help, or just fix it if it's easy?
:) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
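The 8-hour gap in the report above is exactly the difference between rendering one epoch in GMT and in CST (UTC+08:00), which can be reproduced with plain Python:

```python
from datetime import datetime, timedelta, timezone

epoch_ms = 1478225166020  # startTimeEpoch from the REST output quoted above
t = datetime.fromtimestamp(epoch_ms / 1000.0, tz=timezone.utc)

cst = timezone(timedelta(hours=8))  # China Standard Time, UTC+08:00

# What the UI shows (GMT) vs. what a CST user expects (8 hours later):
assert t.strftime("%Y-%m-%d %H:%M:%S") == "2016-11-04 02:06:06"
assert t.astimezone(cst).strftime("%Y-%m-%d %H:%M:%S") == "2016-11-04 10:06:06"
```

The epoch values in the REST payload are timezone-independent, so the discrepancy must come from wherever the rendering layer picks its display timezone.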
[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646209#comment-15646209 ] Apache Spark commented on SPARK-18298: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/15803
[jira] [Assigned] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18298: Assignee: Apache Spark
[jira] [Updated] (SPARK-18340) Inconsistent error messages in launching scripts and hanging in sparkr script for wrong options
[ https://issues.apache.org/jira/browse/SPARK-18340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-18340: - Description: It seems there are some problems with handling wrong options as below: *{{spark-submit}} script - this one looks fine {code} spark-submit --aabbcc Error: Unrecognized option: --aabbcc Usage: spark-submit [options] [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). --class CLASS_NAME Your application's main class (for Java / Scala apps). --name NAME A name of your application. ... {code} *{{spark-sql}} script - this one looks fine {code} spark-sql --aabbcc Unrecognized option: --aabbcc usage: hive -d,--define
[jira] [Created] (SPARK-18340) Inconsistent error messages in launching scripts and hanging in sparkr script for wrong options
Hyukjin Kwon created SPARK-18340: Summary: Inconsistent error messages in launching scripts and hanging in sparkr script for wrong options Key: SPARK-18340 URL: https://issues.apache.org/jira/browse/SPARK-18340 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit Reporter: Hyukjin Kwon Priority: Minor
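One way launcher scripts keep wrong-option handling uniform is to validate options in a single shared parser so every entry point fails the same way instead of hanging. A toy Python sketch (invented names, not Spark's launcher code):

```python
import argparse

# A single shared parser: every script gets the same "unrecognized option"
# failure mode instead of script-specific behavior.
parser = argparse.ArgumentParser(prog="spark-script", add_help=False)
parser.add_argument("--master")

def check(argv):
    try:
        parser.parse_args(argv)
        return "ok"
    except SystemExit:
        # argparse reports "unrecognized arguments" and exits non-zero
        return "unrecognized option"

assert check(["--master", "local[*]"]) == "ok"
assert check(["--aabbcc"]) == "unrecognized option"
```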
[jira] [Updated] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17019: Target Version/s: 2.2.0 (was: 2.1.0) > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > the usage of off-heap is not exposed currently, it is not so convenient for > user to monitor and profile, so here propose to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST request returns both on-heap and off-heap memory. > 3. Also these two memory usage can be obtained programmatically from > SparkListener. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16317) Add file filtering interface for FileFormat
[ https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16317: Target Version/s: 2.2.0 (was: 2.1.0) > Add file filtering interface for FileFormat > --- > > Key: SPARK-16317 > URL: https://issues.apache.org/jira/browse/SPARK-16317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Priority: Minor > > {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) > have customized file filtering logics. For example, Parquet needs to filter > out summary files, while Avro provides a Hadoop configuration option to > filter out all files whose names don't end with ".avro". > It would be nice to have a general file filtering interface in {{FileFormat}} > to handle similar requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
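A hypothetical shape for such a file-filtering interface (invented names to illustrate the proposal, not a committed Spark API):

```python
class FileFormat:
    # Hypothetical hook: formats override this to skip non-data files
    # during file listing.
    def accepts(self, path):
        return True

class AvroFormat(FileFormat):
    def accepts(self, path):
        # mirrors the Avro option that keeps only files ending with ".avro"
        return path.endswith(".avro")

class ParquetFormat(FileFormat):
    def accepts(self, path):
        # Parquet must skip summary files such as _metadata / _common_metadata
        name = path.rsplit("/", 1)[-1]
        return not name.startswith("_")

files = ["part-0.avro", "readme.txt", "part-0.parquet", "_common_metadata"]
assert [f for f in files if AvroFormat().accepts(f)] == ["part-0.avro"]
assert [f for f in files if ParquetFormat().accepts(f)] == ["part-0.avro", "readme.txt", "part-0.parquet"]
```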
[jira] [Resolved] (SPARK-18261) Add statistics to MemorySink for joining
[ https://issues.apache.org/jira/browse/SPARK-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18261. - Resolution: Fixed Assignee: Liwei Lin Fix Version/s: 2.1.0 > Add statistics to MemorySink for joining > - > > Key: SPARK-18261 > URL: https://issues.apache.org/jira/browse/SPARK-18261 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.0.2 >Reporter: Burak Yavuz >Assignee: Liwei Lin > Fix For: 2.1.0 > > > Right now, there is no way to join the output of a memory sink with any table: > {code} > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics > {code} > Being able to join snapshots of memory streams with tables would be nice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18086. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 2.1.0 > Regression: Hive variables no longer work in Spark 2.0 > -- > > Key: SPARK-18086 > URL: https://issues.apache.org/jira/browse/SPARK-18086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.1.0 > > > The behavior of variables in the SQL shell has changed from 1.6 to 2.0. > Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer > work. Queries that worked correctly in 1.6 will either fail or produce > unexpected results in 2.0 so I think this is a regression that should be > addressed. > Hive and Spark 1.6 work like this: > 1. Command-line args --hiveconf and --hivevar can be used to set session > properties. --hiveconf properties are added to the Hadoop Configuration. > 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} > adds a Hive var. > 3. Hive vars can be substituted into queries by name, and Configuration > properties can be substituted using {{hiveconf:name}}. > In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then > the value in SQLConf for the rest of the key is returned. SET adds properties > to the session config and (according to [a > comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28]) > the Hadoop configuration "during I/O". 
> {code:title=Hive and Spark 1.6.1 behavior} > [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 > spark-sql> select "${hiveconf:test.conf}"; > 1 > spark-sql> select "${test.conf}"; > ${test.conf} > spark-sql> select "${hivevar:test.var}"; > 2 > spark-sql> select "${test.var}"; > 2 > spark-sql> set test.set=3; > SET test.set=3 > spark-sql> select "${test.set}" > "${test.set}" > spark-sql> select "${hivevar:test.set}" > "${hivevar:test.set}" > spark-sql> select "${hiveconf:test.set}" > 3 > spark-sql> set hivevar:test.setvar=4; > SET hivevar:test.setvar=4 > spark-sql> select "${hivevar:test.setvar}"; > 4 > spark-sql> select "${test.setvar}"; > 4 > {code} > {code:title=Spark 2.0.0 behavior} > [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 > spark-sql> select "${hiveconf:test.conf}"; > 1 > spark-sql> select "${test.conf}"; > 1 > spark-sql> select "${hivevar:test.var}"; > ${hivevar:test.var} > spark-sql> select "${test.var}"; > ${test.var} > spark-sql> set test.set=3; > test.set3 > spark-sql> select "${test.set}"; > 3 > spark-sql> set hivevar:test.setvar=4; > hivevar:test.setvar 4 > spark-sql> select "${hivevar:test.setvar}"; > 4 > spark-sql> select "${test.setvar}"; > ${test.setvar} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
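The 1.6 substitution rules shown in the first transcript can be modeled with a small resolver (a toy sketch, not Hive's or Spark's implementation): `${hivevar:name}` and bare `${name}` resolve from --hivevar definitions, `${hiveconf:name}` from --hiveconf properties, and unresolved references are left verbatim.

```python
import re

def substitute(sql, hivevars, hiveconf):
    # Toy model of Hive-style variable substitution.
    def repl(m):
        ns, name = m.group("ns"), m.group("name")
        if ns == "hiveconf":
            return hiveconf.get(name, m.group(0))
        return hivevars.get(name, m.group(0))
    return re.sub(r"\$\{(?:(?P<ns>hivevar|hiveconf):)?(?P<name>[\w.]+)\}", repl, sql)

hivevars = {"test.var": "2"}   # from --hivevar test.var=2
hiveconf = {"test.conf": "1"}  # from --hiveconf test.conf=1

assert substitute('select "${hiveconf:test.conf}"', hivevars, hiveconf) == 'select "1"'
assert substitute('select "${hivevar:test.var}"', hivevars, hiveconf) == 'select "2"'
assert substitute('select "${test.var}"', hivevars, hiveconf) == 'select "2"'
# Unresolved references are left verbatim, as in the 1.6 transcript.
assert substitute('select "${test.conf}"', hivevars, hiveconf) == 'select "${test.conf}"'
```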
[jira] [Created] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming
Burak Yavuz created SPARK-18339: --- Summary: Don't push down current_timestamp for filters in StructuredStreaming Key: SPARK-18339 URL: https://issues.apache.org/jira/browse/SPARK-18339 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.1 Reporter: Burak Yavuz
[jira] [Resolved] (SPARK-18295) Match up to_json to from_json in null safety
[ https://issues.apache.org/jira/browse/SPARK-18295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18295. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15792 [https://github.com/apache/spark/pull/15792] > Match up to_json to from_json in null safety > > > Key: SPARK-18295 > URL: https://issues.apache.org/jira/browse/SPARK-18295 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > {code} > scala> val df = Seq(Some(Tuple1(Tuple1(1))), None).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: struct<_1: int>] > scala> df.show() > ++ > | a| > ++ > | [1]| > |null| > ++ > scala> df.select(to_json($"a")).show() > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeFields(JacksonGenerator.scala:138) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator$$anonfun$write$1.apply$mcV$sp(JacksonGenerator.scala:194) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.org$apache$spark$sql$catalyst$json$JacksonGenerator$$writeObject(JacksonGenerator.scala:131) > at > org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:193) > at > org.apache.spark.sql.catalyst.expressions.StructToJson.eval(jsonExpressions.scala:544) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) > at > org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48) > at > org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
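The null-safe behavior asked for here can be sketched in miniature: a null input row should map to a null output instead of a NullPointerException. A Python analogue (illustrative only, not Spark's JacksonGenerator):

```python
import json

def to_json(row):
    # A null input row yields SQL NULL (None) instead of raising, matching
    # the null-safety that from_json already has.
    return None if row is None else json.dumps(row)

rows = [{"_1": 1}, None]   # mirrors Seq(Some(Tuple1(Tuple1(1))), None)
assert [to_json(r) for r in rows] == ['{"_1": 1}', None]
```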
[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646011#comment-15646011 ] Yanbo Liang commented on SPARK-18318: - I'm interested in contributing this task. Thanks. > ML, Graph 2.1 QA: API: New Scala APIs, docs > --- > > Key: SPARK-18318 > URL: https://issues.apache.org/jira/browse/SPARK-18318 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18316) Spark MLlib, GraphX 2.1 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-18316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646006#comment-15646006 ] Yanbo Liang commented on SPARK-18316: - Typo? Here should be 2.1 rather than 2.0? > Spark MLlib, GraphX 2.1 QA umbrella > --- > > Key: SPARK-18316 > URL: https://issues.apache.org/jira/browse/SPARK-18316 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate: [SPARK-18329].* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > * Major new algorithms: MinHash, RandomProjection > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18236) Reduce memory usage of Spark UI and HistoryServer by reducing duplicate objects
[ https://issues.apache.org/jira/browse/SPARK-18236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-18236. Resolution: Fixed Fix Version/s: 2.2.0 Merged into master (2.2.0). > Reduce memory usage of Spark UI and HistoryServer by reducing duplicate > objects > --- > > Key: SPARK-18236 > URL: https://issues.apache.org/jira/browse/SPARK-18236 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > When profiling heap dumps from the Spark History Server and live Spark web > UIs, I found a tremendous amount of memory being wasted on duplicate objects > and strings. A few small changes can cut per-task UI memory by half or more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
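One of the cheapest forms of this deduplication is string interning, shown here in Python for illustration (the actual Spark patch is in Scala/Java): many UI rows repeat the same host names and executor IDs, and canonicalizing them makes equal strings share one object.

```python
import sys

def dedup(strings):
    # Replace each string with a canonical shared copy; equal duplicates
    # then cost one object instead of many.
    return [sys.intern(s) for s in strings]

rows = [("executor-" + str(i % 2)) for i in range(4)]  # repeated values
canon = dedup(rows)
assert canon == rows            # same values...
assert canon[0] is canon[2]     # ...but duplicates now share one object
assert canon[1] is canon[3]
```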
[jira] [Updated] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds
[ https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-18338: --- Description: Test case initialization order under Maven and SBT are different. Maven always creates instances of all test cases and then run them all together. This fails {{ObjectHashAggregateSuite}} because the randomized test cases there register a temporary Hive function right before creating a test case, and can be cleared while initializing other successive test cases. In SBT, this is fine since the created test case is executed immediately after creating the temporary function. To fix this issue, we should put initialization/destruction code into {{beforeAll()}} and {{afterAll()}}. was: Test case initialization order under Maven and SBT are different. Maven always creates instances of all test cases and then run them altogether. This fails {{ObjectHashAggregateSuite}} because the randomized test cases their registers a temporary Hive function right before creating a test case, and can be cleared while initializing other successive test cases. In SBT, this is fine since the created test case is executed immediately after creating the temporary function. To fix this issue, we should put initialization/destruction code into {{beforeAll()}} and {{afterAll()}}. > ObjectHashAggregateSuite fails under Maven builds > - > > Key: SPARK-18338 > URL: https://issues.apache.org/jira/browse/SPARK-18338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: flaky-test > > Test case initialization order under Maven and SBT are different. Maven > always creates instances of all test cases and then run them all together. > This fails {{ObjectHashAggregateSuite}} because the randomized test cases > there register a temporary Hive function right before creating a test case, > and can be cleared while initializing other successive test cases. 
> In SBT, this is fine since the created test case is executed immediately > after creating the temporary function. > To fix this issue, we should put initialization/destruction code into > {{beforeAll()}} and {{afterAll()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds
[ https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645865#comment-15645865 ] Apache Spark commented on SPARK-18338: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/15802
[jira] [Assigned] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds
[ https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18338: Assignee: Apache Spark (was: Cheng Lian)
[jira] [Assigned] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds
[ https://issues.apache.org/jira/browse/SPARK-18338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18338: Assignee: Cheng Lian (was: Apache Spark) > ObjectHashAggregateSuite fails under Maven builds > - > > Key: SPARK-18338 > URL: https://issues.apache.org/jira/browse/SPARK-18338 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: flaky-test > > Test case initialization order under Maven and SBT is different. Maven > always creates instances of all test cases and then runs them all together. > This fails {{ObjectHashAggregateSuite}} because the randomized test cases > register a temporary Hive function right before each test case is created, > and this function can be cleared while other, subsequent test cases are being > initialized. > In SBT, this is fine since the created test case is executed immediately > after creating the temporary function. > To fix this issue, we should put initialization/destruction code into > {{beforeAll()}} and {{afterAll()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18338) ObjectHashAggregateSuite fails under Maven builds
Cheng Lian created SPARK-18338: -- Summary: ObjectHashAggregateSuite fails under Maven builds Key: SPARK-18338 URL: https://issues.apache.org/jira/browse/SPARK-18338 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Cheng Lian Assignee: Cheng Lian Test case initialization order under Maven and SBT is different. Maven always creates instances of all test cases and then runs them all together. This fails {{ObjectHashAggregateSuite}} because the randomized test cases register a temporary Hive function right before each test case is created, and this function can be cleared while other, subsequent test cases are being initialized. In SBT, this is fine since the created test case is executed immediately after creating the temporary function. To fix this issue, we should put initialization/destruction code into {{beforeAll()}} and {{afterAll()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
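The initialization-order difference described in SPARK-18338 can be sketched in plain Scala (all names here are hypothetical stand-ins; a toy registry plays the role of Hive's temporary-function registry, and plain classes stand in for ScalaTest suites):

```scala
import scala.collection.mutable

// Toy stand-in for the Hive temporary function registry (hypothetical names).
object FunctionRegistry {
  private val funcs = mutable.Set.empty[String]
  def register(name: String): Unit = funcs += name
  def clear(): Unit = funcs.clear()
  def contains(name: String): Boolean = funcs(name)
}

// A suite that registers a temporary function at construction time,
// mirroring how the randomized test cases register right before creation.
class RegisteringSuite {
  FunctionRegistry.register("tempFn")
  def run(): Boolean = FunctionRegistry.contains("tempFn")
}

// Another suite whose initialization clears the registry.
class ClearingSuite {
  FunctionRegistry.clear()
}

object InitOrderDemo {
  // Maven-style: construct every suite first, then run them.
  def mavenStyle(): Boolean = {
    val a = new RegisteringSuite
    val b = new ClearingSuite // b's initializer wipes a's registration
    a.run()                   // false
  }

  // SBT-style: construct and run each suite immediately.
  def sbtStyle(): Boolean = {
    val a = new RegisteringSuite
    val ok = a.run()          // true: nothing ran in between
    val b = new ClearingSuite
    ok
  }
}
```

Moving the registration into {{beforeAll()}} (run at suite start rather than at construction) makes the Maven ordering safe, which is the fix proposed in the ticket.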
[jira] [Resolved] (SPARK-17490) Optimize SerializeFromObject for primitive array
[ https://issues.apache.org/jira/browse/SPARK-17490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17490. --- Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 2.1.0 > Optimize SerializeFromObject for primitive array > > > Key: SPARK-17490 > URL: https://issues.apache.org/jira/browse/SPARK-17490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.1.0 > > > In a logical plan, {{SerializeFromObject}} for an array always uses > {{GenericArrayData}} as the destination. {{UnsafeArrayData}} could be used for > a primitive array. This is a simple approach to solving the issues that are > addressed by SPARK-16043. > Here is a motivating example. > {code} > sparkContext.parallelize(Seq(Array(1)), 1).toDS.map(e => e).show > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645652#comment-15645652 ] Mark Tygert edited comment on SPARK-8614 at 11/7/16 10:36 PM: -- This remains a big issue, rendering the results produced by MLlib incorrect for most matrix decompositions and matrix-matrix multiplications when using multiple executors or workers. [~hl475] of Yale is working to fix the problem, and eventually ML for DataFrames will need to incorporate his solutions. was (Author: tygert): This remains a big issue, rendering the results produced by MLlib incorrect for most matrix decompositions and matrix-matrix multiplications when using multiple executors or workers. Huamin Li of Yale is working to fix the problem, and eventually ML for DataFrames will need to incorporate his solutions. > Row order preservation for operations on MLlib IndexedRowMatrix > --- > > Key: SPARK-8614 > URL: https://issues.apache.org/jira/browse/SPARK-8614 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Jan Luts > > In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, indices are > dropped before calling the methods from RowMatrix. For example, for > IndexedRowMatrix.computeSVD: >val svd = toRowMatrix().computeSVD(k, computeU, rCond) > and for IndexedRowMatrix.multiply: >val mat = toRowMatrix().multiply(B). > After computing these results, they are zipped with the original indices, > e.g. for IndexedRowMatrix.computeSVD >val indexedRows = indices.zip(svd.U.rows).map { case (i, v) => > IndexedRow(i, v) >} > and for IndexedRowMatrix.multiply: > >val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) => > IndexedRow(i, v) >} > I have experienced that for IndexedRowMatrix.computeSVD().U and > IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row > indices can get mixed (when running Spark jobs with multiple > executors/machines): i.e.
the vectors and indices of the result do not seem > to correspond anymore. > To me it looks like this is caused by zipping RDDs that have a different > ordering? > For the IndexedRowMatrix.multiply I have observed that ordering within > partitions is preserved, but that it seems to get mixed up between > partitions. For example, for: > part1Index1 part1Vector1 > part1Index2 part1Vector2 > part2Index1 part2Vector1 > part2Index2 part2Vector2 > I got: > part2Index1 part1Vector1 > part2Index2 part1Vector2 > part1Index1 part2Vector1 > part1Index2 part2Vector2 > Another observation is that the mapPartitions in RowMatrix.multiply: > val AB = rows.mapPartitions { iter => > had a "preservesPartitioning = true" argument in version 1.0, but this is no > longer there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
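The safer pattern hinted at in SPARK-8614 is to carry each row's index through the computation instead of zipping results back against a separately ordered index RDD. A minimal sketch in plain Scala (collections stand in for RDDs; names are hypothetical, not the actual MLlib implementation):

```scala
// Each row keeps its index paired with its data through the transformation,
// so no post-hoc zip can mismatch indices and vectors.
case class IndexedRow(index: Long, vector: Vector[Double])

// Multiplies each 1 x n row vector by an n x m matrix B, preserving indices.
def multiplyPreservingIndices(rows: Seq[IndexedRow],
                              b: Vector[Vector[Double]]): Seq[IndexedRow] =
  rows.map { row =>
    // For each column of B, take the dot product with the row vector.
    val product = b.transpose.map { col =>
      col.zip(row.vector).map { case (x, y) => x * y }.sum
    }
    IndexedRow(row.index, product)
  }
```

Because the index travels inside each record, partition reshuffling cannot pair one partition's indices with another partition's vectors, which is exactly the failure mode reported above.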
[jira] [Commented] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode
[ https://issues.apache.org/jira/browse/SPARK-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645673#comment-15645673 ] Apache Spark commented on SPARK-18337: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15801 > Memory Sink should be able to recover from checkpoints in Complete OutputMode > - > > Key: SPARK-18337 > URL: https://issues.apache.org/jira/browse/SPARK-18337 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz > > Memory sinks are not meant to be fault tolerant, but there are certain cases > where it would be nice if they could recover from checkpoints. In cases where > you may use a scalable StateStore in Structured Streaming (when you have an > aggregation) and you add a filter based on a key or value in your state, > it's nice to be able to continue from where you left off after failures. > For correctness reasons, the output will ONLY be correct in Complete mode, so > we could support that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode
[ https://issues.apache.org/jira/browse/SPARK-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18337: Assignee: (was: Apache Spark) > Memory Sink should be able to recover from checkpoints in Complete OutputMode > - > > Key: SPARK-18337 > URL: https://issues.apache.org/jira/browse/SPARK-18337 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz > > Memory sinks are not meant to be fault tolerant, but there are certain cases > where it would be nice if they could recover from checkpoints. In cases where > you may use a scalable StateStore in Structured Streaming (when you have an > aggregation) and you add a filter based on a key or value in your state, > it's nice to be able to continue from where you left off after failures. > For correctness reasons, the output will ONLY be correct in Complete mode, so > we could support that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode
[ https://issues.apache.org/jira/browse/SPARK-18337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18337: Assignee: Apache Spark > Memory Sink should be able to recover from checkpoints in Complete OutputMode > - > > Key: SPARK-18337 > URL: https://issues.apache.org/jira/browse/SPARK-18337 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.1 >Reporter: Burak Yavuz >Assignee: Apache Spark > > Memory sinks are not meant to be fault tolerant, but there are certain cases > where it would be nice if they could recover from checkpoints. In cases where > you may use a scalable StateStore in Structured Streaming (when you have an > aggregation) and you add a filter based on a key or value in your state, > it's nice to be able to continue from where you left off after failures. > For correctness reasons, the output will ONLY be correct in Complete mode, so > we could support that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18337) Memory Sink should be able to recover from checkpoints in Complete OutputMode
Burak Yavuz created SPARK-18337: --- Summary: Memory Sink should be able to recover from checkpoints in Complete OutputMode Key: SPARK-18337 URL: https://issues.apache.org/jira/browse/SPARK-18337 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.0.1 Reporter: Burak Yavuz Memory sinks are not meant to be fault tolerant, but there are certain cases where it would be nice if they could recover from checkpoints. In cases where you may use a scalable StateStore in Structured Streaming (when you have an aggregation) and you add a filter based on a key or value in your state, it's nice to be able to continue from where you left off after failures. For correctness reasons, the output will ONLY be correct in Complete mode, so we could support that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8614) Row order preservation for operations on MLlib IndexedRowMatrix
[ https://issues.apache.org/jira/browse/SPARK-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645652#comment-15645652 ] Mark Tygert commented on SPARK-8614: This remains a big issue, rendering the results produced by MLlib incorrect for most matrix decompositions and matrix-matrix multiplications when using multiple executors or workers. Huamin Li of Yale is working to fix the problem, and eventually ML for DataFrames will need to incorporate his solutions. > Row order preservation for operations on MLlib IndexedRowMatrix > --- > > Key: SPARK-8614 > URL: https://issues.apache.org/jira/browse/SPARK-8614 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Jan Luts > > In both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply, indices are > dropped before calling the methods from RowMatrix. For example, for > IndexedRowMatrix.computeSVD: >val svd = toRowMatrix().computeSVD(k, computeU, rCond) > and for IndexedRowMatrix.multiply: >val mat = toRowMatrix().multiply(B). > After computing these results, they are zipped with the original indices, > e.g. for IndexedRowMatrix.computeSVD >val indexedRows = indices.zip(svd.U.rows).map { case (i, v) => > IndexedRow(i, v) >} > and for IndexedRowMatrix.multiply: > >val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) => > IndexedRow(i, v) >} > I have experienced that for IndexedRowMatrix.computeSVD().U and > IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply) row > indices can get mixed (when running Spark jobs with multiple > executors/machines): i.e. the vectors and indices of the result do not seem > to correspond anymore. > To me it looks like this is caused by zipping RDDs that have a different > ordering? > For the IndexedRowMatrix.multiply I have observed that ordering within > partitions is preserved, but that it seems to get mixed up between > partitions.
For example, for: > part1Index1 part1Vector1 > part1Index2 part1Vector2 > part2Index1 part2Vector1 > part2Index2 part2Vector2 > I got: > part2Index1 part1Vector1 > part2Index2 part1Vector2 > part1Index1 part2Vector1 > part1Index2 part2Vector2 > Another observation is that the mapPartitions in RowMatrix.multiply: > val AB = rows.mapPartitions { iter => > had a "preservesPartitioning = true" argument in version 1.0, but this is no > longer there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17993) Spark prints an avalanche of warning messages from Parquet when reading parquet files written by older versions of Parquet-mr
[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17993: --- Summary: Spark prints an avalanche of warning messages from Parquet when reading parquet files written by older versions of Parquet-mr (was: Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr) > Spark prints an avalanche of warning messages from Parquet when reading > parquet files written by older versions of Parquet-mr > - > > Key: SPARK-17993 > URL: https://issues.apache.org/jira/browse/SPARK-17993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Allman > > It looks like https://github.com/apache/spark/pull/14690 broke parquet log > output redirection. After that patch, when querying parquet files written by > Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from > the Parquet reader: > {code} > Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: > Ignoring statistics because created_by could not be parsed (see PARQUET-251): > parquet-mr version 1.6.0 > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) > )?\(build ?(.*)\) > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > at > org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) > at > 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This only happens during execution, not planning, and it doesn't matter what > log level the {{SparkContext}} is set to. > This is a regression I noted as something we needed to fix as a follow up to > PR 14690. I feel responsible, so I'm going to expedite a fix for it. I > suspect that PR broke Spark's Parquet log output redirection. That's the > premise I'm going by. -- This message was sent by
[jira] [Assigned] (SPARK-18334) MinHash should use binary hash distance
[ https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18334: Assignee: (was: Apache Spark) > MinHash should use binary hash distance > --- > > Key: SPARK-18334 > URL: https://issues.apache.org/jira/browse/SPARK-18334 > Project: Spark > Issue Type: Bug >Reporter: Yun Ni >Priority: Trivial > > MinHash currently uses the same `hashDistance` function as > RandomProjection. This does not make sense for MinHash because the Jaccard > distance of two sets is unrelated to the absolute distance between their hash > bucket indices. > This bug could affect the accuracy of multi-probe NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18334) MinHash should use binary hash distance
[ https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645540#comment-15645540 ] Apache Spark commented on SPARK-18334: -- User 'Yunni' has created a pull request for this issue: https://github.com/apache/spark/pull/15800 > MinHash should use binary hash distance > --- > > Key: SPARK-18334 > URL: https://issues.apache.org/jira/browse/SPARK-18334 > Project: Spark > Issue Type: Bug >Reporter: Yun Ni >Priority: Trivial > > MinHash currently uses the same `hashDistance` function as > RandomProjection. This does not make sense for MinHash because the Jaccard > distance of two sets is unrelated to the absolute distance between their hash > bucket indices. > This bug could affect the accuracy of multi-probe NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18334) MinHash should use binary hash distance
[ https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18334: Assignee: Apache Spark > MinHash should use binary hash distance > --- > > Key: SPARK-18334 > URL: https://issues.apache.org/jira/browse/SPARK-18334 > Project: Spark > Issue Type: Bug >Reporter: Yun Ni >Assignee: Apache Spark >Priority: Trivial > > MinHash currently uses the same `hashDistance` function as > RandomProjection. This does not make sense for MinHash because the Jaccard > distance of two sets is unrelated to the absolute distance between their hash > bucket indices. > This bug could affect the accuracy of multi-probe NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
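One plausible shape for the distance SPARK-18334 asks for (a hedged sketch, not the actual Spark patch): per pair of hash bands, only ask whether the buckets match, and take the minimum across bands, so numeric closeness of bucket indices plays no role:

```scala
// Hedged sketch of a binary hash distance for MinHash-style LSH: two hash
// vectors are "close" (distance 0.0) if any band collides; otherwise 1.0.
// The absolute difference between bucket indices is deliberately ignored.
def binaryHashDistance(x: Seq[Int], y: Seq[Int]): Double = {
  require(x.length == y.length, "hash vectors must have equal length")
  x.zip(y).map { case (a, b) => if (a == b) 0.0 else 1.0 }.min
}
```

Under this definition, two sets whose MinHash signatures collide in even one band are treated as candidates, which matches how LSH bucketing actually retrieves neighbors.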
[jira] [Commented] (SPARK-17993) Spark spews a slew of harmless but annoying warning messages from Parquet when reading parquet files written by older versions of Parquet-mr
[ https://issues.apache.org/jira/browse/SPARK-17993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645495#comment-15645495 ] Michael Allman commented on SPARK-17993: Thank you for your input, Keith. I agree this is a major issue, and I'm trying to get this resolved for 2.1. > Spark spews a slew of harmless but annoying warning messages from Parquet > when reading parquet files written by older versions of Parquet-mr > > > Key: SPARK-17993 > URL: https://issues.apache.org/jira/browse/SPARK-17993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Allman > > It looks like https://github.com/apache/spark/pull/14690 broke parquet log > output redirection. After that patch, when querying parquet files written by > Parquet-mr 1.6.0 Spark prints a torrent of (harmless) warning messages from > the Parquet reader: > {code} > Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: > Ignoring statistics because created_by could not be parsed (see PARQUET-251): > parquet-mr version 1.6.0 > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) > )?\(build ?(.*)\) > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > at > org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225) > at > 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 
{code} > This only happens during execution, not planning, and it doesn't matter what > log level the {{SparkContext}} is set to. > This is a regression I noted as something we needed to fix as a follow up to > PR 14690. I feel responsible, so I'm going to expedite a fix for it. I > suspect that PR broke Spark's Parquet log output redirection. That's the > premise I'm going by. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,