[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client
[ https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617645#comment-15617645 ]

J.P Feng commented on SPARK-18107:
----------------------------------

I have tested the performance before and after the patch [https://github.com/apache/spark/pull/15667]. The improvement appears to be marginal: the query takes 531 seconds before patching and 518 seconds after patching. I will add the execution logs in the work log later.

> Insert overwrite statement runs much slower in spark-sql than it does in hive-client
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-18107
>                 URL: https://issues.apache.org/jira/browse/SPARK-18107
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>         Environment: spark 2.0.0
>                      hive 2.0.1
>            Reporter: J.P Feng
>
> An insert overwrite statement run in spark-sql or spark-shell takes much more time than the same statement run in the hive client (started from apache-hive-2.0.1-bin/bin/hive): Spark costs about ten minutes, while the hive client costs less than 20 seconds.
>
> These are the steps I took. The test SQL is:
>
> {code}
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name, role_id, server, '1476979200' as recdate,
>   'mix' as platform, 'mix' as pid, 'mix' as dev
> from tbllog_login
> where pt='mix_en' and dt='2016-10-21';
> {code}
>
> There are 257128 rows of data in tbllog_login with partition(pt='mix_en',dt='2016-10-21').
>
> ps: I am sure it is the "insert overwrite" that costs most of the time in Spark; perhaps the overwrite needs to spend a lot of time on I/O or on something else.
>
> I also compared the execution times of the insert overwrite statement and the insert into statement:
> 1. insert overwrite vs. insert into in Spark:
>    the insert overwrite statement costs about 10 minutes;
>    the insert into statement costs about 30 seconds.
> 2. insert into in Spark vs. insert into in the hive client:
>    Spark costs about 30 seconds; the hive client costs about 20 seconds.
>    The difference is small enough to ignore.
[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client
[ https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617654#comment-15617654 ]

J.P Feng commented on SPARK-18107:
----------------------------------

insert overwrite in spark 2.0.0, without patch:

{code}
scala> val befores = System.currentTimeMillis()
scala> spark.sql("insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') " +
     |   "select distinct account_name, role_id, server, '1476979200' as recdate, 'mix' as platform, " +
     |   "'mix' as pid, 'mix' as dev from tbllog_login where pt='mix_en' and dt='2016-10-21'")
scala> val interval = System.currentTimeMillis() - befores
scala> println(s"insertval is => ${interval/1000} seconds")
{code}

[lengthy GC and stage-progress log output for Stages 4 and 5 elided; the original message truncates mid-log]
[jira] [Comment Edited] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client
[ https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617645#comment-15617645 ]

J.P Feng edited comment on SPARK-18107 at 10/29/16 7:42 AM:
------------------------------------------------------------

I have tested the performance before and after the patch [https://github.com/apache/spark/pull/15667]. The improvement appears to be marginal: the query takes 531 seconds before patching and 518 seconds after patching. I will add the execution logs in a new comment later.

was (Author: snodawn):
I have tested the performance before and after the patch [https://github.com/apache/spark/pull/15667]. The improvement appears to be marginal: the query takes 531 seconds before patching and 518 seconds after patching. I will add the execution logs in the work log later.
[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client
[ https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617663#comment-15617663 ]

J.P Feng commented on SPARK-18107:
----------------------------------

insert overwrite in spark 2.1.0, with patch:

{code}
scala> val befores = System.currentTimeMillis()
scala> spark.sql("insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') " +
     |   "select distinct account_name, role_id, server, '1476979200' as recdate, 'mix' as platform, " +
     |   "'mix' as pid, 'mix' as dev from tbllog_login where pt='mix_en' and dt='2016-10-21'")
scala> val interval = System.currentTimeMillis() - befores
scala> println(s"insertval is => ${interval/1000} seconds")
{code}

16/10/29 14:24:12 WARN HiveConf: HiveConf of name hive.server2.auth.hadoop does not exist

[lengthy GC and stage-progress log output for Stages 0 and 1 elided; the original message truncates mid-log]
[jira] [Commented] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution
[ https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617995#comment-15617995 ]

Sean Owen commented on SPARK-18166:
-----------------------------------

Agree, feel free to open a PR to fix that.

> GeneralizedLinearRegression Wrong Value Range for Poisson Distribution
> ------------------------------------------------------------------------
>
>                 Key: SPARK-18166
>                 URL: https://issues.apache.org/jira/browse/SPARK-18166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Wayne Zhang
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The current implementation of the Poisson GLM allows only strictly positive response values (see below). This is not correct, since the support of the Poisson distribution includes zero.
>
> {code}
> override def initialize(y: Double, weight: Double): Double = {
>   require(y > 0.0, "The response variable of Poisson family " +
>     s"should be positive, but got $y")
>   y
> }
> {code}
>
> The fix is easy; just change the strict inequality to:
>
> {code}
> require(y >= 0.0, "The response variable of Poisson family " +
> {code}
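For context on why the strict inequality is a bug: zero is a legitimate Poisson outcome ("no events observed"), so training data containing zero labels should fit without error. A minimal sketch of the failing case, assuming a local SparkSession and the Spark ML 2.0 API:

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("poisson-zero").getOrCreate()
import spark.implicits._

// Count data legitimately contains zeros.
val df = Seq(
  (0.0, Vectors.dense(1.0, 2.0)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(3.0, 0.5))
).toDF("label", "features")

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")

// With the current y > 0.0 check, fit() fails on the zero label;
// with y >= 0.0 it should fit normally.
val model = glr.fit(df)
println(model.coefficients)
{code}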
[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618022#comment-15618022 ]

Shuai Lin commented on SPARK-4563:
----------------------------------

To do that, I think we need to add two extra options, {{spark.driver.advertisePort}} and {{spark.driver.blockManager.advertisePort}}, and pass them to the executors (instead of {{spark.driver.port}} and {{spark.driver.blockManager.port}}) when present.

> Allow spark driver to bind to different ip then advertise ip
> --------------------------------------------------------------
>
>                 Key: SPARK-4563
>                 URL: https://issues.apache.org/jira/browse/SPARK-4563
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>            Reporter: Long Nguyen
>            Assignee: Marcelo Vanzin
>            Priority: Minor
>             Fix For: 2.1.0
>
> The Spark driver's bind IP and advertised IP are not separately configurable: spark.driver.host is only the bind IP, and SPARK_PUBLIC_DNS does not work for the Spark driver. Allow an option to set the advertised ip/hostname.
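To make the proposal concrete, here is a configuration sketch. Note that {{spark.driver.advertisePort}} and {{spark.driver.blockManager.advertisePort}} are the option names proposed in the comment above, not existing Spark configuration keys, so this is purely hypothetical usage:

{code}
import org.apache.spark.SparkConf

// Hypothetical scenario: a driver behind NAT binds to local ports but
// needs executors to connect via the externally mapped ports.
val conf = new SparkConf()
  .setAppName("advertise-port-sketch")
  // Ports the driver actually binds to inside the NAT/container:
  .set("spark.driver.port", "7001")
  .set("spark.driver.blockManager.port", "7002")
  // Proposed (not yet existing) options: ports advertised to executors:
  .set("spark.driver.advertisePort", "17001")
  .set("spark.driver.blockManager.advertisePort", "17002")
{code}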
[jira] [Issue Comment Deleted] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lianhui Wang updated SPARK-15616:
---------------------------------
    Comment: was deleted

(was: I have updated the code and fixed the problem that you pointed out. Thanks. I think you can try again.)

> Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15616
>                 URL: https://issues.apache.org/jira/browse/SPARK-15616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join operation, we rely on the table size returned by the Metastore to decide whether the operation can be converted to a broadcast join. When a Filter prunes some partitions, Hive prunes them before deciding on a broadcast join, using the HDFS size of only the partitions involved in the query. Spark SQL should do the same, which would improve join performance for partitioned tables.
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618475#comment-15618475 ]

Lianhui Wang commented on SPARK-15616:
--------------------------------------

I have updated the code and fixed the problem that you pointed out. Thanks. I think you can try again.
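For readers following the proposal: the fallback amounts to summing the on-disk size of just the partitions that survive pruning and feeding that into the broadcast-join decision. A rough sketch under that assumption, with illustrative names that are not from the actual patch:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative only: estimate a partitioned table's size from the HDFS
// footprint of the partitions that survive partition pruning, for use
// when Metastore statistics are unavailable.
def prunedPartitionsSizeInBytes(
    partitionPaths: Seq[String],
    hadoopConf: Configuration): Long = {
  partitionPaths.map { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(hadoopConf)
    // getContentSummary walks the directory tree and returns total bytes.
    fs.getContentSummary(path).getLength
  }.sum
}

// If this sum is below spark.sql.autoBroadcastJoinThreshold, the planner
// could still choose a broadcast join even when the whole table's
// statistics are missing or exceed the threshold.
{code}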
[jira] [Commented] (SPARK-14900) spark.ml classification metrics should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-14900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618637#comment-15618637 ]

Nicholas Chammas commented on SPARK-14900:
------------------------------------------

I don't know if this belongs in a separate issue, or if it was intended to be addressed as part of this work, but I can't find {{accuracy}} among the methods and attributes available on {{pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary}}. These are the attributes and methods I see in 2.0.1:

{code}
'areaUnderROC',
'fMeasureByThreshold',
'featuresCol',
'labelCol',
'objectiveHistory',
'pr',
'precisionByThreshold',
'predictions',
'probabilityCol',
'recallByThreshold',
'roc',
'totalIterations'
{code}

Was this an oversight, or am I looking in the wrong place?

> spark.ml classification metrics should include accuracy
> ---------------------------------------------------------
>
>                 Key: SPARK-14900
>                 URL: https://issues.apache.org/jira/browse/SPARK-14900
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Miao Wang
>            Priority: Minor
>             Fix For: 2.0.0
>
> To compute "accuracy" (0/1 classification accuracy), users can use {{precision}} in MulticlassMetrics and MulticlassClassificationEvaluator.metricName. We should also support "accuracy" directly as an alias to help users familiar with that name.
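As a workaround while the pyspark training summary lacks an accuracy field, the "accuracy" metric this ticket added to the evaluator can be computed from the predictions DataFrame directly. A sketch in Scala (the pyspark evaluator is analogous), assuming a fitted classification model {{model}} and a test DataFrame {{test}} with a "label" column:

{code}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Compute 0/1 accuracy from predictions instead of reading it off the
// training summary.
val predictions = model.transform(test)

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy") // supported as of Spark 2.0 (this ticket)

val accuracy = evaluator.evaluate(predictions)
println(s"accuracy = $accuracy")
{code}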
[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names
[ https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618686#comment-15618686 ]

Michael Allman commented on SPARK-17990:
----------------------------------------

Has a decision been made on how we want to handle this? I just tried this recipe again with the latest build from master and got the same behavior.

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17990
>                 URL: https://issues.apache.org/jira/browse/SPARK-17990
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>         Environment: Linux
>                      Mac OS with a case-sensitive filesystem
>            Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding those directories as table partition metadata is a common use case. However, for tables whose partition column names contain upper-case letters, the SQL command {{ALTER TABLE ... ADD PARTITION}} does not work, as illustrated in the following example:
>
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
>
> At this point, running {{hadoop fs -ls /tmp/mixed_case_partitioning}} produces the following:
>
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=9
> {code}
>
> Returning to the Spark shell, we execute the following to add the partition metadata:
>
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add partition(partCol=$p)") }
> {code}
>
> Examining the HDFS file listing again, we see:
>
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:52 /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:53 /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:53 /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:53 /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:53 /tmp/mixed_case_partitioning/partcol=3
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:53 /tmp/mixed_case_partitioning/partcol=4
> drwxr-xr-x   - msa supergroup          0 2016-10-18 17:53 /tmp/mixed_case_p
> {code}
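One possible mitigation, offered only as an untested sketch: {{ALTER TABLE ... ADD PARTITION}} accepts an explicit {{LOCATION}} clause, so the partitions can be pointed at the existing mixed-case directories rather than letting Spark derive (and lowercase) the paths itself. Whether the stored partition spec still gets lowercased is exactly the open question in this ticket:

{code}
// Untested workaround sketch: supply each partition's location explicitly
// so that ADD PARTITION does not create new lower-cased directories.
(0 to 9).foreach { p =>
  sql(s"alter table mixed_case_partitioning add partition(partCol=$p) " +
    s"location '/tmp/mixed_case_partitioning/partCol=$p'")
}
{code}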
[jira] [Created] (SPARK-18169) Suppress warnings when dropping views on a dropped table
Dongjoon Hyun created SPARK-18169:
-------------------------------------

             Summary: Suppress warnings when dropping views on a dropped table
                 Key: SPARK-18169
                 URL: https://issues.apache.org/jira/browse/SPARK-18169
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.1, 2.0.0
            Reporter: Dongjoon Hyun
            Priority: Minor

Apache Spark 2.0.0 through 2.0.2-rc1 logs an inconsistent *AnalysisException* warning message when dropping a *view* on a dropped table. This does not happen when dropping *temporary views*. Also, Spark 1.6.x does not show these warnings. We had better suppress this to be more consistent within Spark 2.x and with Spark 1.6.x.

{code}
scala> sql("create table t(a int)")
scala> sql("create view v as select * from t")
scala> sql("create temporary view tv as select * from t")
scala> sql("drop table t")
scala> sql("drop view tv")
scala> sql("drop view v")
16/10/29 15:50:03 WARN DropTableCommand: org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`t`; line 1 pos 91;
'SubqueryAlias v, `default`.`v`
+- 'Project ['gen_attr_0 AS a#19]
   +- 'SubqueryAlias t
      +- 'Project ['gen_attr_0]
         +- 'SubqueryAlias gen_subquery_0
            +- 'Project ['a AS gen_attr_0#18]
               +- 'UnresolvedRelation `default`.`t`

org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`t`; line 1 pos 91;
'SubqueryAlias v, `default`.`v`
+- 'Project ['gen_attr_0 AS a#19]
   +- 'SubqueryAlias t
      +- 'Project ['gen_attr_0]
         +- 'SubqueryAlias gen_subquery_0
            +- 'Project ['a AS gen_attr_0#18]
               +- 'UnresolvedRelation `default`.`t`
  ...
res5: org.apache.spark.sql.DataFrame = []
{code}

Note that this is different from the case of dropping a non-existent view; for a non-existent view, Spark raises NoSuchTableException.
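The suppression the ticket asks for is essentially "expect the analysis failure and drop anyway". A minimal sketch of that pattern, illustrative only and not the actual fix inside DropTableCommand:

{code}
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Illustrative sketch: a view whose underlying table was dropped fails
// to analyze, but the drop itself can proceed without logging a warning.
def dropViewQuietly(spark: SparkSession, viewName: String): Unit = {
  try {
    // Analyzing the stale view throws AnalysisException ("Table or view
    // not found") because its UnresolvedRelation no longer resolves.
    spark.table(viewName)
  } catch {
    case _: AnalysisException => // expected for a stale view: suppress
  }
  spark.sql(s"DROP VIEW IF EXISTS $viewName")
}
{code}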
[jira] [Assigned] (SPARK-18169) Suppress warnings when dropping views on a dropped table
[ https://issues.apache.org/jira/browse/SPARK-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18169:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-18169) Suppress warnings when dropping views on a dropped table
[ https://issues.apache.org/jira/browse/SPARK-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18169:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-18169) Suppress warnings when dropping views on a dropped table
[ https://issues.apache.org/jira/browse/SPARK-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618882#comment-15618882 ]

Apache Spark commented on SPARK-18169:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15682
[jira] [Closed] (SPARK-15969) FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kun Liu closed SPARK-15969.
---------------------------
    Resolution: Done

Seems to be working now, so I am closing this JIRA.

> FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15969
>                 URL: https://issues.apache.org/jira/browse/SPARK-15969
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.5.0, 1.6.1
>         Environment: Mac OS X 10.11.5
>            Reporter: Kun Liu
>            Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> This is my first JIRA issue and I am new to the Spark community, so please correct me if I am wrong. Thanks.
>
> A java.io.FileNotFoundException occurs when multiple arguments are specified for the --py-files (also --jars) flag.
>
> I searched for a while but only found a similar issue on Windows: https://issues.apache.org/jira/browse/SPARK-6435
>
> My test environment was Mac OS X with Spark 1.5.0 and 1.6.1.
>
> 1.1 Observations:
> 1) Quoting the arguments makes no difference; the result is always the same.
> 2) The first path before the comma, as long as it is valid, is never a problem, whether absolute or relative.
> 3) The second and further --py-files paths are not a problem if ALL of them:
>    a. are relative paths under the working directory ($PWD); OR
>    b. start with an environment variable, e.g. $ENV_VAR/path/to/file; OR
>    c. are preprocessed by $(echo path/to/*.py | tr ' ' ','), whether absolute or relative, as long as they are valid.
> 4) The path of the driver program, assuming it is valid, does not matter, as it is a single file.
>
> 1.2 Experiments:
> Assume main.py calls functions from helper1.py and helper2.py, and all paths below are valid.
> ~/Desktop/testpath: main.py, helper1.py, helper2.py
> $SPARK_HOME/testpath: helper1.py, helper2.py
>
> 1) Successful output:
>    a. Multiple python paths are relative paths under the working directory:
>       cd $SPARK_HOME
>       bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py ~/Desktop/testpath/main.py
>       cd ~/Desktop
>       $SPARK_HOME/bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py testpath/main.py
>    b. Multiple python paths are specified using an environment variable:
>       export TEST_DIR=~/Desktop/testpath
>       cd ~
>       $SPARK_HOME/bin/spark-submit --py-files $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>       cd ~/Documents
>       $SPARK_HOME/bin/spark-submit --py-files $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>    c. Multiple paths (absolute or relative) after being preprocessed:
>       $SPARK_HOME/bin/spark-submit --py-files $(echo $SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py
>       cd ~/Desktop
>       $SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py
>       (reference link: http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option)
>
> 2) Failure output: the second python path is an absolute one; the same problem occurs for further paths:
>    cd ~/Documents
>    $SPARK_HOME/bin/spark-submit --py-files ~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py ~/Desktop/testpath/main.py
>    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
>    : java.io.FileNotFoundException: Added file file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist.
>
> 1.3 Conclusions
> I would suggest that the --py-files flag of spark-submit support all absolute path arguments, not just relative paths under the working directory. If necessary, I would like to submit a pull request and start working on it as my first contribution to the Spark community.
>
> 1.4 Note
> 1) I believe the same issue occurs when multiple comma-delimited jar files are passed to the --jars flag for Java applications.
> 2) I suggest wildcard path arguments should also be supported, as indicated by https://issues.apache.org/jira/browse/SPARK-3451
[jira] [Assigned] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution
[ https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18166:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution
[ https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18166:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution
[ https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619152#comment-15619152 ]

Apache Spark commented on SPARK-18166:
--------------------------------------

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15683
[jira] [Created] (SPARK-18170) Confusing error message when using rangeBetween without specifying an "orderBy"
Weiluo Ren created SPARK-18170:
----------------------------------

             Summary: Confusing error message when using rangeBetween without specifying an "orderBy"
                 Key: SPARK-18170
                 URL: https://issues.apache.org/jira/browse/SPARK-18170
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Weiluo Ren
            Priority: Minor

{code}
spark.range(1,3).select(sum('id) over Window.rangeBetween(0,1)).show
{code}

throws a runtime exception:

{code}
Non-Zero range offsets are not supported for windows with multiple order expressions.
{code}

This message is confusing here because there is no order expression at all. How about adding a check on {{orderSpec.isEmpty}} at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L141 and throwing an exception saying "no order expression is specified"?
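A sketch of the suggested guard, written here as a standalone function for illustration; the real change would live inside WindowExec's frame validation, and the names below are not from the actual patch:

{code}
// Illustrative sketch: distinguish "no ORDER BY at all" from "more than
// one ORDER BY expression" when validating a RANGE frame with non-zero
// offsets, so the error message matches the actual problem.
def checkRangeFrame(orderSpecSize: Int, lower: Long, upper: Long): Unit = {
  val hasNonZeroOffset = lower != 0L || upper != 0L
  require(!(hasNonZeroOffset && orderSpecSize == 0),
    "A range window frame with non-zero offsets requires a window ordering; " +
      "no order expression is specified")
  require(!(hasNonZeroOffset && orderSpecSize > 1),
    "Non-Zero range offsets are not supported for windows " +
      "with multiple order expressions")
}

// For the example above (rangeBetween(0, 1) with no orderBy), the call
// checkRangeFrame(orderSpecSize = 0, lower = 0L, upper = 1L)
// now fails with the clearer "no order expression is specified" message.
{code}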