[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-29 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617645#comment-15617645
 ] 

J.P Feng commented on SPARK-18107:
--

I have tested the performance before and after the patch 
[https://github.com/apache/spark/pull/15667], but it seems to improve only 
slightly: the job takes 531 seconds before patching and 518 seconds after 
patching.

I will add the execution logs to the work log later.

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find that an insert overwrite statement run in spark-sql or spark-shell takes 
> much more time than it does in hive-client (started from 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while 
> hive-client takes less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login where pt='mix_en' and 
> dt='2016-10-21' ;
> There are 257,128 rows in tbllog_login under 
> partition(pt='mix_en',dt='2016-10-21').
> PS:
> I'm sure it is the "insert overwrite" that costs most of the time in Spark; 
> perhaps the overwrite spends a lot of time on I/O or on something else.
> I also compared the execution time of the insert overwrite statement with that 
> of an insert into statement:
> 1. insert overwrite vs. insert into in Spark:
> the insert overwrite statement takes about 10 minutes
> the insert into statement takes about 30 seconds
> 2. insert into in Spark vs. insert into in hive-client:
> Spark takes about 30 seconds
> hive-client takes about 20 seconds
> the difference is small enough to ignore
>  






[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-29 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617654#comment-15617654
 ] 

J.P Feng commented on SPARK-18107:
--

Insert overwrite in Spark 2.0.0, without the patch:

scala> val befores =System.currentTimeMillis();spark.sql("insert overwrite 
table login4game partition(pt='mix_en',dt='2016-10-21')select distinct 
account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' 
as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
");val interval=System.currentTimeMillis()-befores;println(s"insertval is => 
${interval/1000} seconds")
3549.075: [GC [PSYoungGen: 435328K->77744K(566784K)] 
826575K->469112K(1232384K), 0.0463220 secs] [Times: user=0.27 sys=0.00, 
real=0.04 secs] 
3549.394: [GC [PSYoungGen: 511920K->26018K(566784K)] 
903288K->459991K(1232384K), 0.0804300 secs] [Times: user=0.44 sys=0.00, 
real=0.08 secs] 
[Stage 4:>(11 + 8) / 
72]3549.698: [GC [PSYoungGen: 463266K->55486K(493056K)] 
897239K->490323K(1158656K), 0.0359570 secs] [Times: user=0.17 sys=0.00, 
real=0.04 secs] 
[Stage 4:===> (19 + 8) / 
72]3549.929: [GC [PSYoungGen: 492734K->82976K(563712K)] 
927571K->522173K(1229312K), 0.0328120 secs] [Times: user=0.16 sys=0.01, 
real=0.03 secs] 
[Stage 4:===> (40 + 8) / 
72]3550.166: [GC [PSYoungGen: 520736K->29045K(564736K)] 
959933K->468617K(1230336K), 0.0244060 secs] [Times: user=0.12 sys=0.00, 
real=0.02 secs] 
[Stage 4:>(46 + 8) / 
72]3550.392: [GC [PSYoungGen: 466309K->41380K(567296K)] 
905881K->481176K(1232896K), 0.0320150 secs] [Times: user=0.16 sys=0.00, 
real=0.04 secs] 
[Stage 4:===> (55 + 8) / 
72]3550.603: [GC [PSYoungGen: 486820K->96197K(541696K)] 
926616K->536271K(1207296K), 0.0326490 secs] [Times: user=0.17 sys=0.00, 
real=0.03 secs] 
[Stage 4:=>   (63 + 8) / 
72]3550.868: [GC [PSYoungGen: 541637K->21696K(567296K)] 
981711K->462242K(1232896K), 0.0259070 secs] [Times: user=0.11 sys=0.00, 
real=0.03 secs] 
[Stage 5:>(0 + 0) / 
200]3551.127: [GC [PSYoungGen: 457912K->125270K(561664K)] 
898458K->565984K(1227264K), 0.0497940 secs] [Times: user=0.25 sys=0.00, 
real=0.05 secs] 
3551.328: [GC [PSYoungGen: 561494K->104705K(527872K)] 
1002208K->552355K(1193472K), 0.0489880 secs] [Times: user=0.28 sys=0.00, 
real=0.05 secs] 
3551.513: [GC [PSYoungGen: 494819K->94833K(485376K)] 
942469K->544947K(1150976K), 0.0472640 secs] [Times: user=0.26 sys=0.00, 
real=0.05 secs] 
[Stage 5:>   (17 + 8) / 
200]3551.701: [GC [PSYoungGen: 484977K->90004K(545792K)] 
935091K->544576K(1211392K), 0.0543700 secs] [Times: user=0.33 sys=0.00, 
real=0.06 secs] 
[Stage 5:===>(25 + 8) / 
200]3551.878: [GC [PSYoungGen: 480005K->96725K(543744K)] 
934833K->565661K(1209344K), 0.0475640 secs] [Times: user=0.24 sys=0.00, 
real=0.05 secs] 
[Stage 5:=>  (34 + 8) / 
200]3552.093: [GC [PSYoungGen: 486869K->86720K(539136K)] 
964767K->567333K(1204736K), 0.0383360 secs] [Times: user=0.23 sys=0.00, 
real=0.04 secs] 
[Stage 5:===>(40 + 8) / 
200]3552.351: [GC [PSYoungGen: 478363K->73147K(541696K)] 
958976K->556404K(1207296K), 0.0401180 secs] [Times: user=0.21 sys=0.00, 
real=0.04 secs] 
[Stage 5:=>  (48 + 8) / 
200]3552.519: [GC [PSYoungGen: 464827K->74781K(541184K)] 
948084K->560529K(1206784K), 0.0459060 secs] [Times: user=0.26 sys=0.00, 
real=0.05 secs] 
[Stage 5:>   (60 + 8) / 
200]3552.834: [GC [PSYoungGen: 480797K->68184K(543232K)] 
966545K->555417K(1208832K), 0.0403320 secs] [Times: user=0.21 sys=0.00, 
real=0.05 secs] 
[Stage 5:==> (67 + 8) / 
200]3553.031: [GC [PSYoungGen: 474200K->53137K(560640K)] 
961433K->541548K(1226240K), 0.0306190 secs] [Times: user=0.15 sys=0.00, 
real=0.03 secs] 
[Stage 5:=>  (77 + 8) / 
200]3553.219: [GC [PSYoungGen: 481681K->54172K(559616K)] 
970092K->543955K(1225216K), 0.0334520 secs] [Times: user=0.17 sys=0.00, 
real=0.03 secs] 
[Stage 5:>   (88 + 8) / 
200]3553.408: [GC [PSYoungGen: 482621K->61091K(564736K)] 
972404K->552016K(1230336K), 0.0398400 secs] [Times: user=0.20 sys=0.00, 
real=0.04 secs] 
[Stage 5:==> (96 + 8) / 
200]3553.596: [GC [PSYoungGen: 5

[jira] [Comment Edited] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-29 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617645#comment-15617645
 ] 

J.P Feng edited comment on SPARK-18107 at 10/29/16 7:42 AM:


I have tested the performance before and after the patch 
[https://github.com/apache/spark/pull/15667], but it seems to improve only 
slightly: the job takes 531 seconds before patching and 518 seconds after 
patching.

I will add the execution logs in a new comment later.


was (Author: snodawn):
I have tested the performance before and after the patch 
[https://github.com/apache/spark/pull/15667], but it seems to improve only 
slightly: the job takes 531 seconds before patching and 518 seconds after 
patching.

I will add the execution logs to the work log later.

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: J.P Feng
>
> I find that an insert overwrite statement run in spark-sql or spark-shell takes 
> much more time than it does in hive-client (started from 
> apache-hive-2.0.1-bin/bin/hive): Spark takes about ten minutes while 
> hive-client takes less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login where pt='mix_en' and 
> dt='2016-10-21' ;
> There are 257,128 rows in tbllog_login under 
> partition(pt='mix_en',dt='2016-10-21').
> PS:
> I'm sure it is the "insert overwrite" that costs most of the time in Spark; 
> perhaps the overwrite spends a lot of time on I/O or on something else.
> I also compared the execution time of the insert overwrite statement with that 
> of an insert into statement:
> 1. insert overwrite vs. insert into in Spark:
> the insert overwrite statement takes about 10 minutes
> the insert into statement takes about 30 seconds
> 2. insert into in Spark vs. insert into in hive-client:
> Spark takes about 30 seconds
> hive-client takes about 20 seconds
> the difference is small enough to ignore
>  






[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2016-10-29 Thread J.P Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617663#comment-15617663
 ] 

J.P Feng commented on SPARK-18107:
--

Insert overwrite in Spark 2.1.0, with the patch:

scala> val befores =System.currentTimeMillis();spark.sql("insert overwrite 
table login4game partition(pt='mix_en',dt='2016-10-21')select distinct 
account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' 
as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and  dt='2016-10-21' 
");val interval=System.currentTimeMillis()-befores;println(s"insertval is => 
${interval/1000} seconds")
139.850: [GC [PSYoungGen: 656889K->42484K(656896K)] 789101K->206163K(1343488K), 
0.0704840 secs] [Times: user=0.33 sys=0.09, real=0.07 secs] 
16/10/29 14:24:12 WARN HiveConf: HiveConf of name hive.server2.auth.hadoop does 
not exist
[Stage 0:> (0 + 8) / 
72]144.774: [GC [PSYoungGen: 656884K->42494K(465408K)] 
820563K->323651K(1152000K), 0.0995720 secs] [Times: user=0.47 sys=0.09, 
real=0.10 secs] 
[Stage 0:==>  (13 + 8) / 
72]145.697: [GC [PSYoungGen: 465406K->42661K(465920K)] 
746563K->359222K(1152512K), 0.1012790 secs] [Times: user=0.54 sys=0.05, 
real=0.10 secs] 
[Stage 0:==>  (23 + 8) / 
72]146.232: [GC [PSYoungGen: 465573K->20917K(547840K)] 
782134K->340686K(1234432K), 0.0262040 secs] [Times: user=0.12 sys=0.00, 
real=0.03 secs] 
[Stage 0:===> (40 + 8) / 
72]146.725: [GC [PSYoungGen: 427445K->8807K(415744K)] 
747214K->332722K(1102336K), 0.0243020 secs] [Times: user=0.08 sys=0.01, 
real=0.02 secs] 
[Stage 0:=>   (53 + 8) / 
72]147.178: [GC [PSYoungGen: 415335K->9865K(545792K)] 
739250K->334857K(1232384K), 0.0186080 secs] [Times: user=0.08 sys=0.00, 
real=0.02 secs] 
[Stage 1:>(0 + 8) / 
200]148.353: [GC [PSYoungGen: 404916K->38320K(433664K)] 
729907K->363760K(1120256K), 0.0260660 secs] [Times: user=0.10 sys=0.00, 
real=0.03 secs] 
[Stage 1:==>  (8 + 8) / 
200]149.077: [GC [PSYoungGen: 433417K->47237K(550912K)] 
758856K->376034K(1237504K), 0.0219810 secs] [Times: user=0.09 sys=0.00, 
real=0.02 secs] 
[Stage 1:>   (17 + 8) / 
200]149.800: [GC [PSYoungGen: 450693K->17621K(544256K)] 
779490K->349370K(1230848K), 0.0217030 secs] [Times: user=0.11 sys=0.00, 
real=0.02 secs] 
[Stage 1:=>  (34 + 8) / 
200]150.319: [GC [PSYoungGen: 420875K->25444K(553984K)] 
752624K->359975K(1240576K), 0.0196560 secs] [Times: user=0.11 sys=0.01, 
real=0.02 secs] 
[Stage 1:==> (38 + 9) / 
200]150.590: [GC [PSYoungGen: 440156K->51351K(550400K)] 
774687K->388898K(1236992K), 0.0216630 secs] [Times: user=0.10 sys=0.01, 
real=0.02 secs] 
[Stage 1:>   (58 + 8) / 
200]150.883: [GC [PSYoungGen: 466010K->13379K(553472K)] 
803557K->353656K(1240064K), 0.0202700 secs] [Times: user=0.09 sys=0.01, 
real=0.02 secs] 
[Stage 1:==> (67 + 8) / 
200]151.106: [GC [PSYoungGen: 439211K->53013K(551936K)] 
779488K->395761K(1238528K), 0.0248140 secs] [Times: user=0.11 sys=0.01, 
real=0.02 secs] 
[Stage 1:=>  (75 + 8) / 
200]151.401: [GC [PSYoungGen: 478804K->30308K(569344K)] 
821551K->375786K(1255936K), 0.0210760 secs] [Times: user=0.08 sys=0.00, 
real=0.02 secs] 
[Stage 1:===>(83 + 8) / 
200]151.621: [GC [PSYoungGen: 480205K->55779K(566272K)] 
825682K->404149K(1252864K), 0.0271460 secs] [Times: user=0.12 sys=0.00, 
real=0.03 secs] 
[Stage 1:===>(99 + 8) / 
200]151.966: [GC [PSYoungGen: 505776K->44022K(577536K)] 
854146K->395148K(1264128K), 0.0235970 secs] [Times: user=0.11 sys=0.00, 
real=0.02 secs] 
[Stage 1:==>(112 + 8) / 
200]152.211: [GC [PSYoungGen: 513424K->22242K(576512K)] 
864550K->376234K(1263104K), 0.0188640 secs] [Times: user=0.07 sys=0.01, 
real=0.02 secs] 
[Stage 1:=> (123 + 8) / 
200]152.420: [GC [PSYoungGen: 491713K->72455K(584704K)] 
845705K->427875K(1271296K), 0.0251320 secs] [Times: user=0.10 sys=0.00, 
real=0.02 secs] 
[Stage 1:>  (132 + 8) / 
200]152.672: [GC [PSYoungGen: 552126K->35637K(589312K)] 
907546K->393850K(1275904K), 0.0223350 secs] [Times: user=0.09 sys=0.01, 
real=0.02 secs] 
[Stage 1:===

[jira] [Commented] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution

2016-10-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617995#comment-15617995
 ] 

Sean Owen commented on SPARK-18166:
---

Agree, feel free to open a PR to fix that.

> GeneralizedLinearRegression Wrong Value Range for Poisson Distribution  
> 
>
> Key: SPARK-18166
> URL: https://issues.apache.org/jira/browse/SPARK-18166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Wayne Zhang
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The current implementation of Poisson GLM seems to allow only positive values 
> (see below). This is not correct since the support of Poisson includes the 
> origin. 
> override def initialize(y: Double, weight: Double): Double = {
>   require(y > 0.0, "The response variable of Poisson family " +
>     s"should be positive, but got $y")
>   y
> }
> The fix is easy, just change it to 
>   require(y >= 0.0, "The response variable of Poisson family " +
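For reference, a minimal sketch of the corrected initializer under that change; 
this is an illustration only, not the exact upstream patch:

{code}
// Sketch: Poisson support includes 0, so only require a non-negative response.
override def initialize(y: Double, weight: Double): Double = {
  require(y >= 0.0, "The response variable of Poisson family " +
    s"should be non-negative, but got $y")
  y
}
{code}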






[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-10-29 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618022#comment-15618022
 ] 

Shuai Lin commented on SPARK-4563:
--

To do that I think we need to add two extra options: 
{{spark.driver.advertisePort}} and {{spark.driver.blockManager.advertisePort}}, 
and pass them to the executors (instead of {{spark.driver.port}} and 
{{spark.driver.blockManager.port}}) when present.
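A hypothetical sketch of how that could look; the advertisePort keys below are 
only the proposal from this comment, not existing Spark configuration:

{code}
import org.apache.spark.SparkConf

// Hypothetical config: the *.advertisePort keys are proposed, not implemented.
val conf = new SparkConf()
  .set("spark.driver.port", "7078")                        // port the driver binds to
  .set("spark.driver.advertisePort", "17078")              // proposed: port told to executors
  .set("spark.driver.blockManager.port", "7079")
  .set("spark.driver.blockManager.advertisePort", "17079") // proposed counterpart
{code}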

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver's bind IP and advertised IP are not separately configurable: 
> spark.driver.host only sets the bind IP, and SPARK_PUBLIC_DNS does not work 
> for the Spark driver. Allow an option to set the advertised IP/hostname.






[jira] [Issue Comment Deleted] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-29 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-15616:
-
Comment: was deleted

(was: I have updated the code and fixed the problem that you have pointed out. 
Thanks. I think you can try again.)

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether we can convert the operation to a broadcast join. 
> If a filter can prune some partitions, Hive prunes them first and then decides 
> whether to use a broadcast join based on the HDFS size of the partitions 
> involved in the query. Spark SQL needs the same behavior, which would improve 
> join performance for partitioned tables.
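A hedged illustration of the scenario (the table names are hypothetical, and 
this is not code from the associated PR): after partition pruning, the planner 
would ideally compare the pruned partitions' HDFS size, rather than the whole 
table's Metastore size, against the broadcast threshold:

{code}
// Hypothetical tables: fact_events (large, partitioned by dt), dim_users (small).
spark.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")  // 10 MB

val q = spark.sql(
  """SELECT f.user_id, d.name, f.event
    |FROM fact_events f JOIN dim_users d ON f.user_id = d.user_id
    |WHERE f.dt = '2016-10-21'""".stripMargin)  // filter prunes to one partition

q.explain()  // inspect whether a broadcast join was chosen
{code}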






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-29 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618475#comment-15618475
 ] 

Lianhui Wang commented on SPARK-15616:
--

I have updated the code and fixed the problem that you have pointed out. 
Thanks. I think you can try again.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether we can convert the operation to a broadcast join. 
> If a filter can prune some partitions, Hive prunes them first and then decides 
> whether to use a broadcast join based on the HDFS size of the partitions 
> involved in the query. Spark SQL needs the same behavior, which would improve 
> join performance for partitioned tables.






[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-29 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618476#comment-15618476
 ] 

Lianhui Wang commented on SPARK-15616:
--

I have updated the code and fixed the problem that you have pointed out. 
Thanks. I think you can try again.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether we can convert the operation to a broadcast join. 
> If a filter can prune some partitions, Hive prunes them first and then decides 
> whether to use a broadcast join based on the HDFS size of the partitions 
> involved in the query. Spark SQL needs the same behavior, which would improve 
> join performance for partitioned tables.






[jira] [Commented] (SPARK-14900) spark.ml classification metrics should include accuracy

2016-10-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618637#comment-15618637
 ] 

Nicholas Chammas commented on SPARK-14900:
--

I don't know if this belongs in a separate issue, or if it was intended to be 
addressed as part of this work, but I can't find {{accuracy}} when I look at 
the methods and attributes available on 
{{pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary}}.

These are the attributes and methods I see in 2.0.1:

{code}
 'areaUnderROC',
 'fMeasureByThreshold',
 'featuresCol',
 'labelCol',
 'objectiveHistory',
 'pr',
 'precisionByThreshold',
 'predictions',
 'probabilityCol',
 'recallByThreshold',
 'roc',
 'totalIterations'
{code}

Was this an oversight, or am I looking in the wrong place?

> spark.ml classification metrics should include accuracy
> ---
>
> Key: SPARK-14900
> URL: https://issues.apache.org/jira/browse/SPARK-14900
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.0.0
>
>
> To compute "accuracy" (0/1 classification accuracy), users can use 
> {{precision}} in MulticlassMetrics and 
> MulticlassClassificationEvaluator.metricName.  We should also support 
> "accuracy" directly as an alias to help users familiar with that name.






[jira] [Commented] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-10-29 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618686#comment-15618686
 ] 

Michael Allman commented on SPARK-17990:


Has a decision been made on how we want to handle this? I just tried this 
recipe again with the latest build from master and got the same behavior.

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_p

[jira] [Created] (SPARK-18169) Suppress warnings when dropping views on a dropped table

2016-10-29 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-18169:
-

 Summary: Suppress warnings when dropping views on a dropped table
 Key: SPARK-18169
 URL: https://issues.apache.org/jira/browse/SPARK-18169
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Dongjoon Hyun
Priority: Minor


Apache Spark 2.0.0 ~ 2.0.2-rc1 shows an inconsistent *AnalysisException* 
warning message when dropping a *view* on a dropped table. This does not happen 
on dropping *temporary views*. Also, Spark 1.6.x does not show warnings. We had 
better suppress this to be more consistent in Spark 2.x and with Spark 1.6.x.

{code}
scala> sql("create table t(a int)")

scala> sql("create view v as select * from t")

scala> sql("create temporary view tv as select * from t")

scala> sql("drop table t")

scala> sql("drop view tv")

scala> sql("drop view v")
16/10/29 15:50:03 WARN DropTableCommand: 
org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`t`; 
line 1 pos 91;
'SubqueryAlias v, `default`.`v`
+- 'Project ['gen_attr_0 AS a#19]
   +- 'SubqueryAlias t
  +- 'Project ['gen_attr_0]
 +- 'SubqueryAlias gen_subquery_0
+- 'Project ['a AS gen_attr_0#18]
   +- 'UnresolvedRelation `default`.`t`

org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`t`; 
line 1 pos 91;
'SubqueryAlias v, `default`.`v`
+- 'Project ['gen_attr_0 AS a#19]
   +- 'SubqueryAlias t
  +- 'Project ['gen_attr_0]
 +- 'SubqueryAlias gen_subquery_0
+- 'Project ['a AS gen_attr_0#18]
   +- 'UnresolvedRelation `default`.`t`
...
res5: org.apache.spark.sql.DataFrame = []
{code}

Note that this is a different case from dropping a non-existent view; for a 
non-existent view, Spark raises NoSuchTableException.






[jira] [Assigned] (SPARK-18169) Suppress warnings when dropping views on a dropped table

2016-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18169:


Assignee: Apache Spark

> Suppress warnings when dropping views on a dropped table
> 
>
> Key: SPARK-18169
> URL: https://issues.apache.org/jira/browse/SPARK-18169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Apache Spark 2.0.0 ~ 2.0.2-rc1 shows an inconsistent *AnalysisException* 
> warning message when dropping a *view* on a dropped table. This does not 
> happen on dropping *temporary views*. Also, Spark 1.6.x does not show 
> warnings. We had better suppress this to be more consistent in Spark 2.x and 
> with Spark 1.6.x.
> {code}
> scala> sql("create table t(a int)")
> scala> sql("create view v as select * from t")
> scala> sql("create temporary view tv as select * from t")
> scala> sql("drop table t")
> scala> sql("drop view tv")
> scala> sql("drop view v")
> 16/10/29 15:50:03 WARN DropTableCommand: 
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `default`.`t`; line 1 pos 91;
> 'SubqueryAlias v, `default`.`v`
> +- 'Project ['gen_attr_0 AS a#19]
>+- 'SubqueryAlias t
>   +- 'Project ['gen_attr_0]
>  +- 'SubqueryAlias gen_subquery_0
> +- 'Project ['a AS gen_attr_0#18]
>+- 'UnresolvedRelation `default`.`t`
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `default`.`t`; line 1 pos 91;
> 'SubqueryAlias v, `default`.`v`
> +- 'Project ['gen_attr_0 AS a#19]
>+- 'SubqueryAlias t
>   +- 'Project ['gen_attr_0]
>  +- 'SubqueryAlias gen_subquery_0
> +- 'Project ['a AS gen_attr_0#18]
>+- 'UnresolvedRelation `default`.`t`
> ...
> res5: org.apache.spark.sql.DataFrame = []
> {code}
> Note that this is a different case from dropping a non-existent view; for a 
> non-existent view, Spark raises NoSuchTableException.






[jira] [Assigned] (SPARK-18169) Suppress warnings when dropping views on a dropped table

2016-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18169:


Assignee: (was: Apache Spark)

> Suppress warnings when dropping views on a dropped table
> 
>
> Key: SPARK-18169
> URL: https://issues.apache.org/jira/browse/SPARK-18169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Apache Spark 2.0.0 ~ 2.0.2-rc1 shows an inconsistent *AnalysisException* 
> warning message when dropping a *view* on a dropped table. This does not 
> happen on dropping *temporary views*. Also, Spark 1.6.x does not show 
> warnings. We had better suppress this to be more consistent in Spark 2.x and 
> with Spark 1.6.x.
> {code}
> scala> sql("create table t(a int)")
> scala> sql("create view v as select * from t")
> scala> sql("create temporary view tv as select * from t")
> scala> sql("drop table t")
> scala> sql("drop view tv")
> scala> sql("drop view v")
> 16/10/29 15:50:03 WARN DropTableCommand: 
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `default`.`t`; line 1 pos 91;
> 'SubqueryAlias v, `default`.`v`
> +- 'Project ['gen_attr_0 AS a#19]
>+- 'SubqueryAlias t
>   +- 'Project ['gen_attr_0]
>  +- 'SubqueryAlias gen_subquery_0
> +- 'Project ['a AS gen_attr_0#18]
>+- 'UnresolvedRelation `default`.`t`
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `default`.`t`; line 1 pos 91;
> 'SubqueryAlias v, `default`.`v`
> +- 'Project ['gen_attr_0 AS a#19]
>+- 'SubqueryAlias t
>   +- 'Project ['gen_attr_0]
>  +- 'SubqueryAlias gen_subquery_0
> +- 'Project ['a AS gen_attr_0#18]
>+- 'UnresolvedRelation `default`.`t`
> ...
> res5: org.apache.spark.sql.DataFrame = []
> {code}
> Note that this is a different case from dropping a non-existent view; for a 
> non-existent view, Spark raises NoSuchTableException.






[jira] [Commented] (SPARK-18169) Suppress warnings when dropping views on a dropped table

2016-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15618882#comment-15618882
 ] 

Apache Spark commented on SPARK-18169:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15682

> Suppress warnings when dropping views on a dropped table
> 
>
> Key: SPARK-18169
> URL: https://issues.apache.org/jira/browse/SPARK-18169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Apache Spark 2.0.0 ~ 2.0.2-rc1 shows an inconsistent *AnalysisException* 
> warning message when dropping a *view* on a dropped table. This does not 
> happen on dropping *temporary views*. Also, Spark 1.6.x does not show 
> warnings. We had better suppress this to be more consistent in Spark 2.x and 
> with Spark 1.6.x.
> {code}
> scala> sql("create table t(a int)")
> scala> sql("create view v as select * from t")
> scala> sql("create temporary view tv as select * from t")
> scala> sql("drop table t")
> scala> sql("drop view tv")
> scala> sql("drop view v")
> 16/10/29 15:50:03 WARN DropTableCommand: 
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `default`.`t`; line 1 pos 91;
> 'SubqueryAlias v, `default`.`v`
> +- 'Project ['gen_attr_0 AS a#19]
>+- 'SubqueryAlias t
>   +- 'Project ['gen_attr_0]
>  +- 'SubqueryAlias gen_subquery_0
> +- 'Project ['a AS gen_attr_0#18]
>+- 'UnresolvedRelation `default`.`t`
> org.apache.spark.sql.AnalysisException: Table or view not found: 
> `default`.`t`; line 1 pos 91;
> 'SubqueryAlias v, `default`.`v`
> +- 'Project ['gen_attr_0 AS a#19]
>+- 'SubqueryAlias t
>   +- 'Project ['gen_attr_0]
>  +- 'SubqueryAlias gen_subquery_0
> +- 'Project ['a AS gen_attr_0#18]
>+- 'UnresolvedRelation `default`.`t`
> ...
> res5: org.apache.spark.sql.DataFrame = []
> {code}
> Note that this is a different case from dropping a non-existent view; for a 
> non-existent view, Spark raises NoSuchTableException.






[jira] [Closed] (SPARK-15969) FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit

2016-10-29 Thread Kun Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kun Liu closed SPARK-15969.
---
Resolution: Done

It seems to be working, so I am closing this JIRA.

> FileNotFoundException: Multiple arguments for py-files flag, (also jars) for 
> spark-submit
> -
>
> Key: SPARK-15969
> URL: https://issues.apache.org/jira/browse/SPARK-15969
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.1
> Environment: Mac OS X 10.11.5
>Reporter: Kun Liu
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> This is my first JIRA issue and I am new to the Spark community, so please 
> correct me if I am wrong. Thanks.
> A java.io.FileNotFoundException occurs when multiple arguments are specified 
> for the --py-files (and also --jars) flag.
> I searched for a while but only found a similar issue on Windows OS: 
> https://issues.apache.org/jira/browse/SPARK-6435
> My experiments environment was Mac OS X and Spark version 1.5.0 and 1.6.1
> 1.1 Observations:
> 1) Quoting the arguments makes no difference; the result is always the same.
> 2) The first path before the comma, as long as it is valid, is not a problem, 
> whether it is an absolute or a relative path.
> 3) The second and further py-files paths are not a problem if ALL of them:
>   a. are relative paths under the working directory ($PWD); OR
>   b. are specified using an environment variable at the beginning, e.g. 
> $ENV_VAR/path/to/file; OR
>   c. are preprocessed by $(echo path/to/*.py | tr ' ' ','), whether 
> absolute or relative, as long as they are valid.
> 4) The path of the driver program, assuming it is valid, does not matter, as 
> it is a single file.
> 1.2 Experiments:
> Assuming main.py calls functions from helper1.py and helper2.py, and all 
> paths below are valid.
> ~/Desktop/testpath: main.py, helper1.py, helper2.py
> $SPARK_HOME/testpath: helper1.py, helper2.py
> 1) Successful output:
>   a. Multiple python paths are relative paths under the same directory as 
> the working directory
>   cd $SPARK_HOME
>   bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py 
> ~/Desktop/testpath/main.py
>   cd ~/Desktop
>   $SPARK_HOME/bin/spark-submit --py-files 
> testpath/helper1.py,testpath/helper2.py testpath/main.py
>   b. Multiple python paths are specified by using environment variable
>   export TEST_DIR=~/Desktop/testpath
>   cd ~
>   $SPARK_HOME/bin/spark-submit --py-files 
> $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>   
>   cd ~/Documents
>   $SPARK_HOME/bin/spark-submit --py-files 
> $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>   c. Multiple paths (absolute or relative) after being preprocessed:
>   $SPARK_HOME/bin/spark-submit --py-files $(echo 
> $SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py 
>   cd ~/Desktop
>   $SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr 
> ' ' ',') ~/Desktop/testpath/main.py 
>   (reference link: 
> http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option)
> 2) Failure output: if the second Python path is an absolute one (the same 
> problem happens for further paths):
>   cd ~/Documents
>   $SPARK_HOME/bin/spark-submit --py-files 
> ~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py 
> ~/Desktop/testpath/main.py 
>   py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
>   : java.io.FileNotFoundException: Added file 
> file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist.
> 1.3 Conclusions
> I suggest that the --py-files flag of spark-submit support absolute path 
> arguments as well, not just relative paths under the working directory.
> If necessary, I would like to submit a pull request and start working on it 
> as my first contribution to the Spark community.
> 1.4 Note
> 1) I think the same issue happens when multiple comma-delimited jar files are 
> passed to the --jars flag for Java applications.
> 2) I suggest that wildcard path arguments also be supported, as indicated 
> by https://issues.apache.org/jira/browse/SPARK-3451






[jira] [Assigned] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution

2016-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18166:


Assignee: Apache Spark

> GeneralizedLinearRegression Wrong Value Range for Poisson Distribution  
> 
>
> Key: SPARK-18166
> URL: https://issues.apache.org/jira/browse/SPARK-18166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The current implementation of Poisson GLM seems to allow only positive values 
> (see below). This is not correct since the support of Poisson includes the 
> origin. 
> override def initialize(y: Double, weight: Double): Double = {
>   require(y > 0.0, "The response variable of Poisson family " +
>     s"should be positive, but got $y")
>   y
> }
> The fix is easy, just change it to 
>   require(y >= 0.0, "The response variable of Poisson family " +






[jira] [Assigned] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution

2016-10-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18166:


Assignee: (was: Apache Spark)

> GeneralizedLinearRegression Wrong Value Range for Poisson Distribution  
> 
>
> Key: SPARK-18166
> URL: https://issues.apache.org/jira/browse/SPARK-18166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Wayne Zhang
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The current implementation of Poisson GLM seems to allow only positive values 
> (see below). This is not correct since the support of Poisson includes the 
> origin. 
> override def initialize(y: Double, weight: Double): Double = {
>   require(y > 0.0, "The response variable of Poisson family " +
>     s"should be positive, but got $y")
>   y
> }
> The fix is easy, just change it to 
>   require(y >= 0.0, "The response variable of Poisson family " +






[jira] [Commented] (SPARK-18166) GeneralizedLinearRegression Wrong Value Range for Poisson Distribution

2016-10-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619152#comment-15619152
 ] 

Apache Spark commented on SPARK-18166:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15683

> GeneralizedLinearRegression Wrong Value Range for Poisson Distribution  
> 
>
> Key: SPARK-18166
> URL: https://issues.apache.org/jira/browse/SPARK-18166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Wayne Zhang
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The current implementation of Poisson GLM seems to allow only positive values 
> (see below). This is not correct since the support of Poisson includes the 
> origin. 
> override def initialize(y: Double, weight: Double): Double = {
>   require(y > 0.0, "The response variable of Poisson family " +
>     s"should be positive, but got $y")
>   y
> }
> The fix is easy, just change it to 
>   require(y >= 0.0, "The response variable of Poisson family " +






[jira] [Created] (SPARK-18170) Confusing error message when using rangeBetween without specifying an "orderBy"

2016-10-29 Thread Weiluo Ren (JIRA)
Weiluo Ren created SPARK-18170:
--

 Summary: Confusing error message when using rangeBetween without 
specifying an "orderBy"
 Key: SPARK-18170
 URL: https://issues.apache.org/jira/browse/SPARK-18170
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Weiluo Ren
Priority: Minor


{code}
spark.range(1,3).select(sum('id) over Window.rangeBetween(0,1)).show
{code}
throws a runtime exception:
{code}
Non-Zero range offsets are not supported for windows with multiple order 
expressions.
{code}
which is confusing in this case because we don't have any order expression here.

How about adding a check on
{code}
orderSpec.isEmpty
{code}
at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L141
and throwing an exception saying "no order expression is specified"?
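A rough sketch of such a guard (not the actual WindowExec code), assuming 
orderSpec holds the window's ORDER BY expressions:

{code}
// Sketch: fail fast with a clearer message when a non-zero range frame
// is used without any order expression.
require(orderSpec.nonEmpty,
  "A range window frame with non-zero offsets requires an order expression " +
    "(e.g. Window.orderBy(...)), but none was specified.")
{code}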



