[jira] [Resolved] (SPARK-29758) json_tuple truncates fields

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29758.
-
Fix Version/s: 2.4.5
   Resolution: Fixed

Issue resolved by pull request 26563
[https://github.com/apache/spark/pull/26563]

> json_tuple truncates fields
> ---
>
> Key: SPARK-29758
> URL: https://issues.apache.org/jira/browse/SPARK-29758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
> Environment: EMR 5.15.0 (Spark 2.3.0) And MacBook Pro (Mojave 
> 10.14.3, Spark 2.4.4)
> Jdk 8, Scala 2.11.12
>Reporter: Stanislav
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.5
>
>
> `json_tuple` has inconsistent behaviour with `from_json` - but only if the JSON 
> string is longer than roughly 2700 characters.
> This can be reproduced in spark-shell and on a cluster, but not in scalatest, 
> for some reason.
> {code}
> import org.apache.spark.sql.functions.{from_json, json_tuple}
> import org.apache.spark.sql.types._
> val counterstring = 
> "*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*63*66*69*72*75*78*81*84*87*90*93*96*99*103*107*111*115*119*123*127*131*135*139*143*147*151*155*159*163*167*171*175*179*183*187*191*195*199*203*207*211*215*219*223*227*231*235*239*243*247*251*255*259*263*267*271*275*279*283*287*291*295*299*303*307*311*315*319*323*327*331*335*339*343*347*351*355*359*363*367*371*375*379*383*387*391*395*399*403*407*411*415*419*423*427*431*435*439*443*447*451*455*459*463*467*471*475*479*483*487*491*495*499*503*507*511*515*519*523*527*531*535*539*543*547*551*555*559*563*567*571*575*579*583*587*591*595*599*603*607*611*615*619*623*627*631*635*639*643*647*651*655*659*663*667*671*675*679*683*687*691*695*699*703*707*711*715*719*723*727*731*735*739*743*747*751*755*759*763*767*771*775*779*783*787*791*795*799*803*807*811*815*819*823*827*831*835*839*843*847*851*855*859*863*867*871*875*879*883*887*891*895*899*903*907*911*915*919*923*927*931*935*939*943*947*951*955*959*963*967*971*975*979*983*987*991*995*1000*1005*1010*1015*1020*1025*1030*1035*1040*1045*1050*1055*1060*1065*1070*1075*1080*1085*1090*1095*1100*1105*1110*1115*1120*1125*1130*1135*1140*1145*1150*1155*1160*1165*1170*1175*1180*1185*1190*1195*1200*1205*1210*1215*1220*1225*1230*1235*1240*1245*1250*1255*1260*1265*1270*1275*1280*1285*1290*1295*1300*1305*1310*1315*1320*1325*1330*1335*1340*1345*1350*1355*1360*1365*1370*1375*1380*1385*1390*1395*1400*1405*1410*1415*1420*1425*1430*1435*1440*1445*1450*1455*1460*1465*1470*1475*1480*1485*1490*1495*1500*1505*1510*1515*1520*1525*1530*1535*1540*1545*1550*1555*1560*1565*1570*1575*1580*1585*1590*1595*1600*1605*1610*1615*1620*1625*1630*1635*1640*1645*1650*1655*1660*1665*1670*1675*1680*1685*1690*1695*1700*1705*1710*1715*1720*1725*1730*1735*1740*1745*1750*1755*1760*1765*1770*1775*1780*1785*1790*1795*1800*1805*1810*1815*1820*1825*1830*1835*1840*1845*1850*1855*1860*1865*1870*1875*1880*1885*1890*1895*1900*1905*1910*1915*1920*1925*1930*1935*1940*1945*1950*1955*1960*1965*1970*1975*1980*1985*1990*1995*2000*2005*2010*2015*2020*2025*2030*2035*2040*2045*2050*2055*2060*2065*2070*2075*2080*2085*2090*2095*2100*2105*2110*2115*2120*2125*2130*2135*2140*2145*2150*2155*2160*2165*2170*2175*2180*2185*2190*2195*2200*2205*2210*2215*2220*2225*2230*2235*2240*2245*2250*2255*2260*2265*2270*2275*2280*2285*2290*2295*2300*2305*2310*2315*2320*2325*2330*2335*2340*2345*2350*2355*2360*2365*2370*2375*2380*2385*2390*2395*2400*2405*2410*2415*2420*2425*2430*2435*2440*2445*2450*2455*2460*2465*2470*2475*2480*2485*2490*2495*2500*2505*2510*2515*2520*2525*2530*2535*2540*2545*2550*2555*2560*2565*2570*2575*2580*2585*2590*2595*2600*2605*2610*2615*2620*2625*2630*2635*2640*2645*2650*2655*2660*2665*2670*2675*2680*2685*2690*2695*2700*2705*2710*2715*2720*2725*2730*2735*2740*2745*2750*2755*2760*2765*2770*2775*2780*2785*2790*2795*2800*"
> val json_tuple_result = Seq(s"""{"test":"$counterstring"}""").toDF("json")
>   .withColumn("result", json_tuple('json, "test"))
>   .select('result)
>   .as[String].head.length
> val from_json_result = Seq(s"""{"test":"$counterstring"}""").toDF("json")
>   .withColumn("parsed", from_json('json, StructType(Seq(StructField("test", 
> StringType)))))
>   .withColumn("result", $"parsed.test")
>   .select('result)
>   .as[String].head.length
> scala> json_tuple_result
> res62: Int = 2791
> scala> from_json_result
> res63: Int = 2800
> {code}
> The result is influenced by the total length of the JSON string at the moment of 
> parsing:
> {code}
> val json_tuple_result_with_prefix = Seq(s"""{"prefix": "dummy", 
> "test":"$counterstring"}""").toDF("json")
>   .withColumn("result", json_tuple('json, "test"))
>   .select('result)
>   .as[String].head.length
> scala> json_tuple_result_with_prefix
> res64: Int = 2772
> {code}
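> For reference, the counterstring above only makes the cut-off offset easy to read; 
> below is a minimal sketch (an assumption, not part of the original report: any 
> sufficiently long value appears to trigger the same truncation) that can be run in 
> spark-shell:
> {code}
> import org.apache.spark.sql.functions.json_tuple
> 
> // Build a 2800-character value without the hand-written counterstring.
> val longValue = "x" * 2800
> val len = Seq(s"""{"test":"$longValue"}""").toDF("json")
>   .select(json_tuple('json, "test"))
>   .as[String].head.length
> // Expected: 2800; the report above shows json_tuple returning fewer characters.
> {code}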



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (SPARK-29758) json_tuple truncates fields

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29758:
---

Assignee: Maxim Gekk

> json_tuple truncates fields
> ---
>
> Key: SPARK-29758
> URL: https://issues.apache.org/jira/browse/SPARK-29758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4
> Environment: EMR 5.15.0 (Spark 2.3.0) And MacBook Pro (Mojave 
> 10.14.3, Spark 2.4.4)
> Jdk 8, Scala 2.11.12
>Reporter: Stanislav
>Assignee: Maxim Gekk
>Priority: Major
>
> `json_tuple` has inconsistent behaviour with `from_json` - but only if the JSON 
> string is longer than roughly 2700 characters.
> This can be reproduced in spark-shell and on a cluster, but not in scalatest, 
> for some reason.
> {code}
> import org.apache.spark.sql.functions.{from_json, json_tuple}
> import org.apache.spark.sql.types._
> val counterstring = 
> "*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*63*66*69*72*75*78*81*84*87*90*93*96*99*103*107*111*115*119*123*127*131*135*139*143*147*151*155*159*163*167*171*175*179*183*187*191*195*199*203*207*211*215*219*223*227*231*235*239*243*247*251*255*259*263*267*271*275*279*283*287*291*295*299*303*307*311*315*319*323*327*331*335*339*343*347*351*355*359*363*367*371*375*379*383*387*391*395*399*403*407*411*415*419*423*427*431*435*439*443*447*451*455*459*463*467*471*475*479*483*487*491*495*499*503*507*511*515*519*523*527*531*535*539*543*547*551*555*559*563*567*571*575*579*583*587*591*595*599*603*607*611*615*619*623*627*631*635*639*643*647*651*655*659*663*667*671*675*679*683*687*691*695*699*703*707*711*715*719*723*727*731*735*739*743*747*751*755*759*763*767*771*775*779*783*787*791*795*799*803*807*811*815*819*823*827*831*835*839*843*847*851*855*859*863*867*871*875*879*883*887*891*895*899*903*907*911*915*919*923*927*931*935*939*943*947*951*955*959*963*967*971*975*979*983*987*991*995*1000*1005*1010*1015*1020*1025*1030*1035*1040*1045*1050*1055*1060*1065*1070*1075*1080*1085*1090*1095*1100*1105*1110*1115*1120*1125*1130*1135*1140*1145*1150*1155*1160*1165*1170*1175*1180*1185*1190*1195*1200*1205*1210*1215*1220*1225*1230*1235*1240*1245*1250*1255*1260*1265*1270*1275*1280*1285*1290*1295*1300*1305*1310*1315*1320*1325*1330*1335*1340*1345*1350*1355*1360*1365*1370*1375*1380*1385*1390*1395*1400*1405*1410*1415*1420*1425*1430*1435*1440*1445*1450*1455*1460*1465*1470*1475*1480*1485*1490*1495*1500*1505*1510*1515*1520*1525*1530*1535*1540*1545*1550*1555*1560*1565*1570*1575*1580*1585*1590*1595*1600*1605*1610*1615*1620*1625*1630*1635*1640*1645*1650*1655*1660*1665*1670*1675*1680*1685*1690*1695*1700*1705*1710*1715*1720*1725*1730*1735*1740*1745*1750*1755*1760*1765*1770*1775*1780*1785*1790*1795*1800*1805*1810*1815*1820*1825*1830*1835*1840*1845*1850*1855*1860*1865*1870*1875*1880*1885*1890*1895*1900*1905*1910*1915*1920*1925*1930*1935*1940*1945*1950*1955*1960*1965*1970*1975*1980*1985*1990*1995*2000*2005*2010*2015*2020*2025*2030*2035*2040*2045*2050*2055*2060*2065*2070*2075*2080*2085*2090*2095*2100*2105*2110*2115*2120*2125*2130*2135*2140*2145*2150*2155*2160*2165*2170*2175*2180*2185*2190*2195*2200*2205*2210*2215*2220*2225*2230*2235*2240*2245*2250*2255*2260*2265*2270*2275*2280*2285*2290*2295*2300*2305*2310*2315*2320*2325*2330*2335*2340*2345*2350*2355*2360*2365*2370*2375*2380*2385*2390*2395*2400*2405*2410*2415*2420*2425*2430*2435*2440*2445*2450*2455*2460*2465*2470*2475*2480*2485*2490*2495*2500*2505*2510*2515*2520*2525*2530*2535*2540*2545*2550*2555*2560*2565*2570*2575*2580*2585*2590*2595*2600*2605*2610*2615*2620*2625*2630*2635*2640*2645*2650*2655*2660*2665*2670*2675*2680*2685*2690*2695*2700*2705*2710*2715*2720*2725*2730*2735*2740*2745*2750*2755*2760*2765*2770*2775*2780*2785*2790*2795*2800*"
> val json_tuple_result = Seq(s"""{"test":"$counterstring"}""").toDF("json")
>   .withColumn("result", json_tuple('json, "test"))
>   .select('result)
>   .as[String].head.length
> val from_json_result = Seq(s"""{"test":"$counterstring"}""").toDF("json")
>   .withColumn("parsed", from_json('json, StructType(Seq(StructField("test", 
> StringType)))))
>   .withColumn("result", $"parsed.test")
>   .select('result)
>   .as[String].head.length
> scala> json_tuple_result
> res62: Int = 2791
> scala> from_json_result
> res63: Int = 2800
> {code}
> The result is influenced by the total length of the JSON string at the moment of 
> parsing:
> {code}
> val json_tuple_result_with_prefix = Seq(s"""{"prefix": "dummy", 
> "test":"$counterstring"}""").toDF("json")
>   .withColumn("result", json_tuple('json, "test"))
>   .select('result)
>   .as[String].head.length
> scala> json_tuple_result_with_prefix
> res64: Int = 2772
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-29913) Improve Exception in postgreCastToBoolean

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29913.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26546
[https://github.com/apache/spark/pull/26546]

> Improve Exception in postgreCastToBoolean 
> --
>
> Key: SPARK-29913
> URL: https://issues.apache.org/jira/browse/SPARK-29913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Minor
> Fix For: 3.0.0
>
>
> Improve Exception in postgreCastToBoolean 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29913) Improve Exception in postgreCastToBoolean

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29913:
---

Assignee: jobit mathew

> Improve Exception in postgreCastToBoolean 
> --
>
> Key: SPARK-29913
> URL: https://issues.apache.org/jira/browse/SPARK-29913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Minor
>
> Improve Exception in postgreCastToBoolean 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29947) Improve ResolveRelations and ResolveTables performance

2019-11-19 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978091#comment-16978091
 ] 

Yuming Wang commented on SPARK-29947:
-


{code:sql}
CREATE TABLE table1 (`rev_rollup_id` SMALLINT, `rev_rollup` STRING, 
`rev_rollup_name` STRING, `curncy_id` DECIMAL(4,0), `holding_company` 
DECIMAL(9,0), `company_code` DECIMAL(9,0), `business_unit` DECIMAL(9,0), 
`dept_code` STRING, `online` DECIMAL(9,0), `premier` DECIMAL(9,0), `partner` 
DECIMAL(9,0), `real_estate` DECIMAL(9,0), `legal_entity_id` DECIMAL(4,0), 
`src_rev_rollup_id` SMALLINT, `prft_cntr` DECIMAL(9,0), `mngrl_cntry_id` 
STRING, `cre_date` DATE, `upd_date` TIMESTAMP, `cre_user` STRING, `upd_user` 
STRING, `pb_min_succ_bid_count` INT, `pb_enbld_yn_id` TINYINT, 
`sap_prft_cntr_id` INT)
USING parquet;

CREATE TABLE table2 (`cntry_id` DECIMAL(4,0), `curncy_id` DECIMAL(4,0), 
`cntry_desc` STRING, `cntry_code` STRING, `iso_cntry_code` STRING, `cultural` 
STRING, `cntry_busn_unit` STRING, `high_vol_cntry_yn_id` TINYINT, `check_sil` 
TINYINT, `rev_rollup_id` SMALLINT, `rev_rollup` STRING, `prft_cntr_id` INT, 
`prft_cntr` STRING, `cre_date` DATE, `upd_date` TIMESTAMP, `cre_user` STRING, 
`upd_user` STRING)
USING parquet;

CREATE TABLE table3 (`CURNCY_ID` DECIMAL(9,0), `CURNCY_PLAN_RATE` 
DECIMAL(18,6), `CRE_DATE` DATE, `CRE_USER` STRING, `UPD_DATE` TIMESTAMP, 
`UPD_USER` STRING)
USING parquet;

CREATE TABLE table4(`adj_type_id` tinyint, `byr_cntry_id` decimal(4,0), 
`sap_category_id` decimal(9,0), `lstg_site_id` decimal(9,0), `lstg_type_code` 
decimal(4,0), `offrd_slng_chnl_grp_id` smallint, `slr_cntry_id` decimal(4,0), 
`sold_slng_chnl_grp_id` smallint, `bin_lstg_yn_id` tinyint, `bin_sold_yn_id` 
tinyint, `lstg_curncy_id` decimal(4,0), `blng_curncy_id` decimal(4,0), 
`bid_count` decimal(18,0), `ck_trans_count` decimal(18,0), `ended_bid_count` 
decimal(18,0), `new_lstg_count` decimal(18,0), `ended_lstg_count` 
decimal(18,0), `ended_success_lstg_count` decimal(18,0), `item_sold_count` 
decimal(18,0), `gmv_us_amt` decimal(18,2), `gmv_byr_lc_amt` decimal(18,2), 
`gmv_slr_lc_amt` decimal(18,2), `gmv_lstg_curncy_amt` decimal(18,2), 
`gmv_us_m_amt` decimal(18,2), `rvnu_insrtn_fee_us_amt` decimal(18,6), 
`rvnu_insrtn_fee_lc_amt` decimal(18,6), `rvnu_insrtn_fee_bc_amt` decimal(18,6), 
`rvnu_insrtn_fee_us_m_amt` decimal(18,6), `rvnu_insrtn_crd_us_amt` 
decimal(18,6), `rvnu_insrtn_crd_lc_amt` decimal(18,6), `rvnu_insrtn_crd_bc_amt` 
decimal(18,6), `rvnu_insrtn_crd_us_m_amt` decimal(18,6), `rvnu_fetr_fee_us_amt` 
decimal(18,6), `rvnu_fetr_fee_lc_amt` decimal(18,6), `rvnu_fetr_fee_bc_amt` 
decimal(18,6), `rvnu_fetr_fee_us_m_amt` decimal(18,6), `rvnu_fetr_crd_us_amt` 
decimal(18,6), `rvnu_fetr_crd_lc_amt` decimal(18,6), `rvnu_fetr_crd_bc_amt` 
decimal(18,6), `rvnu_fetr_crd_us_m_amt` decimal(18,6), `rvnu_fv_fee_us_amt` 
decimal(18,6), `rvnu_fv_fee_slr_lc_amt` decimal(18,6), `rvnu_fv_fee_byr_lc_amt` 
decimal(18,6), `rvnu_fv_fee_bc_amt` decimal(18,6), `rvnu_fv_fee_us_m_amt` 
decimal(18,6), `rvnu_fv_crd_us_amt` decimal(18,6), `rvnu_fv_crd_byr_lc_amt` 
decimal(18,6), `rvnu_fv_crd_slr_lc_amt` decimal(18,6), `rvnu_fv_crd_bc_amt` 
decimal(18,6), `rvnu_fv_crd_us_m_amt` decimal(18,6), `rvnu_othr_l_fee_us_amt` 
decimal(18,6), `rvnu_othr_l_fee_lc_amt` decimal(18,6), `rvnu_othr_l_fee_bc_amt` 
decimal(18,6), `rvnu_othr_l_fee_us_m_amt` decimal(18,6), 
`rvnu_othr_l_crd_us_amt` decimal(18,6), `rvnu_othr_l_crd_lc_amt` decimal(18,6), 
`rvnu_othr_l_crd_bc_amt` decimal(18,6), `rvnu_othr_l_crd_us_m_amt` 
decimal(18,6), `rvnu_othr_nl_fee_us_amt` decimal(18,6), 
`rvnu_othr_nl_fee_lc_amt` decimal(18,6), `rvnu_othr_nl_fee_bc_amt` 
decimal(18,6), `rvnu_othr_nl_fee_us_m_amt` decimal(18,6), 
`rvnu_othr_nl_crd_us_amt` decimal(18,6), `rvnu_othr_nl_crd_lc_amt` 
decimal(18,6), `rvnu_othr_nl_crd_bc_amt` decimal(18,6), 
`rvnu_othr_nl_crd_us_m_amt` decimal(18,6), `rvnu_slr_tools_fee_us_amt` 
decimal(18,6), `rvnu_slr_tools_fee_lc_amt` decimal(18,6), 
`rvnu_slr_tools_fee_bc_amt` decimal(18,6), `rvnu_slr_tools_fee_us_m_amt` 
decimal(18,6), `rvnu_slr_tools_crd_us_amt` decimal(18,6), 
`rvnu_slr_tools_crd_lc_amt` decimal(18,6), `rvnu_slr_tools_crd_bc_amt` 
decimal(18,6), `rvnu_slr_tools_crd_us_m_amt` decimal(18,6), 
`rvnu_unasgnd_us_amt` decimal(18,6), `rvnu_unasgnd_lc_amt` decimal(18,6), 
`rvnu_unasgnd_bc_amt` decimal(18,6), `rvnu_unasgnd_us_m_amt` decimal(18,6), 
`rvnu_ad_fee_us_amt` decimal(18,6), `rvnu_ad_fee_lc_amt` decimal(18,6), 
`rvnu_ad_fee_bc_amt` decimal(18,6), `rvnu_ad_fee_us_m_amt` decimal(18,6), 
`rvnu_ad_crd_us_amt` decimal(18,6), `rvnu_ad_crd_lc_amt` decimal(18,6), 
`rvnu_ad_crd_bc_amt` decimal(18,6), `rvnu_ad_crd_us_m_amt` decimal(18,6), 
`rvnu_othr_ad_fee_us_amt` decimal(18,6), `rvnu_othr_ad_fee_lc_amt` 
decimal(18,6), `rvnu_othr_ad_fee_bc_amt` decimal(18,6), 
`rvnu_othr_ad_fee_us_m_amt` decimal(18,6), `cre_date` date, `cre_user` string, 
`upd_date` timestamp, `upd_user` string, 

[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-11-19 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978080#comment-16978080
 ] 

L. C. Hsieh commented on SPARK-24666:
-

[~zhongyu09] Thanks! I will look into this and see if I can reproduce it.

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates large-absolute-value vectors when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector's values may be *Infinity* (or *-Infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
> ===
> Here is the code to reproduce the issue. You can download title.akas.tsv 
> from [https://datasets.imdbws.com/] and upload it to HDFS.
>  
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.ml.feature.Word2Vec
> case class Sentences(name: String, words: Array[String])
> import spark.implicits._
> // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
> val dataset = spark.read
>   .option("header", "true").option("sep", "\t")
>   .option("quote", "").option("nullValue", "\\N")
>   .csv("/tmp/word2vec/title.akas.tsv")
>   .filter("region = 'US' or language = 'en'")
>   .select("title")
>   .as[String]
>   .map(s => Sentences(s, s.split(' ')))
>   .persist()
> println("Training model...")
> val word2Vec = new Word2Vec()
>   .setInputCol("words")
>   .setOutputCol("vector")
>   .setVectorSize(64)
>   .setWindowSize(4)
>   .setNumPartitions(50)
>   .setMinCount(5)
>   .setMaxIter(20)
> val model = word2Vec.fit(dataset)
> model.getVectors.show()
> {code}
> When maxIter is set to 30, you will get the following result.
> {code:java}
> scala> model.getVectors.show()
> +-++
> | word|  vector|
> +-++
> | Unspoken|[-Infinity,-Infin...|
> |   Talent|[Infinity,-Infini...|
> |Hourglass|[1.09657520526310...|
> |Nickelodeon's|[2.20436549446219...|
> |  Priests|[-1.9625896848389...|
> |Religion:|[-3.8815759928213...|
> |   Bu|[-7.9722236466752...|
> |  Totoro:|[-4.1829056206528...|
> | Trouble,|[2.51985378203136...|
> |   Hatter|[8.49108115961009...|
> |  '79|[-5.4560309784650...|
> | Vile|[-1.2059769646379...|
> | 9/11|[Infinity,-Infini...|
> |  Santino|[6.30405421282099...|
> |  Motives|[1.96207712570869...|
> |  '13|[-1.7641987324084...|
> |   Fierce|[-Infinity,Infini...|
> |   Stover|[5.10057474120744...|
> |  'It|[1.08629989605664...|
> |Butts|[Infinity,Infinit...|
> +-++
> only showing top 20 rows
> {code}
> In this case, setting maxIter to 20 may not generate Infinity, but it still 
> produces very large absolute values. It depends on the training data sample and 
> other configurations.
> {code:java}
> scala> model.getVectors.show(2,false)
> 

[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-11-19 Thread ZhongYu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978077#comment-16978077
 ] 

ZhongYu commented on SPARK-24666:
-

Hi [~viirya] and [~holden], I have posted the data and code to reproduce this issue.

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates large-absolute-value vectors when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector's values may be *Infinity* (or *-Infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
> ===
> Here is the code to reproduce the issue. You can download title.akas.tsv 
> from [https://datasets.imdbws.com/] and upload it to HDFS.
>  
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.ml.feature.Word2Vec
> case class Sentences(name: String, words: Array[String])
> import spark.implicits._
> // IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
> val dataset = spark.read
>   .option("header", "true").option("sep", "\t")
>   .option("quote", "").option("nullValue", "\\N")
>   .csv("/tmp/word2vec/title.akas.tsv")
>   .filter("region = 'US' or language = 'en'")
>   .select("title")
>   .as[String]
>   .map(s => Sentences(s, s.split(' ')))
>   .persist()
> println("Training model...")
> val word2Vec = new Word2Vec()
>   .setInputCol("words")
>   .setOutputCol("vector")
>   .setVectorSize(64)
>   .setWindowSize(4)
>   .setNumPartitions(50)
>   .setMinCount(5)
>   .setMaxIter(20)
> val model = word2Vec.fit(dataset)
> model.getVectors.show()
> {code}
> When maxIter is set to 30, you will get the following result.
> {code:java}
> scala> model.getVectors.show()
> +-++
> | word|  vector|
> +-++
> | Unspoken|[-Infinity,-Infin...|
> |   Talent|[Infinity,-Infini...|
> |Hourglass|[1.09657520526310...|
> |Nickelodeon's|[2.20436549446219...|
> |  Priests|[-1.9625896848389...|
> |Religion:|[-3.8815759928213...|
> |   Bu|[-7.9722236466752...|
> |  Totoro:|[-4.1829056206528...|
> | Trouble,|[2.51985378203136...|
> |   Hatter|[8.49108115961009...|
> |  '79|[-5.4560309784650...|
> | Vile|[-1.2059769646379...|
> | 9/11|[Infinity,-Infini...|
> |  Santino|[6.30405421282099...|
> |  Motives|[1.96207712570869...|
> |  '13|[-1.7641987324084...|
> |   Fierce|[-Infinity,Infini...|
> |   Stover|[5.10057474120744...|
> |  'It|[1.08629989605664...|
> |Butts|[Infinity,Infinit...|
> +-++
> only showing top 20 rows
> {code}
> In this case, setting maxIter to 20 may not generate Infinity, but it still 
> produces very large absolute values. It depends on the training data sample and 
> other configurations.
> {code:java}
> scala> model.getVectors.show(2,false)
> 

[jira] [Updated] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-11-19 Thread ZhongYu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhongYu updated SPARK-24666:

Description: 
We found that Word2Vec generates large-absolute-value vectors when numIterations 
is large, and if numIterations is large enough (>20), the vector's values may 
be *Infinity* (or *-Infinity*), resulting in useless vectors.

In normal situations, vector values are mainly around -1.0~1.0 when 
numIterations = 1.

The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.

There is already an issue reporting this bug: 
https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
missing.

Other people's reports:

[https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]

[http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]

===

Here is the code to reproduce the issue. You can download title.akas.tsv from 
[https://datasets.imdbws.com/] and upload it to HDFS.

 
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Word2Vec

case class Sentences(name: String, words: Array[String])

import spark.implicits._

// IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/word2vec/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(20)
val model = word2Vec.fit(dataset)

model.getVectors.show()
{code}
When maxIter is set to 30, you will get the following result.
{code:java}
scala> model.getVectors.show()
+-++
| word|  vector|
+-++
| Unspoken|[-Infinity,-Infin...|
|   Talent|[Infinity,-Infini...|
|Hourglass|[1.09657520526310...|
|Nickelodeon's|[2.20436549446219...|
|  Priests|[-1.9625896848389...|
|Religion:|[-3.8815759928213...|
|   Bu|[-7.9722236466752...|
|  Totoro:|[-4.1829056206528...|
| Trouble,|[2.51985378203136...|
|   Hatter|[8.49108115961009...|
|  '79|[-5.4560309784650...|
| Vile|[-1.2059769646379...|
| 9/11|[Infinity,-Infini...|
|  Santino|[6.30405421282099...|
|  Motives|[1.96207712570869...|
|  '13|[-1.7641987324084...|
|   Fierce|[-Infinity,Infini...|
|   Stover|[5.10057474120744...|
|  'It|[1.08629989605664...|
|Butts|[Infinity,Infinit...|
+-++
only showing top 20 rows
{code}
In this case, setting maxIter to 20 may not generate Infinity, but it still 
produces very large absolute values. It depends on the training data sample and 
other configurations.
{code:java}
scala> model.getVectors.show(2,false)
++---+
|word|vector



  

[jira] [Updated] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-11-19 Thread ZhongYu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhongYu updated SPARK-24666:

Description: 
We found that Word2Vec generates large-absolute-value vectors when numIterations 
is large, and if numIterations is large enough (>20), the vector's values may 
be *Infinity* (or *-Infinity*), resulting in useless vectors.

In normal situations, vector values are mainly around -1.0~1.0 when 
numIterations = 1.

The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.

There is already an issue reporting this bug: 
https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
missing.

Other people's reports:

[https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]

[http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]

===

Here is the code to reproduce the issue. You can download title.akas.tsv from 
[https://datasets.imdbws.com/] and upload it to HDFS.

 
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Word2Vec

case class Sentences(name: String, words: Array[String])
// IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/word2vec/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(30)
val model = word2Vec.fit(dataset)

model.getVectors.show()
{code}
When maxIter is set to 30, you will get the following result:
{code:java}
scala> model.getVectors.show()
+-++
| word|  vector|
+-++
| Unspoken|[-Infinity,-Infin...|
|   Talent|[Infinity,-Infini...|
|Hourglass|[1.09657520526310...|
|Nickelodeon's|[2.20436549446219...|
|  Priests|[-1.9625896848389...|
|Religion:|[-3.8815759928213...|
|   Bu|[-7.9722236466752...|
|  Totoro:|[-4.1829056206528...|
| Trouble,|[2.51985378203136...|
|   Hatter|[8.49108115961009...|
|  '79|[-5.4560309784650...|
| Vile|[-1.2059769646379...|
| 9/11|[Infinity,-Infini...|
|  Santino|[6.30405421282099...|
|  Motives|[1.96207712570869...|
|  '13|[-1.7641987324084...|
|   Fierce|[-Infinity,Infini...|
|   Stover|[5.10057474120744...|
|  'It|[1.08629989605664...|
|Butts|[Infinity,Infinit...|
+-++
only showing top 20 rows
{code}
 

  was:
We found that Word2Vec generates large-absolute-value vectors when numIterations 
is large, and if numIterations is large enough (>20), the vector's values may 
be *Infinity* (or *-Infinity*), resulting in useless vectors.

In normal situations, vector values are mainly around -1.0~1.0 when 
numIterations = 1.

The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.

There is already an issue reporting this bug: 
https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
missing.

Other people's reports:

[https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]

[http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]

 

 


> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates large-absolute-value vectors when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector's values may be *Infinity* (or *-Infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> 

[jira] [Updated] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-11-19 Thread ZhongYu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhongYu updated SPARK-24666:

Affects Version/s: 2.4.4

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates large-absolute-value vectors when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector's values may be *Infinity* (or *-Infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
>  
>  
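> For reference, a quick way to scan a fitted model for the non-finite vectors 
> described above (a minimal sketch, not from the original report; it assumes a 
> fitted org.apache.spark.ml.feature.Word2VecModel named `model` and a spark-shell 
> session where spark.implicits._ is in scope):
> {code:java}
> import org.apache.spark.ml.linalg.Vector
> import org.apache.spark.sql.functions.udf
> 
> // True if any component of a word vector is Infinity, -Infinity or NaN.
> val hasNonFinite = udf((v: Vector) => v.toArray.exists(x => x.isInfinite || x.isNaN))
> 
> // model.getVectors has two columns: word (string) and vector (ml Vector).
> model.getVectors.filter(hasNonFinite($"vector")).show(false)
> {code}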



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-11-19 Thread ZhongYu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhongYu updated SPARK-24666:

Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X  (was:  2.0.X, 2.1.X, 
2.2.X, 2.3.X)

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1, 2.4.4
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates large-absolute-value vectors when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector's values may be *Infinity* (or *-Infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug is seen on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be 
> missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21232) New built-in SQL function - Data_Type

2019-11-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21232.
--
Resolution: Duplicate

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21232) New built-in SQL function - Data_Type

2019-11-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-21232:
--

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29029) PhysicalOperation.collectProjectsAndFilters should use AttributeMap while substituting aliases

2019-11-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29029:
-

Assignee: Nikita Konda

> PhysicalOperation.collectProjectsAndFilters should use AttributeMap while 
> substituting aliases
> --
>
> Key: SPARK-29029
> URL: https://issues.apache.org/jira/browse/SPARK-29029
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.0
>Reporter: Nikita Konda
>Assignee: Nikita Konda
>Priority: Major
>
> We have a specific use case wherein we are trying to insert a custom logical 
> operator into our logical plan to avoid some of Spark's optimization rules. 
> We remove this logical operator as part of a custom optimization rule 
> before the plan is sent to SparkStrategies.
> However, we are hitting an issue in the following scenario:
> Analyzed plan:
> {code:java}
> [1] Project [userid#0]
> +- [2] SubqueryAlias tmp6
>+- [3] Project [videoid#47L, avebitrate#2, userid#0]
>   +- [4] Filter NOT (videoid#47L = cast(30 as bigint))
>  +- [5] SubqueryAlias tmp5
> +- [6] CustomBarrier
>+- [7] Project [videoid#47L, avebitrate#2, userid#0]
>   +- [8] Filter (avebitrate#2 < 10)
>  +- [9] SubqueryAlias tmp3
> +- [10] Project [avebitrate#2, factorial(videoid#1) 
> AS videoid#47L, userid#0]
>+- [11] SubqueryAlias tmp2
>   +- [12] Project [userid#0, videoid#1, 
> avebitrate#2]
>  +- [13] SubqueryAlias tmp1
> +- [14] Project [userid#0, videoid#1, 
> avebitrate#2]
>+- [15] SubqueryAlias views
>   +- [16] 
> Relation[userid#0,videoid#1,avebitrate#2] 
> {code}
>  
> Optimized Plan:
> {code:java}
> [1] Project [userid#0]
> +- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30))
>+- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0]
>   +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10))
>  +- [5] Relation[userid#0,videoid#1,avebitrate#2]
> {code}
>  
>  When this plan is passed into *PhysicalOperation* in *DataSourceStrategy*, 
> collectProjectsAndFilters collects the filters as 
> List[AttributeReference(videoid#47L), AttributeReference(avebitrate#2)]. 
> However, at this stage the base relation only has videoid#1, and hence it 
> throws an exception saying *key not found: videoid#47L*.
>  On looking further, we noticed that the alias map in 
> *PhysicalOperation.substitute* does have an entry with key *videoid#47L* -> 
> factorial(videoid#1). However, the substitution does not replace the alias 
> videoid#47L with its expression because the two attributes differ in the 
> qualifier parameter.
>  Attribute key in the alias map: AttributeReference("videoid", LongType, 
> nullable = true)(ExprId(47, _), *"None"*)
>  Attribute in the Filter condition: AttributeReference("videoid", LongType, 
> nullable = true)(ExprId(47, _), *"Some(tmp5)"*)
> The two differ only in the qualifier; if we use AttributeMap instead of 
> Map[Attribute, Expression] for the alias map, we can get rid of the above 
> issue. 
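> As an illustration of that suggestion (a minimal sketch against the Spark 3.x 
> Catalyst API, where the qualifier is a Seq[String]; this is not code from the 
> ticket), AttributeMap keys its entries by ExprId, so a lookup succeeds even when 
> the qualifier differs:
> {code:java}
> import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, AttributeReference}
> import org.apache.spark.sql.types.LongType
> 
> val videoid = AttributeReference("videoid", LongType)()   // fresh ExprId, empty qualifier
> val qualified = videoid.withQualifier(Seq("tmp5"))        // same ExprId, different qualifier
> 
> // A plain Map compares full attribute equality (including the qualifier), so the
> // lookup fails, which matches the "key not found" error described above.
> val plainMap = Map[Attribute, String](videoid -> "factorial(videoid#1)")
> plainMap.contains(qualified)                              // false
> 
> // AttributeMap resolves keys by ExprId only, so the same lookup succeeds.
> val attrMap = AttributeMap(Seq(videoid -> "factorial(videoid#1)"))
> attrMap.contains(qualified)                               // true
> {code}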



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29029) PhysicalOperation.collectProjectsAndFilters should use AttributeMap while substituting aliases

2019-11-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29029.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25761
[https://github.com/apache/spark/pull/25761]

> PhysicalOperation.collectProjectsAndFilters should use AttributeMap while 
> substituting aliases
> --
>
> Key: SPARK-29029
> URL: https://issues.apache.org/jira/browse/SPARK-29029
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.0
>Reporter: Nikita Konda
>Assignee: Nikita Konda
>Priority: Major
> Fix For: 3.0.0
>
>
> We have a specific use case wherein we are trying to insert a custom logical 
> operator into our logical plan to avoid some of Spark's optimization rules. 
> We remove this logical operator as part of a custom optimization rule 
> before the plan is sent to SparkStrategies.
> However, we are hitting an issue in the following scenario:
> Analyzed plan:
> {code:java}
> [1] Project [userid#0]
> +- [2] SubqueryAlias tmp6
>+- [3] Project [videoid#47L, avebitrate#2, userid#0]
>   +- [4] Filter NOT (videoid#47L = cast(30 as bigint))
>  +- [5] SubqueryAlias tmp5
> +- [6] CustomBarrier
>+- [7] Project [videoid#47L, avebitrate#2, userid#0]
>   +- [8] Filter (avebitrate#2 < 10)
>  +- [9] SubqueryAlias tmp3
> +- [10] Project [avebitrate#2, factorial(videoid#1) 
> AS videoid#47L, userid#0]
>+- [11] SubqueryAlias tmp2
>   +- [12] Project [userid#0, videoid#1, 
> avebitrate#2]
>  +- [13] SubqueryAlias tmp1
> +- [14] Project [userid#0, videoid#1, 
> avebitrate#2]
>+- [15] SubqueryAlias views
>   +- [16] 
> Relation[userid#0,videoid#1,avebitrate#2] 
> {code}
>  
> Optimized Plan:
> {code:java}
> [1] Project [userid#0]
> +- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30))
>+- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0]
>   +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10))
>  +- [5] Relation[userid#0,videoid#1,avebitrate#2]
> {code}
>  
>  When this plan is passed into *PhysicalOperation* in *DataSourceStrategy*, 
> collectProjectsAndFilters collects the filters as 
> List[AttributeReference(videoid#47L), AttributeReference(avebitrate#2)]. 
> However, at this stage the base relation only has videoid#1, and hence it 
> throws an exception saying *key not found: videoid#47L*.
>  On looking further, we noticed that the alias map in 
> *PhysicalOperation.substitute* does have an entry with key *videoid#47L* -> 
> factorial(videoid#1). However, the substitution does not replace the alias 
> videoid#47L with its expression because the two attributes differ in the 
> qualifier parameter.
>  Attribute key in the alias map: AttributeReference("videoid", LongType, 
> nullable = true)(ExprId(47, _), *"None"*)
>  Attribute in the Filter condition: AttributeReference("videoid", LongType, 
> nullable = true)(ExprId(47, _), *"Some(tmp5)"*)
> The two differ only in the qualifier; if we use AttributeMap instead of 
> Map[Attribute, Expression] for the alias map, we can get rid of the above 
> issue. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29969) parse_url function result in incorrect result

2019-11-19 Thread Victor Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Zhang updated SPARK-29969:
-
Attachment: spark-result.jpg
hive-result.jpg

> parse_url function result in incorrect result
> -
>
> Key: SPARK-29969
> URL: https://issues.apache.org/jira/browse/SPARK-29969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.4
>Reporter: Victor Zhang
>Priority: Major
> Attachments: hive-result.jpg, spark-result.jpg
>
>
> In this Jira, java.net.URI was used instead of java.net.URL for performance 
> reasons:
> https://issues.apache.org/jira/browse/SPARK-16826
> However, with some unconventional inputs it can lead to incorrect results.
> For example, when the URL is percent-encoded, the function cannot resolve the 
> correct result.
>  
> 0: jdbc:hive2://localhost:1> SELECT 
> parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
>  'HOST');
> ++--+
> | 
> parse_url(http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%,
>  HOST) |
> ++--+
> | NULL |
> ++--+
> 1 row selected (0.094 seconds)
>  
> hive> SELECT 
> parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
>  'HOST');
> OK
> HEADER: _c0
> uzzf.down.gsxzq.com
> Time taken: 4.423 seconds, Fetched: 1 row(s)
>  
> Here's a similar problem:
> https://issues.apache.org/jira/browse/SPARK-23056
> Our team used this Spark function to process data for months, and now we have to 
> run it all again.
> It's just too painful. :( :( :(



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29969) parse_url function result in incorrect result

2019-11-19 Thread Victor Zhang (Jira)
Victor Zhang created SPARK-29969:


 Summary: parse_url function result in incorrect result
 Key: SPARK-29969
 URL: https://issues.apache.org/jira/browse/SPARK-29969
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.1
Reporter: Victor Zhang


In this Jira, java.net.URI was used instead of java.net.URL for performance reasons:

https://issues.apache.org/jira/browse/SPARK-16826

However, with some unconventional inputs it can lead to incorrect results.

For example, when the URL is percent-encoded, the function cannot resolve the 
correct result.

 

0: jdbc:hive2://localhost:1> SELECT 
parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
 'HOST');
++--+
| 
parse_url(http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%,
 HOST) |
++--+
| NULL |
++--+
1 row selected (0.094 seconds)

 

hive> SELECT 
parse_url('http://uzzf.down.gsxzq.com/download/%E5%B8%B8%E7%94%A8%E9%98%80%E9%97%A8CAD%E5%9B%BE%E7%BA%B8%E5%A4%',
 'HOST');

OK

HEADER: _c0

uzzf.down.gsxzq.com

Time taken: 4.423 seconds, Fetched: 1 row(s)

 

Here's a similar problem:

https://issues.apache.org/jira/browse/SPARK-23056

Our team used this Spark function to process data for months, and now we have to run 
it all again.

It's just too painful. :( :( :(
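
For reference, the difference can be reproduced with the JDK classes directly (a 
minimal sketch using a hypothetical URL that ends in an incomplete escape, like the 
one above; this is not the exact code path inside Spark's parse_url):

{code:java}
// java.net.URL does not validate percent-encoding in the path, so the host is still available.
new java.net.URL("http://example.com/download/%E5%A4%").getHost
// res: String = example.com

// java.net.URI rejects the trailing "%" because it is not followed by two hex digits
// (URISyntaxException: Malformed escape pair), so a caller that swallows the
// exception can only return NULL.
scala.util.Try(new java.net.URI("http://example.com/download/%E5%A4%")).isFailure
// res: Boolean = true
{code}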

 

 

 

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29968) Remove the Predicate code from SparkPlan

2019-11-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-29968:
-
Description: 
This is to refactor Predicate code; it mainly intends to remove 
{{newPredicate}} from {{SparkPlan}}.
Modifications are listed below;
 * Move {{Predicate}} from 
{{o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala}} to 
{{o.a.s.sqlcatalyst.expressions.predicates.scala}}
 * To resolve the name conflict, rename 
{{o.a.s.sqlcatalyst.expressions.codegen.Predicate}} to 
{{o.a.s.sqlcatalyst.expressions.BasePredicate}}
 * Extend {{CodeGeneratorWithInterpretedFallback }}for {{BasePredicate}}

> Remove the Predicate code from SparkPlan
> 
>
> Key: SPARK-29968
> URL: https://issues.apache.org/jira/browse/SPARK-29968
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This is to refactor Predicate code; it mainly intends to remove 
> {{newPredicate}} from {{SparkPlan}}.
> Modifications are listed below;
>  * Move {{Predicate}} from 
> {{o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala}} to 
> {{o.a.s.sqlcatalyst.expressions.predicates.scala}}
>  * To resolve the name conflict, rename 
> {{o.a.s.sqlcatalyst.expressions.codegen.Predicate}} to 
> {{o.a.s.sqlcatalyst.expressions.BasePredicate}}
>  * Extend {{CodeGeneratorWithInterpretedFallback }}for {{BasePredicate}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29968) Remove the Predicate code from SparkPlan

2019-11-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-29968:
-
Description: 
This is to refactor Predicate code; it mainly intends to remove 
{{newPredicate}} from {{SparkPlan}}.
 Modifications are listed below;
 * Move {{Predicate}} from 
{{o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala}} to 
{{o.a.s.sqlcatalyst.expressions.predicates.scala}}
 * To resolve the name conflict, rename 
{{o.a.s.sqlcatalyst.expressions.codegen.Predicate}} to 
{{o.a.s.sqlcatalyst.expressions.BasePredicate}}
 * Extend {{CodeGeneratorWithInterpretedFallback for BasePredicate}}

  was:
This is to refactor Predicate code; it mainly intends to remove 
{{newPredicate}} from {{SparkPlan}}.
Modifications are listed below;
 * Move {{Predicate}} from 
{{o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala}} to 
{{o.a.s.sqlcatalyst.expressions.predicates.scala}}
 * To resolve the name conflict, rename 
{{o.a.s.sqlcatalyst.expressions.codegen.Predicate}} to 
{{o.a.s.sqlcatalyst.expressions.BasePredicate}}
 * Extend {{CodeGeneratorWithInterpretedFallback }}for {{BasePredicate}}


> Remove the Predicate code from SparkPlan
> 
>
> Key: SPARK-29968
> URL: https://issues.apache.org/jira/browse/SPARK-29968
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This is to refactor Predicate code; it mainly intends to remove 
> {{newPredicate}} from {{SparkPlan}}.
>  Modifications are listed below;
>  * Move {{Predicate}} from 
> {{o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala}} to 
> {{o.a.s.sqlcatalyst.expressions.predicates.scala}}
>  * To resolve the name conflict, rename 
> {{o.a.s.sqlcatalyst.expressions.codegen.Predicate}} to 
> {{o.a.s.sqlcatalyst.expressions.BasePredicate}}
>  * Extend {{CodeGeneratorWithInterpretedFallback for BasePredicate}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29968) Remove the Predicate code from SparkPlan

2019-11-19 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29968:


 Summary: Remove the Predicate code from SparkPlan
 Key: SPARK-29968
 URL: https://issues.apache.org/jira/browse/SPARK-29968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29945) do not handle negative sign specially in the parser

2019-11-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-29945.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26578]

> do not handle negative sign specially in the parser
> ---
>
> Key: SPARK-29945
> URL: https://issues.apache.org/jira/browse/SPARK-29945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29963) Check formatting timestamps up to microsecond precision by JSON/CSV datasource

2019-11-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29963.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26601
[https://github.com/apache/spark/pull/26601]

> Check formatting timestamps up to microsecond precision by JSON/CSV datasource
> --
>
> Key: SPARK-29963
> URL: https://issues.apache.org/jira/browse/SPARK-29963
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Port tests added for 2.4 by the commit: 
> https://github.com/apache/spark/commit/47cb1f359af62383e24198dbbaa0b4503348cd04



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29963) Check formatting timestamps up to microsecond precision by JSON/CSV datasource

2019-11-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29963:


Assignee: Maxim Gekk

> Check formatting timestamps up to microsecond precision by JSON/CSV datasource
> --
>
> Key: SPARK-29963
> URL: https://issues.apache.org/jira/browse/SPARK-29963
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Port tests added for 2.4 by the commit: 
> https://github.com/apache/spark/commit/47cb1f359af62383e24198dbbaa0b4503348cd04



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29967) KMeans support instance weighting

2019-11-19 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29967:


 Summary: KMeans support instance weighting
 Key: SPARK-29967
 URL: https://issues.apache.org/jira/browse/SPARK-29967
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Since https://issues.apache.org/jira/browse/SPARK-9610, we have started to support 
instance weighting in ML.

However, clustering and some other implementations still do not support instance 
weighting.

I think we need to start supporting weighting in KMeans, like scikit-learn 
does; a rough sketch of the resulting user-facing API is shown after the list below.

It will contain three parts:

1, move the impl from .mllib to .ml

2, make .mllib.KMeans a wrapper of .ml.KMeans

3, support instance weighting in .ml.KMeans
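
As a rough illustration, here is a hypothetical sketch of what part 3 could look like 
from the user's side, assuming .ml.KMeans gains a weightCol param like other weighted 
estimators (the parameter is part of the proposal, not an existing API; run in 
spark-shell so that spark.implicits._ is in scope):

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Toy dataset: a features column plus a per-row weight column.
val dataset = Seq(
  (Vectors.dense(0.0, 0.0), 1.0),
  (Vectors.dense(0.5, 0.5), 2.0),
  (Vectors.dense(9.0, 8.0), 1.0)
).toDF("features", "weight")

val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setWeightCol("weight")  // proposed: each row's weight scales its contribution to the cost
val model = kmeans.fit(dataset)
model.clusterCenters.foreach(println)
{code}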



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29966) Add version method in TableCatalog to avoid load table twice

2019-11-19 Thread ulysses you (Jira)
ulysses you created SPARK-29966:
---

 Summary: Add version method in TableCatalog to avoid load table 
twice
 Key: SPARK-29966
 URL: https://issues.apache.org/jira/browse/SPARK-29966
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: ulysses you


Currently, resolving a logical plan loads the table twice, once in ResolveTables and 
once in ResolveRelations. ResolveRelations is the old code path and ResolveTables is 
the v2 code path; the table is loaded twice because ResolveTables loads the table 
and then falls back to the ResolveRelations code path for v1 tables.
The same situation also exists in ResolveSessionCatalog.

As a result, executing a command costs about twice as much time as in Spark 2.4.

The idea is to add a table version method to TableCatalog, so that rules can always 
fetch the table version first without loading the table.
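
A hypothetical sketch of the proposed direction (the names are illustrative and this 
is not an existing Spark API; the real TableCatalog lives in the v2 connector catalog 
package):

{code:java}
// The idea: expose a cheap, metadata-only version lookup next to the existing
// (expensive) table load, so that analyzer rules can first ask "has anything
// changed?" and avoid loading the same table a second time.
trait VersionedTableCatalog {
  def loadTable(ident: String): AnyRef    // stands in for the existing, expensive load
  def tableVersion(ident: String): Long   // proposed: cheap version/metadata lookup
}
{code}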





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29965) Race in executor shutdown handling can lead to executor never fully unregistering

2019-11-19 Thread Marcelo Masiero Vanzin (Jira)
Marcelo Masiero Vanzin created SPARK-29965:
--

 Summary: Race in executor shutdown handling can lead to executor 
never fully unregistering
 Key: SPARK-29965
 URL: https://issues.apache.org/jira/browse/SPARK-29965
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Masiero Vanzin


I ran into a situation that I had never noticed before, but that I seem to be able 
to hit within just a few retries when using K8S with dynamic allocation.

Basically, there's a race when killing an executor, where it may send a 
heartbeat to the driver right at the wrong time during shutdown, e.g.:

{noformat}
19/11/19 21:14:05 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
19/11/19 21:14:05 INFO Executor: Told to re-register on heartbeat
19/11/19 21:14:05 INFO BlockManager: BlockManager BlockManagerId(10, 
192.168.3.99, 39923, None) re-registering with master
19/11/19 21:14:05 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(10, 192.168.3.99, 39923, None)
19/11/19 21:14:05 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(10, 192.168.3.99, 39923, None)
19/11/19 21:14:06 INFO BlockManager: Reporting 0 blocks to the master.
{noformat}

On the driver side it will happily re-register the executor (time diff is just 
because of time zone in log4j config):

{noformat}
19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 
from BlockManagerMaster.
19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(10, 192.168.3.99, 39923, None)
19/11/19 13:14:05 INFO BlockManagerMaster: Removed 10 successfully in 
removeExecutor
19/11/19 13:14:05 INFO DAGScheduler: Shuffle files lost for executor: 10 (epoch 
18)
{noformat}

And a little later:

{noformat}
19/11/19 13:14:05 DEBUG HeartbeatReceiver: Received heartbeat from unknown 
executor 10
19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.3.99:39923 with 413.9 MiB RAM, BlockManagerId(10, 192.168.3.99, 39923, 
None)
{noformat}

This becomes a problem later, when you start to see periodic exceptions in the 
driver's logs:

{noformat}
19/11/19 13:14:39 WARN BlockManagerMasterEndpoint: Error trying to remove 
broadcast 4 from block manager BlockManagerId(10, 192.168.3.99, 39923, None)
java.io.IOException: Failed to send RPC RPC 4999007301825869809 to 
/10.65.55.240:14233: java.nio.channels.ClosedChannelException
at 
org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
at 
org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
{noformat}

That happens every time some code calls into the block manager to request something 
from all executors, meaning that the dead executor re-registered and was then 
never removed from the block manager.

I found a few races in the code that can lead to this situation. I'll post a PR 
once I test it more.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29964) lintr github action failed due to buggy GnuPG

2019-11-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29964:
-

Assignee: L. C. Hsieh

> lintr github action failed due to buggy GnuPG
> -
>
> Key: SPARK-29964
> URL: https://issues.apache.org/jira/browse/SPARK-29964
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Linter (R) github action failed like:
> https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016
> Failed message:
> {code}
> Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver 
> keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
> gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' 
> failed: IPC connect call failed
> gpg: keyserver receive failed: No dirmngr
> ##[error]Process completed with exit code 2.
> {code}
> It is due to a buggy GnuPG. Context:
> https://github.com/sbt/website/pull/825
> https://github.com/sbt/sbt/issues/4261
> https://github.com/microsoft/WSL/issues/3286



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29964) lintr github action failed due to buggy GnuPG

2019-11-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29964.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26602
[https://github.com/apache/spark/pull/26602]

> lintr github action failed due to buggy GnuPG
> -
>
> Key: SPARK-29964
> URL: https://issues.apache.org/jira/browse/SPARK-29964
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Linter (R) github action failed like:
> https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016
> Failed message:
> {code}
> Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver 
> keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
> gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' 
> failed: IPC connect call failed
> gpg: keyserver receive failed: No dirmngr
> ##[error]Process completed with exit code 2.
> {code}
> It is due to a buggy GnuPG. Context:
> https://github.com/sbt/website/pull/825
> https://github.com/sbt/sbt/issues/4261
> https://github.com/microsoft/WSL/issues/3286



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29933) ThriftServerQueryTestSuite runs tests with wrong settings

2019-11-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-29933:
-
Component/s: Tests

> ThriftServerQueryTestSuite runs tests with wrong settings
> -
>
> Key: SPARK-29933
> URL: https://issues.apache.org/jira/browse/SPARK-29933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: filter_tests.patch
>
>
> ThriftServerQueryTestSuite must run ANSI tests in the Spark dialect but it 
> keeps settings from previous runs. In fact, it runs `ansi/interval.sql` in 
> the PostgreSQL dialect. See 
> https://github.com/apache/spark/pull/26473#issuecomment-554510643



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2019-11-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977904#comment-16977904
 ] 

Dongjoon Hyun commented on SPARK-20202:
---

Hi, All.
I set the target version to `3.1.0`. Please join the discussion if you have any 
concerns.
- 
https://lists.apache.org/thread.html/eca4e55c717f35f41c029e227fa9be0a7ee2c8a6f378fcce8f9fd4ff@%3Cdev.spark.apache.org%3E

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29964) lintr github action failed due to buggy GnuPG

2019-11-19 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-29964:
---

 Summary: lintr github action failed due to buggy GnuPG
 Key: SPARK-29964
 URL: https://issues.apache.org/jira/browse/SPARK-29964
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: L. C. Hsieh


Linter (R) github action failed like:

https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016

Failed message:
{code}
Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver 
keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' failed: 
IPC connect call failed
gpg: keyserver receive failed: No dirmngr
##[error]Process completed with exit code 2.
{code}

It is due to a buggy GnuPG. Context:
https://github.com/sbt/website/pull/825
https://github.com/sbt/sbt/issues/4261
https://github.com/microsoft/WSL/issues/3286




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive

2019-11-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20202:
--
Target Version/s: 3.1.0

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive

2019-11-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20202:
--
Affects Version/s: 3.0.0
   2.2.3
   2.3.4
   2.4.4

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)

2019-11-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977884#comment-16977884
 ] 

Dongjoon Hyun commented on SPARK-29935:
---

Great!

> Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
> --
>
> Key: SPARK-29935
> URL: https://issues.apache.org/jira/browse/SPARK-29935
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Minor
>
> The following dashboard has 6 jobs.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> Those 6 jobs are a subset of GitHub Actions now, so we can save our Jenkins 
> computing resources and reduce our maintenance effort.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-19 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-29691:


Assignee: John Bauer

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Assignee: John Bauer
>Priority: Minor
> Fix For: 3.0.0
>
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29691) Estimator fit method fails to copy params (in PySpark)

2019-11-19 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-29691.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26527
[https://github.com/apache/spark/pull/26527]

> Estimator fit method fails to copy params (in PySpark)
> --
>
> Key: SPARK-29691
> URL: https://issues.apache.org/jira/browse/SPARK-29691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: John Bauer
>Priority: Minor
> Fix For: 3.0.0
>
>
> Estimator `fit` method is supposed to copy a dictionary of params, 
> overwriting the estimator's previous values, before fitting the model. 
> However, the parameter values are not updated.  This was observed in PySpark, 
> but may be present in the Java objects, as the PySpark code appears to be 
> functioning correctly.   (The copy method that interacts with Java is 
> actually implemented in Params.)
> For example, this prints
> Before: 0.8
> After: 0.8
> but After should be 0.75
> {code:python}
> from pyspark.ml.classification import LogisticRegression
> # Load training data
> training = spark \
> .read \
> .format("libsvm") \
> .load("data/mllib/sample_multiclass_classification_data.txt")
> lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
> print("Before:", lr.getOrDefault("elasticNetParam"))
> # Fit the model, but with an updated parameter setting:
> lrModel = lr.fit(training, params={"elasticNetParam" : 0.75})
> print("After:", lr.getOrDefault("elasticNetParam"))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)

2019-11-19 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-29935.
-
Resolution: Fixed

> Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
> --
>
> Key: SPARK-29935
> URL: https://issues.apache.org/jira/browse/SPARK-29935
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Minor
>
> The following dashboard has 6 jobs.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> Those 6 jobs are a subset of GitHub Actions now, so we can save our Jenkins 
> computing resources and reduce our maintenance effort.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)

2019-11-19 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977877#comment-16977877
 ] 

Shane Knapp commented on SPARK-29935:
-

ok, jobs and configs deleted!

> Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
> --
>
> Key: SPARK-29935
> URL: https://issues.apache.org/jira/browse/SPARK-29935
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Minor
>
> The following dashboard has 6 jobs.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> Those 6 jobs are a subset of GitHub Actions now, so we can save our Jenkins 
> computing resources and reduce our maintenance effort.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29963) Check formatting timestamps up to microsecond precision by JSON/CSV datasource

2019-11-19 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29963:
--

 Summary: Check formatting timestamps up to microsecond precision 
by JSON/CSV datasource
 Key: SPARK-29963
 URL: https://issues.apache.org/jira/browse/SPARK-29963
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Port tests added for 2.4 by the commit: 
https://github.com/apache/spark/commit/47cb1f359af62383e24198dbbaa0b4503348cd04



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)

2019-11-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977761#comment-16977761
 ] 

Dongjoon Hyun commented on SPARK-29935:
---

Thank you, [~shaneknapp]. 

> Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
> --
>
> Key: SPARK-29935
> URL: https://issues.apache.org/jira/browse/SPARK-29935
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Minor
>
> The following dashboard has 6 jobs.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> Those 6 jobs are a subset of GitHub Actions now, so we can save our Jenkins 
> computing resources and reduce our maintenance effort.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2019-11-19 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977749#comment-16977749
 ] 

Bryan Cutler commented on SPARK-29748:
--

[~zero323] and [~jhereth] this is targeted for Spark 3.0 and I agree, the 
behavior of Row should be very well defined to avoid any further confusion.

bq. Introducing {{LegacyRow}} seems to make little sense if implementation of 
{{Row}} stays the same otherwise. Sorting or not, depending on the config, 
should be enough.

LegacyRow isn't meant to be public and the user will not be aware of it. The 
reasons for it are to separate different implementations and make for a clean 
removal in the future without affecting the standard Row class. Having a 
separate implementation will make it easier to debug and diagnose problems - I 
don't want to get into a situation where a Row could sort fields or not, and 
then get bug reports without knowing which way it was configured.

bq. I don't think we should introduce such behavior now, when 3.5 is 
deprecated. Having yet another way to initialize Row will be confusing at best 

That's reasonable. I'm not crazy about an option for OrderedDict as input, but 
I think users of Python < 3.6 should have a way to create a Row with ordered 
fields other than the 2-step process in the pydoc. We can explore other options 
for this.

bq. Make legacy behavior the only option for Python < 3.6.

I don't think we should have 2 very different behaviors that are chosen based 
on your Python version. The user should be aware of what is happening and needs 
to make the decision to use the legacy sorting. Some users will not know this, 
then upgrade their Python version and see Rows breaking. We should allow users 
with Python < 3.6 to make Rows with ordered fields and then be able to upgrade 
Python version without breaking their Spark app.

bq. For Python 3.6 let's introduce legacy sorting mechanism (keeping only 
single Row) class, enabled by default and deprecated.

Yeah, I'm not sure if we should enable the legacy sorting as default or not, 
what do others think?
 

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields 
> to be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the Fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.
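
As a quick illustration of the current behavior and of the factory workaround mentioned above, a minimal PySpark sketch (the commented output reflects the pre-3.0 sorting behavior):

{code:python}
from pyspark.sql import Row

# Keyword arguments: fields are currently sorted alphabetically,
# so the resulting order may not match a schema defined elsewhere.
r = Row(name="Alice", age=11)
print(r)                      # Row(age=11, name='Alice') -- note the reordering

# The 2-step factory form keeps the declared field order and
# works on any Python version.
Person = Row("name", "age")
print(Person("Alice", 11))    # Row(name='Alice', age=11)
{code}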



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)

2019-11-19 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp reassigned SPARK-29935:
---

Assignee: Shane Knapp

> Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
> --
>
> Key: SPARK-29935
> URL: https://issues.apache.org/jira/browse/SPARK-29935
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Minor
>
> The following dashboard has 6 jobs.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> Those 6 jobs are a subset of GitHub Actions now, so we can save our Jenkins 
> computing resources and reduce our maintenance effort.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29935) Remove `Spark QA Compile` Jenkins Dashboard (and jobs)

2019-11-19 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977740#comment-16977740
 ] 

Shane Knapp commented on SPARK-29935:
-

these jobs are actually pretty cheap, resource-wise...  but i am *always* down 
for paring down the number and types of jenkins jobs.  i'll try and get to this 
later today or this week when i have some spare time @ kubecon.

> Remove `Spark QA Compile` Jenkins Dashboard (and jobs)
> --
>
> Key: SPARK-29935
> URL: https://issues.apache.org/jira/browse/SPARK-29935
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> The following dashboard has 6 jobs.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
> Those 6 jobs are a subset of GitHub Actions now, so we can save our Jenkins 
> computing resources and reduce our maintenance effort.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.6/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.4-lint/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29906) Reading of csv file fails with adaptive execution turned on

2019-11-19 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-29906.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Reading of csv file fails with adaptive execution turned on
> ---
>
> Key: SPARK-29906
> URL: https://issues.apache.org/jira/browse/SPARK-29906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: build from master today nov 14
> commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, 
> upstream/master, upstream/HEAD)
> Author: Kevin Yu 
> Date:   Thu Nov 14 14:58:32 2019 -0600
> build using:
> $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn
> deployed on AWS EMR 5.28 with 10 m5.xlarge slaves 
> in spark-env.sh:
> HADOOP_CONF_DIR=/etc/hadoop/conf
> in spark-defaults.conf:
> spark.master yarn
> spark.submit.deployMode client
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.hadoop.yarn.timeline-service.enabled false
> spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.driver.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
> spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.executor.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.0
>
>
> we observed an issue where Spark seems to mistake a data line (not the first 
> line of the csv file) for the csv header when it creates the schema.
> {code}
> $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP
> $ unzip PGYR13_P062819.ZIP
> $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv
> $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf 
> spark.sql.adaptive.enabled=true --num-executors 10
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1573772077642_0006).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.format("csv").option("header", 
> true).option("enforceSchema", 
> false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1)
> 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> [Stage 2:>(0 + 10) / 
> 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): 
> java.lang.IllegalArgumentException: CSV header does not conform to the schema.
>  Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, 
> Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, 
> Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, 
> Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, 
> Recipient_Primary_Business_Street_Address_Line2, Recipient_City, 
> Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, 
> Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, 
> Physician_License_State_code1, Physician_License_State_code2, 
> Physician_License_State_code3, Physician_License_State_code4, 
> Physician_License_State_code5, 
> Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, 
> Total_Amount_of_Payment_USDollars, Date_of_Payment, 
> Number_of_Payments_Included_in_Total_Amount, 
> Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, 
> City_of_Travel, State_of_Travel, Country_of_Travel, 
> Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, 
> Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, 
> Charity_Indicator, 

[jira] [Assigned] (SPARK-29906) Reading of csv file fails with adaptive execution turned on

2019-11-19 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-29906:
---

Assignee: Wenchen Fan

> Reading of csv file fails with adaptive execution turned on
> ---
>
> Key: SPARK-29906
> URL: https://issues.apache.org/jira/browse/SPARK-29906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: build from master today nov 14
> commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, 
> upstream/master, upstream/HEAD)
> Author: Kevin Yu 
> Date:   Thu Nov 14 14:58:32 2019 -0600
> build using:
> $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn
> deployed on AWS EMR 5.28 with 10 m5.xlarge slaves 
> in spark-env.sh:
> HADOOP_CONF_DIR=/etc/hadoop/conf
> in spark-defaults.conf:
> spark.master yarn
> spark.submit.deployMode client
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.hadoop.yarn.timeline-service.enabled false
> spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.driver.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
> spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.executor.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Minor
>  Labels: correctness
>
> we observed an issue where Spark seems to mistake a data line (not the first 
> line of the csv file) for the csv header when it creates the schema.
> {code}
> $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP
> $ unzip PGYR13_P062819.ZIP
> $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv
> $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf 
> spark.sql.adaptive.enabled=true --num-executors 10
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1573772077642_0006).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.format("csv").option("header", 
> true).option("enforceSchema", 
> false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1)
> 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> [Stage 2:>(0 + 10) / 
> 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): 
> java.lang.IllegalArgumentException: CSV header does not conform to the schema.
>  Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, 
> Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, 
> Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, 
> Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, 
> Recipient_Primary_Business_Street_Address_Line2, Recipient_City, 
> Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, 
> Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, 
> Physician_License_State_code1, Physician_License_State_code2, 
> Physician_License_State_code3, Physician_License_State_code4, 
> Physician_License_State_code5, 
> Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, 
> Total_Amount_of_Payment_USDollars, Date_of_Payment, 
> Number_of_Payments_Included_in_Total_Amount, 
> Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, 
> City_of_Travel, State_of_Travel, Country_of_Travel, 
> Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, 
> Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, 
> Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, 
> 

[jira] [Commented] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2019-11-19 Thread YoungGyu Chun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977675#comment-16977675
 ] 

YoungGyu Chun commented on SPARK-29927:
---

I am working on this :)

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, `unix_timestamp` 
> functions use SimpleDateFormat to parse strings to timestamps. 
> SimpleDateFormat can parse only to millisecond precision when a user 
> specifies `SSS` in a pattern. The ticket aims to support parsing up to 
> microsecond precision.
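
A minimal PySpark sketch of the kind of call affected; the exact result on 2.4 is not reproduced here, the point is that the fraction beyond milliseconds is not interpreted as microseconds:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1).select(
    F.to_timestamp(F.lit("2019-11-19 08:30:00.123456"),
                   "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("ts"))

# With the SimpleDateFormat-based parser, the 6-digit fraction is not
# treated as microseconds, so precision beyond milliseconds is lost or
# mis-parsed; the ticket proposes true microsecond support.
df.show(truncate=False)
{code}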



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2019-11-19 Thread YoungGyu Chun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YoungGyu Chun updated SPARK-29927:
--
Comment: was deleted

(was: Hi [~maxgekk], 

I am working on this :))

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, `unix_timestamp` 
> functions use SimpleDateFormat to parse strings to timestamps. 
> SimpleDateFormat can parse only to millisecond precision when a user 
> specifies `SSS` in a pattern. The ticket aims to support parsing up to 
> microsecond precision.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2019-11-19 Thread YoungGyu Chun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977611#comment-16977611
 ] 

YoungGyu Chun commented on SPARK-29927:
---

Hi [~maxgekk], 

I am working on this :)

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, `unix_timestamp` 
> functions use SimpleDateFormat to parse strings to timestamps. 
> SimpleDateFormat can parse only to millisecond precision when a user 
> specifies `SSS` in a pattern. The ticket aims to support parsing up to 
> microsecond precision.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29209) Print build environment variables to Github

2019-11-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-29209.
-
Resolution: Won't Fix

> Print build environment variables to Github
> ---
>
> Key: SPARK-29209
> URL: https://issues.apache.org/jira/browse/SPARK-29209
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Make it print the AMPLAB_JENKINS_BUILD_TOOL, 
> AMPLAB_JENKINS_BUILD_PROFILE and JAVA_HOME environment variables to GitHub once 
> the test has finished.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29946) Special serialization for certain key types of Map type in JacksonGenerator

2019-11-19 Thread YoungGyu Chun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976921#comment-16976921
 ] 

YoungGyu Chun edited comment on SPARK-29946 at 11/19/19 1:46 PM:
-

Hi [~viirya]

I am trying to sort out this issue. What I have found so far is that the MapType is 
human-readable via toString, but the UnsafeRow is not. I want to make 
sure what I found is correct:

!image-2019-11-18-16-54-46-341.png!


was (Author: younggyuchun):
Hi [~viirya]

 

I am trying to sort out this issue. What I found so far is that the mapType is 
human-readable by calling toString but the UnsafeRow doesn't. I want to make 
sure what I found is correct:

!image-2019-11-18-16-54-46-341.png!

> Special serialization for certain key types of Map type in JacksonGenerator
> ---
>
> Key: SPARK-29946
> URL: https://issues.apache.org/jira/browse/SPARK-29946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Priority: Major
> Attachments: image-2019-11-18-16-54-46-341.png
>
>
> Currently, when JacksonGenerator serializes MapType to JSON, the map key is 
> serialized by calling the key's toString(). For some types, like UnsafeRow, 
> the toString output is not human readable, so the map key in the JSON is not 
> very useful for those types.
> We should do special serialization for certain key types of Map.
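
A small PySpark snippet that exercises this path; the exact rendered key is omitted, the point is that a struct key (internally an UnsafeRow) is emitted via toString rather than as readable JSON:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# A map column whose key is a struct.
df = spark.range(1).select(
    F.create_map(
        F.struct(F.lit(1).alias("a"), F.lit(2).alias("b")),
        F.lit("value")).alias("m"))

# to_json renders the map key by calling toString() on it, which for a
# struct key is not a human-readable representation.
df.select(F.to_json(F.col("m")).alias("json")).show(truncate=False)
{code}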



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29926) interval `1. second` should be invalid as PostgreSQL

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29926:
---

Assignee: Kent Yao

> interval `1. second` should be invalid as PostgreSQL
> 
>
> Key: SPARK-29926
> URL: https://issues.apache.org/jira/browse/SPARK-29926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> Spark 
> {code:sql}
> -- !query 134
> select interval '1. second'
> -- !query 134 schema
> struct<1 seconds:interval>
> -- !query 134 output
> 1 seconds
> -- !query 135
> select cast('1. second' as interval)
> -- !query 135 schema
> struct
> -- !query 135 output
> 1 seconds
> {code}
> PostgreSQL
> {code:sql}
> postgres=# select interval '1. seconds';
> ERROR:  invalid input syntax for type interval: "1. seconds"
> LINE 1: select interval '1. seconds';
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29926) interval `1. second` should be invalid as PostgreSQL

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29926.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26573
[https://github.com/apache/spark/pull/26573]

> interval `1. second` should be invalid as PostgreSQL
> 
>
> Key: SPARK-29926
> URL: https://issues.apache.org/jira/browse/SPARK-29926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark 
> {code:sql}
> -- !query 134
> select interval '1. second'
> -- !query 134 schema
> struct<1 seconds:interval>
> -- !query 134 output
> 1 seconds
> -- !query 135
> select cast('1. second' as interval)
> -- !query 135 schema
> struct
> -- !query 135 output
> 1 seconds
> {code}
> PostgreSQL
> {code:sql}
> postgres=# select interval '1. seconds';
> ERROR:  invalid input syntax for type interval: "1. seconds"
> LINE 1: select interval '1. seconds';
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29962) Avoid changing merge join to broadcast join if one side is non-shuffle and the other side can be broadcast

2019-11-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29962:

Description: 
It seems SortMergeJoin is faster than BroadcastJoin if one side is non-shuffle and 
the other side can be broadcast:

https://issues.apache.org/jira/secure/attachment/12985671/BroadcastJoin.jpeg
https://issues.apache.org/jira/secure/attachment/12985670/SortMergeJoin.jpeg

> Avoid changing merge join to broadcast join if one side is non-shuffle and 
> the other side can be broadcast
> --
>
> Key: SPARK-29962
> URL: https://issues.apache.org/jira/browse/SPARK-29962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> It seems SortMergeJoin is faster than BroadcastJoin if one side is non-shuffle 
> and the other side can be broadcast:
> https://issues.apache.org/jira/secure/attachment/12985671/BroadcastJoin.jpeg
> https://issues.apache.org/jira/secure/attachment/12985670/SortMergeJoin.jpeg



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29962) Avoid changing merge join to broadcast join if one side is non-shuffle and the other side can be broadcast

2019-11-19 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29962:
---

 Summary: Avoid changing merge join to broadcast join if one side 
is non-shuffle and the other side can be broadcast
 Key: SPARK-29962
 URL: https://issues.apache.org/jira/browse/SPARK-29962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29961) Implement typeof builtin function

2019-11-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-29961:


 Summary: Implement typeof builtin function
 Key: SPARK-29961
 URL: https://issues.apache.org/jira/browse/SPARK-29961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


Add a typeof function in Spark to illustrate the underlying type of a value.
{code:sql}
-- !query 0
select typeof(1)
-- !query 0 schema
struct
-- !query 0 output
int


-- !query 1
select typeof(1.2)
-- !query 1 schema
struct
-- !query 1 output
decimal(2,1)


-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct
-- !query 2 output
array


-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)
{code}

presto
{code:sql}
presto> select typeof(array[1]);
 _col0

 array(integer)
(1 row)
{code}

PostgreSQL

{code:sql}
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
 pg_typeof
---
 numeric
 numeric
 numeric
(3 rows)
{code}






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29893.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26516
[https://github.com/apache/spark/pull/26516]

> Improve the local reader performance by changing the task number from 1 to 
> multi
> 
>
> Key: SPARK-29893
> URL: https://issues.apache.org/jira/browse/SPARK-29893
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the local reader reads all the partitions of the map stage using only 1 
> task, which may cause performance degradation. This PR will improve the 
> performance by using multiple tasks instead of one task.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29893:
---

Assignee: Ke Jia

> Improve the local reader performance by changing the task number from 1 to 
> multi
> 
>
> Key: SPARK-29893
> URL: https://issues.apache.org/jira/browse/SPARK-29893
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> Currently, the local reader reads all the partitions of the map stage using only 1 
> task, which may cause performance degradation. This PR will improve the 
> performance by using multiple tasks instead of one task.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29960) MulticlassClassificationEvaluator support hammingLoss

2019-11-19 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29960:


 Summary: MulticlassClassificationEvaluator support hammingLoss
 Key: SPARK-29960
 URL: https://issues.apache.org/jira/browse/SPARK-29960
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


MulticlassClassificationEvaluator should support hammingLoss, like Scikit-learn
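
For reference, the proposed usage would look roughly like this in PySpark; "hammingLoss" is the metric name this ticket asks for and is an assumption here, following the Scikit-learn naming:

{code:python}
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)],
    ["prediction", "label"])

# "hammingLoss" is the proposed metric; existing metricName values
# include "f1", "accuracy", "weightedPrecision", "weightedRecall".
evaluator = MulticlassClassificationEvaluator(
    predictionCol="prediction", labelCol="label", metricName="hammingLoss")
print(evaluator.evaluate(df))  # fraction of wrongly predicted labels
{code}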



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29959) Summarizer support more metrics

2019-11-19 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29959:


 Summary: Summarizer support more metrics
 Key: SPARK-29959
 URL: https://issues.apache.org/jira/browse/SPARK-29959
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


Summarizer should support more metrics: sum, sumL2, weightSum, numFeatures, std
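
For context, a minimal sketch of how Summarizer metrics are requested today; the metric names the ticket proposes (sum, sumL2, weightSum, numFeatures, std) would extend the accepted list and are assumptions here:

{code:python}
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0), 1.0),
     (Vectors.dense(3.0, 4.0), 2.0)],
    ["features", "weight"])

# Metrics available today include mean, variance, count, numNonZeros,
# max, min, normL1, normL2; the ticket would add sum, sumL2, weightSum,
# numFeatures and std to this list.
summary = Summarizer.metrics("mean", "count").summary(df.features, df.weight)
df.select(summary.alias("stats")).show(truncate=False)
{code}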



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29949) JSON/CSV formats timestamps incorrectly

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29949.
-
Fix Version/s: 2.4.5
   Resolution: Fixed

Issue resolved by pull request 26582
[https://github.com/apache/spark/pull/26582]

> JSON/CSV formats timestamps incorrectly
> ---
>
> Key: SPARK-29949
> URL: https://issues.apache.org/jira/browse/SPARK-29949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.5
>
>
> For example:
> {code}
> scala> val t = java.sql.Timestamp.valueOf("2019-11-18 11:56:00.123456")
> t: java.sql.Timestamp = 2019-11-18 11:56:00.123456
> scala> Seq(t).toDF("t").select(to_json(struct($"t"), Map("timestampFormat" -> 
> "yyyy-MM-dd HH:mm:ss.SSSSSS"))).show(false)
> +-------------------------------------------------+
> |structstojson(named_struct(NamePlaceholder(), t))|
> +-------------------------------------------------+
> |{"t":"2019-11-18 11:56:00.000123"}               |
> +-------------------------------------------------+
> {code}
> The expected fractional part is .123456, but it is written out as .000123.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29949) JSON/CSV formats timestamps incorrectly

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29949:
---

Assignee: Maxim Gekk

> JSON/CSV formats timestamps incorrectly
> ---
>
> Key: SPARK-29949
> URL: https://issues.apache.org/jira/browse/SPARK-29949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> For example:
> {code}
> scala> val t = java.sql.Timestamp.valueOf("2019-11-18 11:56:00.123456")
> t: java.sql.Timestamp = 2019-11-18 11:56:00.123456
> scala> Seq(t).toDF("t").select(to_json(struct($"t"), Map("timestampFormat" -> 
> "yyyy-MM-dd HH:mm:ss.SSSSSS"))).show(false)
> +-------------------------------------------------+
> |structstojson(named_struct(NamePlaceholder(), t))|
> +-------------------------------------------------+
> |{"t":"2019-11-18 11:56:00.000123"}               |
> +-------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29958) Backtick does not work when used with STORED AS ORC

2019-11-19 Thread Chano Kim (Jira)
Chano Kim created SPARK-29958:
-

 Summary: Backtick does not work when used with STORED AS ORC
 Key: SPARK-29958
 URL: https://issues.apache.org/jira/browse/SPARK-29958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chano Kim


The column-name escape back-tick (`) does not work when the ORC format is 
specified in a CREATE TABLE statement.

 

*This works:*
{code:sql}
spark-sql> CREATE TABLE test01 (`my-column` STRING);
{code}
*This does not work:*
{code:sql}
spark-sql> CREATE TABLE test02 (`my-column` STRING) STORED AS ORC;
Error in query: Column name "my-column" contains invalid character(s). Please 
use alias to rename it.; 
{code}
In the example above, the back-tick is used to escape the dash ({{-}}) character 
in the column name, but the escape does not work when {{STORED AS ORC}} is 
specified. This must be a bug.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29553) This problemis about using native BLAS to improvement ML/MLLIB performance

2019-11-19 Thread WuZeyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977236#comment-16977236
 ] 

WuZeyi edited comment on SPARK-29553 at 11/19/19 8:26 AM:
--

Thanks for looking.

I modified spark-env.sh, and the flame graph looks like this:

!image-2019-11-19-16-11-43-130.png!

Then I modified spark.conf, and the flame graph looks like this:

!image-2019-11-19-16-13-30-723.png!

This shows that OpenBLAS is still multi-threaded when I only modify spark-env.sh.

If I modify spark.conf to set 
{color:#ff}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color}, it works and the 
performance improves.

IMHO, spark-env.sh only sets the environment of the spark-submit process; it 
does not affect the executor processes.


was (Author: zeyiii):
Thanks for Looking.

I modify spark-env.sh,and the flame graph is like this:

!image-2019-11-19-16-11-43-130.png!

Then I modify {color:#172b4d}spark.conf, and the flame graph is this:{color}

{color:#172b4d}!image-2019-11-19-16-13-30-723.png!{color}

It means that it is still multi-thread if I modify spark-env.sh.

If I modify spark.conf to set  
{color:#ff}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color},it works and the 
the performance improve.

IMHO, spark-env,sh only set the evn of the spark-sumbit process, but it doesn't 
work in the executor processes.

> This problemis about using native BLAS to improvement ML/MLLIB performance
> --
>
> Key: SPARK-29553
> URL: https://issues.apache.org/jira/browse/SPARK-29553
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.4.4
>Reporter: WuZeyi
>Priority: Minor
>  Labels: performance
> Attachments: image-2019-11-19-16-11-43-130.png, 
> image-2019-11-19-16-13-30-723.png
>
>
> I use {color:#ff}native BLAS{color} to improve ML/MLLIB performance on YARN.
> The file {color:#ff}spark-env.sh{color}, which was modified by SPARK-21305, 
> says that I should set {color:#ff}OPENBLAS_NUM_THREADS=1{color} to disable 
> multi-threading of OpenBLAS, but it does not take effect.
> I modify {color:#ff}spark.conf{color} to set 
> {color:#FF}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color}, and the 
> performance improves.
> I think MKL_NUM_THREADS behaves the same way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29957) Bump MiniKdc to 3.2.0

2019-11-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29957:
--
Description: 
MiniKdc versions lower than the Hadoop 3.0 line do not work well on JDK 11.
New encryption types, aes128-cts-hmac-sha256-128 and 
aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by default 
in Java 11, while the MiniKdc version below 3.0.0 used by Spark does not 
support these encryption types and does not work when they are enabled, which 
results in authentication failures.

  was:
ince MiniKdc version lower than hadoop-3.0 can't work well in jdk11.
New encryption types of es128-cts-hmac-sha256-128 and 
aes256-cts-hmac-sha384-192 (for Kerberos 5) enabled by default were added in 
Java 11, while version of MiniKdc under 3.0.0 used by Spark does not support 
these encryption types and does not work well when these encryption types are 
enabled, which results in the authentication failure.


> Bump MiniKdc to 3.2.0
> -
>
> Key: SPARK-29957
> URL: https://issues.apache.org/jira/browse/SPARK-29957
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> MiniKdc versions lower than the Hadoop 3.0 line do not work well on JDK 11.
> New encryption types, aes128-cts-hmac-sha256-128 and 
> aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by default 
> in Java 11, while the MiniKdc version below 3.0.0 used by Spark does not 
> support these encryption types and does not work when they are enabled, which 
> results in authentication failures.
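
For illustration only, a hedged sketch in sbt coordinates of what the bump amounts to; Spark manages this dependency through its Maven poms, and the exact module and scope here are assumptions:

{code:scala}
// hadoop-minikdc from the Hadoop 3.x line supports the Kerberos encryption
// types that JDK 11 enables by default.
libraryDependencies += "org.apache.hadoop" % "hadoop-minikdc" % "3.2.0" % Test
{code}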



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29957) Bump MiniKdc to 3.2.0

2019-11-19 Thread angerszhu (Jira)
angerszhu created SPARK-29957:
-

 Summary: Bump MiniKdc to 3.2.0
 Key: SPARK-29957
 URL: https://issues.apache.org/jira/browse/SPARK-29957
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.0.0
Reporter: angerszhu


MiniKdc versions lower than the Hadoop 3.0 line do not work well on JDK 11.
New encryption types, aes128-cts-hmac-sha256-128 and 
aes256-cts-hmac-sha384-192 (for Kerberos 5), were added and enabled by default 
in Java 11, while the MiniKdc version below 3.0.0 used by Spark does not 
support these encryption types and does not work when they are enabled, which 
results in authentication failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29553) This problemis about using native BLAS to improvement ML/MLLIB performance

2019-11-19 Thread WuZeyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977236#comment-16977236
 ] 

WuZeyi commented on SPARK-29553:


Thanks for looking.

I modified spark-env.sh, and the flame graph looks like this:

!image-2019-11-19-16-11-43-130.png!

Then I modified spark.conf, and the flame graph looks like this:

!image-2019-11-19-16-13-30-723.png!

This shows that OpenBLAS is still multi-threaded when I only modify spark-env.sh.

If I modify spark.conf to set 
{color:#ff}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color}, it works and the 
performance improves.

IMHO, spark-env.sh only sets the environment of the spark-submit process; it 
does not affect the executor processes.
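
For reference, a minimal sketch of the workaround described above, passing the environment variable to the executors through Spark configuration rather than spark-env.sh (the app name and the MKL line are illustrative assumptions):

{code:scala}
import org.apache.spark.sql.SparkSession

// spark.executorEnv.* entries are exported into the executors' environment,
// which is where the OpenBLAS thread limit needs to take effect.
val spark = SparkSession.builder()
  .appName("native-blas-single-threaded")
  .config("spark.executorEnv.OPENBLAS_NUM_THREADS", "1")
  .config("spark.executorEnv.MKL_NUM_THREADS", "1")  // analogous setting for MKL
  .getOrCreate()
{code}

The same values can also be supplied with --conf on spark-submit or in spark-defaults.conf.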

> This problemis about using native BLAS to improvement ML/MLLIB performance
> --
>
> Key: SPARK-29553
> URL: https://issues.apache.org/jira/browse/SPARK-29553
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.4.4
>Reporter: WuZeyi
>Priority: Minor
>  Labels: performance
> Attachments: image-2019-11-19-16-11-43-130.png, 
> image-2019-11-19-16-13-30-723.png
>
>
> I use {color:#ff}native BLAS{color} to improve ML/MLLIB performance on YARN.
> The file {color:#ff}spark-env.sh{color}, which was modified by SPARK-21305, 
> says that I should set {color:#ff}OPENBLAS_NUM_THREADS=1{color} to disable 
> multi-threading of OpenBLAS, but it does not take effect.
> I modify {color:#ff}spark.conf{color} to set 
> {color:#FF}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color}, and the 
> performance improves.
> I think MKL_NUM_THREADS behaves the same way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29918.
-
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26548
[https://github.com/apache/spark/pull/26548]

> RecordBinaryComparator should check endianness when compared by long
> 
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> If the architecture supports unaligned access or the offset is 8-byte 
> aligned, RecordBinaryComparator compares 8 bytes at a time by reading them 
> as a long; otherwise it compares byte by byte.
> However, on a little-endian machine the result of comparing as a long and 
> comparing byte by byte can differ (see the sketch below). If the 
> architectures in a YARN cluster differ (some are unaligned-access capable 
> while others are not), the order of two records after sorting is 
> undetermined, which results in the same problem as in 
> https://issues.apache.org/jira/browse/SPARK-23207
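
To make the little-endian point concrete, here is a self-contained sketch (not Spark's actual comparator) showing two 8-byte records whose order flips depending on whether they are compared byte by byte or as a single little-endian long:

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 2)
val b = Array[Byte](2, 0, 0, 0, 0, 0, 0, 1)

// Byte-by-byte (unsigned, lexicographic): the first differing byte decides, so a < b.
val byteWise = a.zip(b).map { case (x, y) => (x & 0xFF) - (y & 0xFF) }.find(_ != 0).getOrElse(0)

// Read as a single little-endian long (the fast path on x86): now a > b,
// because the last byte becomes the most significant one.
def asLong(bytes: Array[Byte]): Long =
  ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong
val longWise = java.lang.Long.compareUnsigned(asLong(a), asLong(b))

println(s"byte-by-byte: $byteWise, as little-endian long: $longWise")  // negative vs. positive
{code}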



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long

2019-11-19 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29918:
---

Assignee: EdisonWang

> RecordBinaryComparator should check endianness when compared by long
> 
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
>  Labels: correctness
>
> If the architecture supports unaligned access or the offset is 8-byte 
> aligned, RecordBinaryComparator compares 8 bytes at a time by reading them 
> as a long; otherwise it compares byte by byte.
> However, on a little-endian machine the result of comparing as a long and 
> comparing byte by byte can differ. If the architectures in a YARN cluster 
> differ (some are unaligned-access capable while others are not), the order 
> of two records after sorting is undetermined, which results in the same 
> problem as in https://issues.apache.org/jira/browse/SPARK-23207



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29553) This problemis about using native BLAS to improvement ML/MLLIB performance

2019-11-19 Thread WuZeyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WuZeyi updated SPARK-29553:
---
Attachment: image-2019-11-19-16-13-30-723.png

> This problemis about using native BLAS to improvement ML/MLLIB performance
> --
>
> Key: SPARK-29553
> URL: https://issues.apache.org/jira/browse/SPARK-29553
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.4.4
>Reporter: WuZeyi
>Priority: Minor
>  Labels: performance
> Attachments: image-2019-11-19-16-11-43-130.png, 
> image-2019-11-19-16-13-30-723.png
>
>
> I use {color:#ff}native BLAS{color} to improve ML/MLLIB performance on YARN.
> The file {color:#ff}spark-env.sh{color}, which was modified by SPARK-21305, 
> says that I should set {color:#ff}OPENBLAS_NUM_THREADS=1{color} to disable 
> multi-threading of OpenBLAS, but it does not take effect.
> I modify {color:#ff}spark.conf{color} to set 
> {color:#FF}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color}, and the 
> performance improves.
> I think MKL_NUM_THREADS behaves the same way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29553) This problemis about using native BLAS to improvement ML/MLLIB performance

2019-11-19 Thread WuZeyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WuZeyi updated SPARK-29553:
---
Attachment: image-2019-11-19-16-11-43-130.png

> This problemis about using native BLAS to improvement ML/MLLIB performance
> --
>
> Key: SPARK-29553
> URL: https://issues.apache.org/jira/browse/SPARK-29553
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.4.4
>Reporter: WuZeyi
>Priority: Minor
>  Labels: performance
> Attachments: image-2019-11-19-16-11-43-130.png
>
>
> I use {color:#ff}native BLAS{color} to improve ML/MLLIB performance on YARN.
> The file {color:#ff}spark-env.sh{color}, which was modified by SPARK-21305, 
> says that I should set {color:#ff}OPENBLAS_NUM_THREADS=1{color} to disable 
> multi-threading of OpenBLAS, but it does not take effect.
> I modify {color:#ff}spark.conf{color} to set 
> {color:#FF}spark.executorEnv.OPENBLAS_NUM_THREADS=1{color}, and the 
> performance improves.
> I think MKL_NUM_THREADS behaves the same way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org