[jira] [Updated] (SPARK-19035) rand() function in case when cause will failed
[ https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Yuan updated SPARK-19035:
--
Description:
*In this case:*
select case when a=1 then 1 else concat(a, cast(rand() as string)) end b, count(1)
from yuanfeng1_a
group by case when a=1 then 1 else concat(a, cast(rand() as string)) end;
*Throws this error:*
Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
+- MetastoreRelation default, yuanfeng1_a
The simpler query select case when a=1 then 1 else rand() end b, count(1) from yuanfeng1_a group by case when a=1 then rand() end produces the same error.
*Notice*: if rand() is replaced with 1, the query works.
was: (the same description, without the second example query)
> rand() function in case when cause will failed
> --
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Reporter: Feng Yuan
>
> (description as above)
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19035) rand() function in case when cause will failed
[ https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Yuan updated SPARK-19035:
--
Summary: rand() function in case when cause will failed (was: nested functions in case when statement will failed)
> (issue metadata and description as above)
[jira] [Updated] (SPARK-19035) nested functions in case when statement will failed
[ https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Yuan updated SPARK-19035:
--
Description: (as above; this update appended the *Notice*: line about replacing rand() with 1)
> (issue metadata and description as above)
[jira] [Updated] (SPARK-19035) nested functions in case when statement will failed
[ https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Yuan updated SPARK-19035:
--
Description: (as above; this update only bolded the *In this case:* and *Throw error:* labels)
> (issue metadata and description as above)
[jira] [Updated] (SPARK-19035) nested functions in case when statement will failed
[ https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Yuan updated SPARK-19035:
--
Description: (as above; this update dropped the spark-sql console transcript from the original report, which is preserved in the Created message below)
> (issue metadata and description as above)
[jira] [Created] (SPARK-19035) nested functions in case when statement will failed
Feng Yuan created SPARK-19035:
-
Summary: nested functions in case when statement will failed
Key: SPARK-19035
URL: https://issues.apache.org/jira/browse/SPARK-19035
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Feng Yuan

In this case:
select case when a=1 then 1 else concat(a, cast(rand() as string)) end b, count(1)
from yuanfeng1_a
group by case when a=1 then 1 else concat(a, cast(rand() as string)) end;
This throws the error:
Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
+- MetastoreRelation default, yuanfeng1_a
Console session:
spark-sql> select case when a=1 then 1 else concat(a,cast(rand() as string)) end b,count(1) from yuanfeng1_a group by case when a=1 then 1 else concat(a,cast(rand() as string)) end;
16/12/30 15:05:55 INFO execution.SparkSqlParser: Parsing command: select case when a=1 then 1 else concat(a,cast(rand() as string)) end b,count(1) from yuanfeng1_a group by case when a=1 then 1 else concat(a,cast(rand() as string)) end
16/12/30 15:05:55 INFO parser.CatalystSqlParser: Parsing command: int
Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group by, nor is it an aggregate function.
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [CASE WHEN (a#2077 = 1) THEN cast(1 as string) ELSE concat(cast(a#2077 as string), cast(rand(-8113865568189974672) as string)) END], [CASE WHEN (a#2077 = 1) THEN cast(1 as string) ELSE concat(cast(a#2077 as string), cast(rand(-824889479508647173) as string)) END AS b#2076, count(1) AS count(1)#2079L]
+- MetastoreRelation default, yuanfeng1_a
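The errors above come down to how each textual occurrence of rand() is parsed: every occurrence captures its own fresh seed (visible as rand(519367429988179997) vs. rand(8090243936131101651) in the Aggregate node), so the SELECT copy and the GROUP BY copy of the expression are not semantically equal, and the analyzer cannot match them. A minimal Python sketch of that mechanism, using hypothetical Rand/Column classes and an analyzer_accepts check that greatly simplify Catalyst's actual behavior:

```python
import itertools

_seed_counter = itertools.count()  # deterministic stand-in for fresh random seeds

class Rand:
    """Stand-in for Catalyst's Rand expression: each parsed occurrence
    captures its own fresh seed (hypothetical simplification)."""
    def __init__(self):
        self.seed = next(_seed_counter)

    def semantically_equal(self, other):
        # Two Rand expressions only match if their seeds match.
        return isinstance(other, Rand) and self.seed == other.seed

class Column:
    """Stand-in for a deterministic attribute reference."""
    def __init__(self, name):
        self.name = name

    def semantically_equal(self, other):
        return isinstance(other, Column) and self.name == other.name

def analyzer_accepts(select_exprs, group_by_exprs):
    """Mimic the analyzer rule: every non-aggregate SELECT expression
    must semantically match some GROUP BY expression."""
    return all(any(s.semantically_equal(g) for g in group_by_exprs)
               for s in select_exprs)

# Deterministic column: the SELECT and GROUP BY copies compare equal.
assert analyzer_accepts([Column("a")], [Column("a")])

# rand(): the two textual copies parse with different seeds, so the
# check fails -- mirroring the SPARK-19035 error message.
assert not analyzer_accepts([Rand()], [Rand()])
```

Consistent with the *Notice* in the report, a common workaround is to materialize the non-deterministic expression once (e.g. in a subquery) and group on the resulting column.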
[jira] [Created] (SPARK-19034) Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
Sanjay Dasgupta created SPARK-19034:
---
Summary: Download packages on 'spark.apache.org/downloads.html' contain release 2.0.2
Key: SPARK-19034
URL: https://issues.apache.org/jira/browse/SPARK-19034
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 2.1.0
Environment: All
Reporter: Sanjay Dasgupta

The download packages on 'https://spark.apache.org/downloads.html' have the right name (spark-2.1.0-bin-...) but actually contain the release 2.0.2 software.
[jira] [Commented] (SPARK-19033) HistoryServer still uses old ACLs even if ACLs are updated
[ https://issues.apache.org/jira/browse/SPARK-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15787031#comment-15787031 ] Saisai Shao commented on SPARK-19033:
-
Ping [~vanzin], I found that you made this change; would you mind explaining the purpose of doing so? Thanks very much.
> HistoryServer still uses old ACLs even if ACLs are updated
> --
>
> Key: SPARK-19033
> URL: https://issues.apache.org/jira/browse/SPARK-19033
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Saisai Shao
> Priority: Minor
>
> In the current implementation of HistoryServer, application ACLs are picked from the event log rather than from configuration:
> {code}
> val uiAclsEnabled = conf.getBoolean("spark.history.ui.acls.enable", false)
> ui.getSecurityManager.setAcls(uiAclsEnabled)
> // make sure to set admin acls before view acls so they are properly picked up
> ui.getSecurityManager.setAdminAcls(appListener.adminAcls.getOrElse(""))
> ui.getSecurityManager.setViewAcls(attempt.sparkUser, appListener.viewAcls.getOrElse(""))
> ui.getSecurityManager.setAdminAclsGroups(appListener.adminAclsGroups.getOrElse(""))
> ui.getSecurityManager.setViewAclsGroups(appListener.viewAclsGroups.getOrElse(""))
> {code}
> This becomes a problem when the ACLs are updated (e.g. an admin is newly added): only new applications are affected, while old applications still use the old ACLs, so the new admins still cannot view the logs of old applications. It is hard to say this is a bug, but in our scenario this is not the behavior we expect.
[jira] [Updated] (SPARK-19033) HistoryServer still uses old ACLs even if ACLs are updated
[ https://issues.apache.org/jira/browse/SPARK-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-19033:
--
Summary: HistoryServer still uses old ACLs even if ACLs are updated (was: HistoryServer will honor old ACLs even if ACLs are updated)
> (issue metadata and description as above)
[jira] [Created] (SPARK-19033) HistoryServer will honor old ACLs even if ACLs are updated
Saisai Shao created SPARK-19033:
---
Summary: HistoryServer will honor old ACLs even if ACLs are updated
Key: SPARK-19033
URL: https://issues.apache.org/jira/browse/SPARK-19033
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: Saisai Shao
Priority: Minor

(description as quoted in the comment above)
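The behavior the reporter wants could be sketched as merging the History Server's current configuration into the ACLs replayed from the event log, instead of letting the logged values win outright. A hedged Python sketch of that merge, with hypothetical key names (not Spark's actual API):

```python
def effective_acls(current_conf, logged_acls):
    """Merge admin ACLs from the server's current configuration with
    those replayed from the event log (hypothetical key names)."""
    merged = (set(logged_acls.get("adminAcls", "").split(","))
              | set(current_conf.get("history.admin.acls", "").split(",")))
    merged.discard("")  # drop empties produced by blank settings
    return sorted(merged)

# A newly added admin in the current config can now see old apps too,
# while admins recorded at application run time are preserved.
acls = effective_acls({"history.admin.acls": "new_admin"},
                      {"adminAcls": "old_admin"})
assert acls == ["new_admin", "old_admin"]
```

The design point is that server-side configuration is re-read at replay time, so ACL updates apply retroactively to already-completed applications.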
[jira] [Commented] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
[ https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786971#comment-15786971 ] Sameer Kumar commented on SPARK-18986:
--
Shouldn't the priority of this be increased? I am facing this issue on almost every batch interval, and the data doesn't get processed any further, which is a significant data loss for any application.
> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that an iterator is not null in the map. However, the assertion only holds after the map has been asked for its iterator. Before that, if another memory consumer asks for more memory than is currently available, {{ExternalAppendOnlyMap.forceSpill}} is also called. In this case, we see a failure like this:
> {code}
> [info] java.lang.AssertionError: assertion failed
> [info] at scala.Predef$.assert(Predef.scala:156)
> [info] at org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info] at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info] at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
> {code}
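The failure mode in the quoted description can be modeled with a toy spillable map: forceSpill asserts that a read iterator exists, but another memory consumer can force a spill before iteration starts. A minimal Python sketch of one possible guard (hypothetical names and a deliberately simplified fix, not the actual Spark patch):

```python
class ToySpillableMap:
    """Toy model of ExternalAppendOnlyMap; only the spill-guard
    logic relevant to SPARK-18986 is sketched here."""
    def __init__(self):
        self.current_map = {}
        self.read_iterator = None  # set once iterator() is called

    def iterator(self):
        self.read_iterator = iter(self.current_map.items())
        return self.read_iterator

    def force_spill(self):
        # The buggy version does `assert self.read_iterator is not None`
        # here, which crashes when another memory consumer forces a
        # spill before iteration has started.
        if self.read_iterator is None:
            # Guard instead of asserting: report nothing was spilled.
            # (A real fix might spill the in-memory map directly.)
            return False
        self.current_map = {}  # pretend the data was written to disk
        return True

m = ToySpillableMap()
assert m.force_spill() is False  # no AssertionError before iteration
m.current_map["k"] = 1
m.iterator()
assert m.force_spill() is True   # spilling works once reading started
```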
[jira] [Comment Edited] (SPARK-19032) Non-deterministic results using aggregation first across multiple workers
[ https://issues.apache.org/jira/browse/SPARK-19032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786883#comment-15786883 ] Liang-Chi Hsieh edited comment on SPARK-19032 at 12/30/16 4:50 AM:
---
I think you cannot guarantee the sort order per group in an aggregation under the current API. One workaround is the combination of repartition + sortWithinPartitions, as I mentioned in the discussion.
{code}
df.repartition($"account").sortWithinPartitions($"account", $"probability".desc).groupBy($"account").agg(first($"product"), first($"probability"))
{code}
It should work. But this is still not guaranteed by the API: if the internal implementation of aggregation changes, deterministic results can no longer be guaranteed.
was (Author: viirya): (the same comment, without {code} formatting)
> Non-deterministic results using aggregation first across multiple workers
> -
>
> Key: SPARK-19032
> URL: https://issues.apache.org/jira/browse/SPARK-19032
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 1.6.1
> Environment: Standalone Spark 1.6.1 cluster on EC2 with 2 worker nodes, one executor each.
> Reporter: Harry Weppner
>
> We've come across a situation where results aggregated using {{first}} on a sorted df are non-deterministic. Given the explanation of the plan there appears to be a plausible explanation, but it raises more questions about the usefulness of these aggregation functions in a Spark cluster.
> Here's a minimal example to reproduce:
> {code}
> val df = sc.parallelize(Seq(("a","prod1",0.6),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4))).toDF("account","product","probability")
> var p = df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).show();
> +-------+----------------+--------------------+
> |account|first(product)()|first(probability)()|
> +-------+----------------+--------------------+
> |      a|           prod1|                 0.6|
> +-------+----------------+--------------------+
> p: Unit = ()
> // Repeat and notice that the result will occasionally be different
> +-------+----------------+--------------------+
> |account|first(product)()|first(probability)()|
> +-------+----------------+--------------------+
> |      a|           prod2|                 0.4|
> +-------+----------------+--------------------+
> p: Unit = ()
> scala> df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).explain(true);
> == Parsed Logical Plan ==
> 'Aggregate ['account], [unresolvedalias('account),(first('product)(),mode=Complete,isDistinct=false) AS first(product)()#523,(first('probability)(),mode=Complete,isDistinct=false) AS first(probability)()#524]
> +- Sort [probability#5 DESC], true
>    +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5]
>       +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at :27
> == Analyzed Logical Plan ==
> account: string, first(product)(): string, first(probability)(): double
> Aggregate [account#3], [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) AS first(probability)()#524]
> +- Sort [probability#5 DESC], true
>    +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5]
>       +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at :27
> == Optimized Logical Plan ==
> Aggregate [account#3], [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) AS first(probability)()#524]
> +- Sort [probability#5 DESC], true
>    +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5]
>       +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at :27
> == Physical Plan ==
> SortBasedAggregate(key=[account#3], functions=[(first(product#4)(),mode=Final,isDistinct=false),(first(probability#5)(),mode=Final,isDistinct=false)], output=[account#3,first(product)()#523,first(probability)()#524])
> +- ConvertToSafe
>    +- Sort [account#3 ASC], false, 0
>       +-
[jira] [Commented] (SPARK-19032) Non-deterministic results using aggregation first across multiple workers
[ https://issues.apache.org/jira/browse/SPARK-19032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786883#comment-15786883 ] Liang-Chi Hsieh commented on SPARK-19032: - I think you can not guarantee the sort order per group in an aggregation under the current API. One workaround is the combination of repartition + sortWithinPartitions as I mentioned in the discussion. df.repartition($"account").sortWithinPartitions($"account", $"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")) It should work. But this is still not guaranteed by the API. If the internal implementation of aggregation is changed, then it can't guarantee deterministic results again. > Non-deterministic results using aggregation first across multiple workers > - > > Key: SPARK-19032 > URL: https://issues.apache.org/jira/browse/SPARK-19032 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.1 > Environment: Standalone Spark 1.6.1 cluster on EC2 with 2 worker > nodes, one executor each. >Reporter: Harry Weppner > > We've come across a situation results aggregated using {{first}} on a sorted > df are non-deterministic. Given the explanation for the plan there appears to > be a plausible explanation but creates more question on the usefulness of > these aggregation functions in a spark cluster. 
> Here's a minimal example to reproduce: > {code} > val df = > sc.parallelize(Seq(("a","prod1",0.6),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4))).toDF("account","product","probability") > var p = > df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).show(); > +---+++ > |account|first(product)()|first(probability)()| > +---+++ > | a| prod1| 0.6| > +---+++ > p: Unit = () > // Repeat and notice that result will occasionally be different > +---+++ > |account|first(product)()|first(probability)()| > +---+++ > | a| prod2| 0.4| > +---+++ > p: Unit = () > scala> > df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).explain(true); > == Parsed Logical Plan == > 'Aggregate ['account], > [unresolvedalias('account),(first('product)(),mode=Complete,isDistinct=false) > AS > first(product)()#523,(first('probability)(),mode=Complete,isDistinct=false) > AS first(probability)()#524] > +- Sort [probability#5 DESC], true >+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] > +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at > rddToDataFrameHolder at :27 > == Analyzed Logical Plan == > account: string, first(product)(): string, first(probability)(): double > Aggregate [account#3], > [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS > first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) > AS first(probability)()#524] > +- Sort [probability#5 DESC], true >+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] > +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at > rddToDataFrameHolder at :27 > == Optimized Logical Plan == > Aggregate [account#3], > [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS > first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) > AS first(probability)()#524] > +- Sort [probability#5 DESC], true >+- Project [_1#0 AS account#3,_2#1 AS 
product#4,_3#2 AS probability#5] > +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at > rddToDataFrameHolder at :27 > == Physical Plan == > SortBasedAggregate(key=[account#3], > functions=[(first(product#4)(),mode=Final,isDistinct=false),(first(probability#5)(),mode=Final,isDistinct=false)], > output=[account#3,first(product)()#523,first(probability)()#524]) > +- ConvertToSafe >+- Sort [account#3 ASC], false, 0 > +- TungstenExchange hashpartitioning(account#3,200), None > +- ConvertToUnsafe > +- SortBasedAggregate(key=[account#3], > functions=[(first(product#4)(),mode=Partial,isDistinct=false),(first(probability#5)(),mode=Partial,isDistinct=false)], > output=[account#3,first#532,valueSet#533,first#534,valueSet#535]) >+- ConvertToSafe > +- Sort [account#3 ASC], false, 0 > +- Sort [probability#5 DESC], true, 0 > +- ConvertToUnsafe >+- Exchange rangepartitioning(probability#5 > DESC,200), None >
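The repartition-plus-sortWithinPartitions workaround quoted in the comment above can be written out as a full spark-shell snippet (a sketch using the column names from the issue; as the comment notes, the API still does not guarantee this ordering, so it may break if the aggregation internals change):

```scala
import org.apache.spark.sql.functions.first

val df = sc.parallelize(Seq(
  ("a", "prod1", 0.6), ("a", "prod2", 0.4),
  ("a", "prod2", 0.4), ("a", "prod2", 0.4), ("a", "prod2", 0.4)
)).toDF("account", "product", "probability")

// Repartition by the grouping key so each group lands in a single partition,
// then sort within that partition before aggregating. In practice first()
// then sees the highest-probability row of each account first.
val result = df
  .repartition($"account")
  .sortWithinPartitions($"account", $"probability".desc)
  .groupBy($"account")
  .agg(first($"product"), first($"probability"))

result.show()
```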
[jira] [Commented] (SPARK-15359) Mesos dispatcher should handle DRIVER_ABORTED status from mesosDriver.run()
[ https://issues.apache.org/jira/browse/SPARK-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786878#comment-15786878 ] Devaraj K commented on SPARK-15359: --- Thanks [~yu2003w] for verifying this PR, I forgot to mention that it depends on SPARK-15288 [https://github.com/apache/spark/pull/13072] for handling the UncaughtException's, sorry for that. Can you verify this PR with the SPARK-15288 fix? > Mesos dispatcher should handle DRIVER_ABORTED status from mesosDriver.run() > --- > > Key: SPARK-15359 > URL: https://issues.apache.org/jira/browse/SPARK-15359 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Reporter: Devaraj K >Priority: Minor > > Mesos dispatcher handles DRIVER_ABORTED status for mesosDriver.run() during > the successful registration but if the mesosDriver.run() returns > DRIVER_ABORTED status after the successful register then there is no action > for the status and the thread will be terminated. > I think we need to throw the exception and shutdown the dispatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19032) Non-deterministic results using aggregation first across multiple workers
[ https://issues.apache.org/jira/browse/SPARK-19032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786824#comment-15786824 ] Liang-Chi Hsieh commented on SPARK-19032: - There is a related discussion at dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tc1.html > Non-deterministic results using aggregation first across multiple workers > - > > Key: SPARK-19032 > URL: https://issues.apache.org/jira/browse/SPARK-19032 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.1 > Environment: Standalone Spark 1.6.1 cluster on EC2 with 2 worker > nodes, one executor each. >Reporter: Harry Weppner > > We've come across a situation results aggregated using {{first}} on a sorted > df are non-deterministic. Given the explanation for the plan there appears to > be a plausible explanation but creates more question on the usefulness of > these aggregation functions in a spark cluster. > Here's a minimal example to reproduce: > {code} > val df = > sc.parallelize(Seq(("a","prod1",0.6),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4))).toDF("account","product","probability") > var p = > df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).show(); > +---+++ > |account|first(product)()|first(probability)()| > +---+++ > | a| prod1| 0.6| > +---+++ > p: Unit = () > // Repeat and notice that result will occasionally be different > +---+++ > |account|first(product)()|first(probability)()| > +---+++ > | a| prod2| 0.4| > +---+++ > p: Unit = () > scala> > df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).explain(true); > == Parsed Logical Plan == > 'Aggregate ['account], > [unresolvedalias('account),(first('product)(),mode=Complete,isDistinct=false) > AS > first(product)()#523,(first('probability)(),mode=Complete,isDistinct=false) > AS first(probability)()#524] > +- Sort 
[probability#5 DESC], true >+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] > +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at > rddToDataFrameHolder at :27 > == Analyzed Logical Plan == > account: string, first(product)(): string, first(probability)(): double > Aggregate [account#3], > [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS > first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) > AS first(probability)()#524] > +- Sort [probability#5 DESC], true >+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] > +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at > rddToDataFrameHolder at :27 > == Optimized Logical Plan == > Aggregate [account#3], > [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS > first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) > AS first(probability)()#524] > +- Sort [probability#5 DESC], true >+- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] > +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at > rddToDataFrameHolder at :27 > == Physical Plan == > SortBasedAggregate(key=[account#3], > functions=[(first(product#4)(),mode=Final,isDistinct=false),(first(probability#5)(),mode=Final,isDistinct=false)], > output=[account#3,first(product)()#523,first(probability)()#524]) > +- ConvertToSafe >+- Sort [account#3 ASC], false, 0 > +- TungstenExchange hashpartitioning(account#3,200), None > +- ConvertToUnsafe > +- SortBasedAggregate(key=[account#3], > functions=[(first(product#4)(),mode=Partial,isDistinct=false),(first(probability#5)(),mode=Partial,isDistinct=false)], > output=[account#3,first#532,valueSet#533,first#534,valueSet#535]) >+- ConvertToSafe > +- Sort [account#3 ASC], false, 0 > +- Sort [probability#5 DESC], true, 0 > +- ConvertToUnsafe >+- Exchange rangepartitioning(probability#5 > DESC,200), None > +- ConvertToSafe > +- Project [_1#0 AS account#3,_2#1 AS > product#4,_3#2 AS 
probability#5] > +- Scan ExistingRDD[_1#0,_2#1,_3#2] > {code} > My working hypothesis is that after {{TungstenExchange hashpartitioning}} the > _global_ sort order on {{probability}} is lost leading to
[jira] [Commented] (SPARK-18933) Different log output between Terminal screen and stderr file
[ https://issues.apache.org/jira/browse/SPARK-18933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786822#comment-15786822 ] Sean Wong commented on SPARK-18933: --- But there is no stderr or stdout file available for driver logs. Only executors have these two files. > Different log output between Terminal screen and stderr file > > > Key: SPARK-18933 > URL: https://issues.apache.org/jira/browse/SPARK-18933 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, Web UI >Affects Versions: 1.6.3 > Environment: Yarn mode and standalone mode >Reporter: Sean Wong > Original Estimate: 612h > Remaining Estimate: 612h > > First of all, I use the default log4j.properties in the Spark conf/ > But I found that the log output(e.g., INFO) is different between Terminal > screen and stderr File. Some INFO logs exist in both of them. Some INFO logs > exist in either of them. Why this happens? Is it supposed that the output > logs are same between the terminal screen and stderr file? > Then I did a Test. I modified the source code in SparkContext.scala and add > one line log code "logInfo("This is textFile")" in the textFile function. > However, after running an application, I found the log "This is textFile" > shown in the terminal screen. no such log in the stderr file. I am not sure > if this is a bug. So, hope you can solve this question. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15359) Mesos dispatcher should handle DRIVER_ABORTED status from mesosDriver.run()
[ https://issues.apache.org/jira/browse/SPARK-15359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786813#comment-15786813 ] Jared commented on SPARK-15359: --- Hi, I tested the fix. However, it seemed the problem still existed. I1230 11:39:07.096375 6889 sched.cpp:1223] Aborting framework 16/12/30 11:39:07 INFO MesosClusterScheduler: driver.run() returned with code DRIVER_ABORTED 16/12/30 11:39:07 ERROR MesosClusterScheduler: driver.run() failed org.apache.spark.SparkException: Error starting driver, DRIVER_ABORTED at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$$anon$1.run(MesosSchedulerUtils.scala:124) Exception in thread "MesosClusterScheduler-mesos-driver" org.apache.spark.SparkException: Error starting driver, DRIVER_ABORTED at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$$anon$1.run(MesosSchedulerUtils.scala:124) 16/12/30 11:39:07 INFO Utils: Successfully started service on port 7077. 16/12/30 11:39:07 INFO MesosRestServer: Started REST server for submitting applications on port 7077 It seemed that exceptions thrown was not handled. I think several other files should also be changed to fix this problem. > Mesos dispatcher should handle DRIVER_ABORTED status from mesosDriver.run() > --- > > Key: SPARK-15359 > URL: https://issues.apache.org/jira/browse/SPARK-15359 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Reporter: Devaraj K >Priority: Minor > > Mesos dispatcher handles DRIVER_ABORTED status for mesosDriver.run() during > the successful registration but if the mesosDriver.run() returns > DRIVER_ABORTED status after the successful register then there is no action > for the status and the thread will be terminated. > I think we need to throw the exception and shutdown the dispatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
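The "throw the exception and shut down the dispatcher" behavior under discussion could look roughly like the following. This is a hedged sketch, not the actual patch: `mesosDriver` and `markErr` stand in for the real state in `MesosSchedulerUtils`, which is more involved.

```scala
import org.apache.mesos.Protos
import org.apache.spark.SparkException

// Inside the thread that runs the Mesos scheduler driver: if run() returns
// DRIVER_ABORTED after a successful registration, fail loudly so the
// dispatcher can shut down instead of the thread terminating silently.
val ret = mesosDriver.run()
logInfo(s"driver.run() returned with code $ret")
if (ret == Protos.Status.DRIVER_ABORTED) {
  markErr()  // hypothetical helper: records the failure and wakes up waiters
  throw new SparkException(s"Error starting driver, $ret")
}
```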
[jira] [Commented] (SPARK-19026) local directories cannot be cleaned up when creating an "executor-***" directory throws an IOException (e.g., no free disk space left)
[ https://issues.apache.org/jira/browse/SPARK-19026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786773#comment-15786773 ] Apache Spark commented on SPARK-19026: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/16439 > local directories cannot be cleanuped when create directory of "executor-***" > throws IOException such as there is no more free disk space to create it etc. > --- > > Key: SPARK-19026 > URL: https://issues.apache.org/jira/browse/SPARK-19026 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 2.0.2 > Environment: linux >Reporter: zuotingbing > > i set SPARK_LOCAL_DIRS variable like this: > SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp > when there is no more free disk space on "/data4/spark/tmp" , other local > directories (/data2/spark/tmp,/data3/spark/tmp) cannot be cleanuped when my > application finished. > we should catch the IOExecption when create local dirs throws execption , > otherwise the variable "appDirectories(appId)" not be set , then local > directories "executor-***" cannot be deleted for this application. If the > number of folders "executor-***" > 32k we cannot create executor anymore on > this worker node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19026) local directories cannot be cleaned up when creating an "executor-***" directory throws an IOException (e.g., no free disk space left)
[ https://issues.apache.org/jira/browse/SPARK-19026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19026: Assignee: (was: Apache Spark) > local directories cannot be cleanuped when create directory of "executor-***" > throws IOException such as there is no more free disk space to create it etc. > --- > > Key: SPARK-19026 > URL: https://issues.apache.org/jira/browse/SPARK-19026 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 2.0.2 > Environment: linux >Reporter: zuotingbing > > i set SPARK_LOCAL_DIRS variable like this: > SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp > when there is no more free disk space on "/data4/spark/tmp" , other local > directories (/data2/spark/tmp,/data3/spark/tmp) cannot be cleanuped when my > application finished. > we should catch the IOExecption when create local dirs throws execption , > otherwise the variable "appDirectories(appId)" not be set , then local > directories "executor-***" cannot be deleted for this application. If the > number of folders "executor-***" > 32k we cannot create executor anymore on > this worker node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19026) local directories cannot be cleaned up when creating an "executor-***" directory throws an IOException (e.g., no free disk space left)
[ https://issues.apache.org/jira/browse/SPARK-19026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19026: Assignee: Apache Spark > local directories cannot be cleanuped when create directory of "executor-***" > throws IOException such as there is no more free disk space to create it etc. > --- > > Key: SPARK-19026 > URL: https://issues.apache.org/jira/browse/SPARK-19026 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 2.0.2 > Environment: linux >Reporter: zuotingbing >Assignee: Apache Spark > > i set SPARK_LOCAL_DIRS variable like this: > SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp > when there is no more free disk space on "/data4/spark/tmp" , other local > directories (/data2/spark/tmp,/data3/spark/tmp) cannot be cleanuped when my > application finished. > we should catch the IOExecption when create local dirs throws execption , > otherwise the variable "appDirectories(appId)" not be set , then local > directories "executor-***" cannot be deleted for this application. If the > number of folders "executor-***" > 32k we cannot create executor anymore on > this worker node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
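The fix proposed in the description, catching the IOException per directory so one full disk does not block cleanup of the others, could be sketched as follows (helper and variable names are hypothetical; the real change lives in the Worker's directory-creation path):

```scala
import java.io.{File, IOException}

// Try to create an "executor-*" directory under each configured local dir,
// but catch IOExceptions per directory. The successfully created dirs are
// still returned, so they can be registered in appDirectories(appId) and
// deleted when the application finishes.
def createExecutorDirs(localDirs: Seq[String], appId: String): Seq[File] =
  localDirs.flatMap { dir =>
    try {
      val executorDir = new File(dir, s"executor-$appId")
      if (!executorDir.exists() && !executorDir.mkdirs()) {
        throw new IOException(s"Failed to create directory $executorDir")
      }
      Some(executorDir)
    } catch {
      case _: IOException =>
        // Log and skip this directory instead of aborting registration
        // of the remaining directories.
        None
    }
  }
```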
[jira] [Updated] (SPARK-18974) FileInputDStream could not detect files which were moved into the directory
[ https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Wang updated SPARK-18974: -- Description: FileInputDStream use mod time to find new files, but if a file was moved into the directories it's modification time would not be changed, so FileInputDStream could not detect these files. (was: FileInputDStream use mod time to find new files, but if a file was moved into the directories it's modification time would not be changed, so FileInputDStream could not detect these files. I think a way to fix this bug is get access_time and do judgment, bug it need a Set of files to save all old files, it would very inefficient for lot of files directory.) > FileInputDStream could not detected files which moved to the directory > --- > > Key: SPARK-18974 > URL: https://issues.apache.org/jira/browse/SPARK-18974 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 >Reporter: Adam Wang > > FileInputDStream use mod time to find new files, but if a file was moved into > the directories it's modification time would not be changed, so > FileInputDStream could not detect these files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18974) FileInputDStream could not detect files which were moved into the directory
[ https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786735#comment-15786735 ] Adam Wang commented on SPARK-18974: --- Thanks for reminding, I haven't tried before, I will try later > FileInputDStream could not detected files which moved to the directory > --- > > Key: SPARK-18974 > URL: https://issues.apache.org/jira/browse/SPARK-18974 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 >Reporter: Adam Wang > > FileInputDStream use mod time to find new files, but if a file was moved into > the directories it's modification time would not be changed, so > FileInputDStream could not detect these files. > I think a way to fix this bug is get access_time and do judgment, bug it need > a Set of files to save all old files, it would very inefficient for lot of > files directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
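A common workaround on the producer side (a sketch, not part of Spark; the path is hypothetical) is to refresh the file's modification time after moving it into the monitored directory, so FileInputDStream's mod-time filter picks it up:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A move within the same filesystem preserves the original mtime, which is
// what hides the file from FileInputDStream. Bumping the mtime to "now"
// after the move makes it visible to the next batch.
val fs = FileSystem.get(new Configuration())
val moved = new Path("/data/stream-in/part-00000")  // hypothetical path
fs.setTimes(moved, System.currentTimeMillis(), -1)  // mtime = now, atime unchanged
```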
[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads
[ https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786726#comment-15786726 ] Felix Cheung commented on SPARK-12757: -- ping. Still seeing a lot of these messages on Spark 2.1. Is that a new issue? > Use reference counting to prevent blocks from being evicted during reads > > > Key: SPARK-12757 > URL: https://issues.apache.org/jira/browse/SPARK-12757 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > As a pre-requisite to off-heap caching of blocks, we need a mechanism to > prevent pages / blocks from being evicted while they are being read. With > on-heap objects, evicting a block while it is being read merely leads to > memory-accounting problems (because we assume that an evicted block is a > candidate for garbage-collection, which will not be true during a read), but > with off-heap memory this will lead to either data corruption or segmentation > faults. > To address this, we should add a reference-counting mechanism to track which > blocks/pages are being read in order to prevent them from being evicted > prematurely. I propose to do this in two phases: first, add a safe, > conservative approach in which all BlockManager.get*() calls implicitly > increment the reference count of blocks and where tasks' references are > automatically freed upon task completion. This will be correct but may have > adverse performance impacts because it will prevent legitimate block > evictions. In phase two, we should incrementally add release() calls in order > to fix the eviction of unreferenced blocks. The latter change may need to > touch many different components, which is why I propose to do it separately > in order to make the changes easier to reason about and review. 
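The two-phase plan in the description amounts to a pin/release protocol on blocks. A minimal sketch of the idea (not Spark's actual BlockManager code): get*() pins a block, release() unpins it, and eviction is only legal at a pin count of zero.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Minimal pin-count registry illustrating the reference-counting idea.
class BlockPinRegistry {
  private val pins = new ConcurrentHashMap[String, AtomicInteger]()

  // Called implicitly by every read of the block (phase one of the plan).
  def pin(blockId: String): Unit = {
    pins.putIfAbsent(blockId, new AtomicInteger(0))
    pins.get(blockId).incrementAndGet()
  }

  // Called on task completion, or explicitly once release() calls are
  // added throughout the code (phase two of the plan).
  def release(blockId: String): Unit = {
    val c = pins.get(blockId)
    if (c != null) c.decrementAndGet()
  }

  // Eviction must skip any block that is still being read.
  def canEvict(blockId: String): Boolean = {
    val c = pins.get(blockId)
    c == null || c.get() <= 0
  }
}
```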
[jira] [Resolved] (SPARK-17766) Write ahead log exception on a toy project
[ https://issues.apache.org/jira/browse/SPARK-17766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17766. -- Resolution: Duplicate This has been fixed in SPARK-18617 > Write ahead log exception on a toy project > -- > > Key: SPARK-17766 > URL: https://issues.apache.org/jira/browse/SPARK-17766 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Nadav Samet >Priority: Minor > > Write ahead log seems to get corrupted when the application is stopped > abruptly (Ctrl-C, or kill). Then, the application refuses to run due to this > exception: > {code} > 2016-10-03 08:03:32,321 ERROR [Executor task launch worker-1] > executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1) > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 13994 > ...skipping... > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {code} > Code: > {code} > import org.apache.hadoop.conf.Configuration > import org.apache.spark._ > import org.apache.spark.streaming._ > object ProtoDemo { > def createContext(dirName: String) = { > val conf = new SparkConf().setAppName("mything").setMaster("local[4]") > conf.set("spark.streaming.receiver.writeAheadLog.enable", "true") > /* > conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", > "true") > conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", > "true") > */ > val ssc = new StreamingContext(conf, Seconds(1)) > ssc.checkpoint(dirName) > val lines = ssc.socketTextStream("127.0.0.1", ) > val words = lines.flatMap(_.split(" ")) > val pairs = words.map(word => (word, 1)) > val wordCounts = pairs.reduceByKey(_ + _) > val runningCounts = wordCounts.updateStateByKey[Int] { > (values: Seq[Int], oldValue: Option[Int]) => > val s = values.sum > Some(oldValue.fold(s)(_ + s)) > } > // Print the first ten elements of each RDD generated in this DStream to > the console > runningCounts.print() > ssc > } > def main(args: Array[String]) = { > val hadoopConf = new Configuration() > val dirName = "/tmp/chkp" > val ssc = StreamingContext.getOrCreate(dirName, () => > createContext(dirName), hadoopConf) > ssc.start() > ssc.awaitTermination() > } > } > {code} > Steps to reproduce: > 1. I put the code in a repository: git clone > https://github.com/thesamet/spark-issue > 2. in one terminal: {{ while true; do nc -l localhost ; done}} > 3. Start a new terminal > 4. Run "sbt run". > 5. Type a few lines in the netcat terminal. > 6. Kill the streaming project (Ctrl-C), > 7. Go back to step 4 until you see the exception above. > I tried the above with local filesystem and also with S3, and getting the > same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19032) Non-deterministic results using aggregation first across multiple workers
Harry Weppner created SPARK-19032: - Summary: Non-deterministic results using aggregation first across multiple workers Key: SPARK-19032 URL: https://issues.apache.org/jira/browse/SPARK-19032 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 1.6.1 Environment: Standalone Spark 1.6.1 cluster on EC2 with 2 worker nodes, one executor each. Reporter: Harry Weppner We've come across a situation results aggregated using {{first}} on a sorted df are non-deterministic. Given the explanation for the plan there appears to be a plausible explanation but creates more question on the usefulness of these aggregation functions in a spark cluster. Here's a minimal example to reproduce: {code} val df = sc.parallelize(Seq(("a","prod1",0.6),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4),("a","prod2",0.4))).toDF("account","product","probability") var p = df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).show(); +---+++ |account|first(product)()|first(probability)()| +---+++ | a| prod1| 0.6| +---+++ p: Unit = () // Repeat and notice that result will occasionally be different +---+++ |account|first(product)()|first(probability)()| +---+++ | a| prod2| 0.4| +---+++ p: Unit = () scala> df.sort($"probability".desc).groupBy($"account").agg(first($"product"),first($"probability")).explain(true); == Parsed Logical Plan == 'Aggregate ['account], [unresolvedalias('account),(first('product)(),mode=Complete,isDistinct=false) AS first(product)()#523,(first('probability)(),mode=Complete,isDistinct=false) AS first(probability)()#524] +- Sort [probability#5 DESC], true +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at :27 == Analyzed Logical Plan == account: string, first(product)(): string, first(probability)(): double Aggregate [account#3], [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS 
first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) AS first(probability)()#524] +- Sort [probability#5 DESC], true +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at :27 == Optimized Logical Plan == Aggregate [account#3], [account#3,(first(product#4)(),mode=Complete,isDistinct=false) AS first(product)()#523,(first(probability#5)(),mode=Complete,isDistinct=false) AS first(probability)()#524] +- Sort [probability#5 DESC], true +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] +- LogicalRDD [_1#0,_2#1,_3#2], MapPartitionsRDD[1] at rddToDataFrameHolder at :27 == Physical Plan == SortBasedAggregate(key=[account#3], functions=[(first(product#4)(),mode=Final,isDistinct=false),(first(probability#5)(),mode=Final,isDistinct=false)], output=[account#3,first(product)()#523,first(probability)()#524]) +- ConvertToSafe +- Sort [account#3 ASC], false, 0 +- TungstenExchange hashpartitioning(account#3,200), None +- ConvertToUnsafe +- SortBasedAggregate(key=[account#3], functions=[(first(product#4)(),mode=Partial,isDistinct=false),(first(probability#5)(),mode=Partial,isDistinct=false)], output=[account#3,first#532,valueSet#533,first#534,valueSet#535]) +- ConvertToSafe +- Sort [account#3 ASC], false, 0 +- Sort [probability#5 DESC], true, 0 +- ConvertToUnsafe +- Exchange rangepartitioning(probability#5 DESC,200), None +- ConvertToSafe +- Project [_1#0 AS account#3,_2#1 AS product#4,_3#2 AS probability#5] +- Scan ExistingRDD[_1#0,_2#1,_3#2] {code} My working hypothesis is that after {{TungstenExchange hashpartitioning}} the _global_ sort order on {{probability}} is lost leading to non-deterministic results. If this hypothesis is valid, then how useful are aggregation functions such as {{first}}, {{last}} and possibly others in Spark? 
It appears that the use of window functions could address the ambiguity by making the partitions explicit but I'd be interested in your assessment. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
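The window-function alternative mentioned above can be sketched like this (assuming the same df from the reproduction; row_number over a window partitioned by account makes the per-group ordering explicit instead of relying on first() after a global sort):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Pick the top-probability row per account deterministically: the ordering
// is attached to the window itself, so no cross-partition sort order needs
// to survive a shuffle.
val w = Window.partitionBy($"account").orderBy($"probability".desc)
val top = df
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")
top.show()
```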
[jira] [Created] (SPARK-19031) JDBC Streaming Source
Michael Armbrust created SPARK-19031: Summary: JDBC Streaming Source Key: SPARK-19031 URL: https://issues.apache.org/jira/browse/SPARK-19031 Project: Spark Issue Type: New Feature Components: Structured Streaming Reporter: Michael Armbrust Many RDBMs provide the ability to capture changes to a table (change data capture). We should make this available as a streaming source. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18942) Support output operations for kinesis
[ https://issues.apache.org/jira/browse/SPARK-18942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-18942. -- Resolution: Won't Fix > Support output operations for kinesis > - > > Key: SPARK-18942 > URL: https://issues.apache.org/jira/browse/SPARK-18942 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro >Priority: Trivial > > Spark does not support output operations (e.g. DStream#saveAsTextFile) for > Kinesis. So, officially supporting this is useful for some AWS users, I > think. An usage of the output operations is assumed as follows; > {code} > // Import a class that includes an output function > scala> import org.apache.spark.streaming.kinesis.KinesisDStreamFunctions._ > // Create a DStream > scala> val stream: DStream[String] = ... > // Define a handler to convert the DStream type for output > scala> val msgHandler = (s: String) => s.getBytes("UTF-8") > // Define the output operation > scala> kinesisStream.count().saveAsKinesisStream(streamName, endpointUrl, > msgHandler) > {code} > A prototype I made is here: > https://github.com/apache/spark/compare/master...maropu:OutputOpForKinesis -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18942) Support output operations for kinesis
[ https://issues.apache.org/jira/browse/SPARK-18942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786515#comment-15786515 ] Takeshi Yamamuro commented on SPARK-18942: -- okay, I'll put 'won't fix' in this ticket and thanks! Since I make some kinds of kinesis integration in my repo (https://github.com/maropu/spark-kinesis-sql-asl#output-operation-for-spark-streaming), I'll put this in SparkPackage in future. > Support output operations for kinesis > - > > Key: SPARK-18942 > URL: https://issues.apache.org/jira/browse/SPARK-18942 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro >Priority: Trivial > > Spark does not support output operations (e.g. DStream#saveAsTextFile) for > Kinesis. So, officially supporting this is useful for some AWS users, I > think. An usage of the output operations is assumed as follows; > {code} > // Import a class that includes an output function > scala> import org.apache.spark.streaming.kinesis.KinesisDStreamFunctions._ > // Create a DStream > scala> val stream: DStream[String] = ... > // Define a handler to convert the DStream type for output > scala> val msgHandler = (s: String) => s.getBytes("UTF-8") > // Define the output operation > scala> kinesisStream.count().saveAsKinesisStream(streamName, endpointUrl, > msgHandler) > {code} > A prototype I made is here: > https://github.com/apache/spark/compare/master...maropu:OutputOpForKinesis -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786390#comment-15786390 ] Devesh Parekh commented on SPARK-18693: --- I suggest this is more appropriately classified as a bug rather than an improvement. Users who follow the documentation to use CrossValidator for model selection with these evaluators and weighted input will get wrong results. At the very least, the user should be warned in the documentation that the results will be wrong if they fit a weight-aware model on weighted input and use these existing evaluators in CrossValidator. With that warning in place, making the evaluators work on weighted input would then be an improvement. > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
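To see concretely why an evaluator that ignores weights breaks model selection, here is a pure-Python sketch with no Spark dependency. The data, weights, and the two "models" are made up for illustration: the unweighted metric prefers one model while the weighted metric prefers the other, so CrossValidator would pick the wrong one.

```python
def accuracy(labels, preds, weights=None):
    """Weighted accuracy: total weight on correct rows / total weight.
    With weights=None every row counts equally, which is effectively
    what an evaluator that ignores the weight column computes."""
    if weights is None:
        weights = [1.0] * len(labels)
    correct = sum(w for y, p, w in zip(labels, preds, weights) if y == p)
    return correct / sum(weights)

labels  = [1, 1, 0, 0]
weights = [10.0, 10.0, 1.0, 1.0]  # the first two rows dominate

model_a = [1, 1, 1, 1]  # right only on the heavily weighted rows
model_b = [1, 0, 0, 0]  # right on more rows, but mostly light ones

# Unweighted: model_b looks better (0.75 vs 0.5).
# Weighted:   model_a is better (20/22 vs 12/22).
```

Under the weighted metric the ranking flips, which is the "wrong results" failure mode described above.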
[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786368#comment-15786368 ] Egor Pahomov commented on SPARK-18930: -- I'm not sure that such a restriction, buried in the documentation, is OK. Basically the problem is: I've created a correct schema for the table and inserted into it correctly, but for some reason I need to keep a particular order of columns in the select statement. > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can see, these numbers are the counts of records. But when I do select * > from temp.test_partitioning_4 the data is correct.
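The surprising directory names come from the Hive-style INSERT rule: selected columns are mapped onto the target schema by position, not by name, with partition columns last. A pure-Python sketch of that positional rule (table layout and values taken from the report above; `insert_by_position` is an illustrative helper, not a Spark API):

```python
def insert_by_position(target_columns, rows):
    """Map each selected row onto the target schema positionally,
    ignoring any source column names -- the Hive/Spark INSERT rule."""
    return [dict(zip(target_columns, row)) for row in rows]

# Target schema: data column `num`, then the partition column `day` last.
target = ["num", "day"]

# SELECT day, count(*) AS num  ->  tuples arrive as (day_value, count_value)
selected = [("2016-04-01", 62456298), ("2016-04-02", 69094345)]

rows = insert_by_position(target, selected)
# The day string lands in `num`, and the count becomes the partition value,
# producing HDFS directories like day=62456298 as reported above.
```

Putting the partition column last in the SELECT (`select count(*) as num, day ... group by day`) would make the positional mapping line up with the intended schema.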
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786319#comment-15786319 ] Joseph K. Bradley commented on SPARK-18813: --- I just added links to the categories listed above to help with maintenance. Given feedback, I'll go ahead and update the text above to confirm that the proposed roadmap process will be used. But further feedback is welcome. Some JIRAs likely do not yet follow the process proposal (e.g., lacking shepherds). I'll start trying to ping on those JIRAs which need to be updated. > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > * Vote for & watch issues which are important to you. 
> ** MLlib, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] > ** SparkR, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | [1 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Blocker | *must* | *must* | *must* | > | [2 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Critical | *must* | yes, unless small | *best effort* | > | [3 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Major | *must* | optional | *best effort* | > | [4 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Minor | optional | no | maybe | > | [5 | >
[jira] [Updated] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18813: -- Description: *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* The roadmap process described below is significantly updated since the 2.1 roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on the basis for this proposal, and comment in this JIRA if you have suggestions for improvements. h1. Roadmap process This roadmap is a master list for MLlib improvements we are working on during this release. This includes ML-related changes in PySpark and SparkR. *What is planned for the next release?* * This roadmap lists issues which at least one Committer has prioritized. See details below in "Instructions for committers." * This roadmap only lists larger or more critical issues. *How can contributors influence this roadmap?* * If you believe an issue should be in this roadmap, please discuss the issue on JIRA and/or the dev mailing list. Make sure to ping Committers since at least one must agree to shepherd the issue. * For general discussions, use this JIRA or the dev mailing list. For specific issues, please comment on those issues or the mailing list. * Vote for & watch issues which are important to you. 
** MLlib, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] ** SparkR, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] h2. Target Version and Priority This section describes the meaning of Target Version and Priority. _These meanings have been updated in this proposal for the 2.2 process._ || Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? 
|| | [1 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] | next release | Blocker | *must* | *must* | *must* | | [2 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] | next release | Critical | *must* | yes, unless small | *best effort* | | [3 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] | next release | Major | *must* | optional | *best effort* | | [4 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] | next release | Minor | optional | no | maybe | | [5 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] | next release | Trivial | optional | no | maybe | | [6 | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | yes | no | maybe | | [7 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(EMPTY)%20AND%20Shepherd%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) |
[jira] [Commented] (SPARK-19026) local directories cannot be cleaned up when creating an "executor-***" directory throws an IOException (e.g. when there is no more free disk space)
[ https://issues.apache.org/jira/browse/SPARK-19026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786276#comment-15786276 ] Sean Owen commented on SPARK-19026: --- Can you clarify? I'm not sure what you're proposing here. Maybe a PR is the best way to express it. > local directories cannot be cleaned up when creating an "executor-***" > directory throws an IOException (e.g. when there is no more free disk space) > --- > > Key: SPARK-19026 > URL: https://issues.apache.org/jira/browse/SPARK-19026 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 2.0.2 > Environment: linux >Reporter: zuotingbing > > I set the SPARK_LOCAL_DIRS variable like this: > SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp > When there is no more free disk space on "/data4/spark/tmp", the other local > directories (/data2/spark/tmp, /data3/spark/tmp) cannot be cleaned up when my > application finishes. > We should catch the IOException thrown when creating the local dirs; > otherwise the variable "appDirectories(appId)" is never set, and the local > directories "executor-***" cannot be deleted for this application. If the > number of "executor-***" folders exceeds 32k, we cannot create any more executors on > this worker node.
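The fix the reporter is hinting at, recording the directories that were actually created even when a later mkdir fails, can be sketched without Spark. `create_local_dirs` is a hypothetical helper; the real change would live in the Worker's directory-creation code around `appDirectories(appId)`.

```python
import os
import tempfile

def create_local_dirs(paths):
    """Try to create every configured local dir. Directories that were
    created successfully are still returned (so cleanup can find them)
    instead of being lost when a single mkdir raises, e.g. disk full."""
    created, errors = [], []
    for p in paths:
        try:
            os.makedirs(p, exist_ok=True)
            created.append(p)
        except OSError as e:  # ENOSPC, permission errors, etc.
            errors.append((p, e))
    return created, errors

# Demo: one path is creatable, the other fails because its "parent"
# is a regular file (standing in for a full disk).
base = tempfile.mkdtemp()
blocker = os.path.join(base, "blocker")
open(blocker, "w").close()
created, errors = create_local_dirs(
    [os.path.join(base, "data2"), os.path.join(blocker, "data4")])
```

Because the successfully created directory is still tracked, a cleanup pass over `created` can delete it, which is the behavior the report says is currently lost.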
[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786241#comment-15786241 ] Sean Owen commented on SPARK-18930: --- I don't know enough to say that myself. [~epahomov] What's the actual problem here? You say it seems to work correctly. > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can see, these numbers are the counts of records. But when I do select * > from temp.test_partitioning_4 the data is correct.
[jira] [Updated] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
[ https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19003: -- Assignee: Tushar Adeshara > Add Java examples in "Spark Streaming Guide", section "Design Patterns for > using foreachRDD" > - > > Key: SPARK-19003 > URL: https://issues.apache.org/jira/browse/SPARK-19003 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Tushar Adeshara >Assignee: Tushar Adeshara >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > The page http://spark.apache.org/docs/latest/streaming-programming-guide.html > is missing Java example in section "Design Patterns for using foreachRDD". > Except this section, the page has Scala, Java and Python examples for all > other sections, so would be good to add for consistency. > I have made required code changes, will raise a pull request against this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19003) Add Java examples in "Spark Streaming Guide", section "Design Patterns for using foreachRDD"
[ https://issues.apache.org/jira/browse/SPARK-19003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19003. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16408 [https://github.com/apache/spark/pull/16408] > Add Java examples in "Spark Streaming Guide", section "Design Patterns for > using foreachRDD" > - > > Key: SPARK-19003 > URL: https://issues.apache.org/jira/browse/SPARK-19003 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Tushar Adeshara >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > The page http://spark.apache.org/docs/latest/streaming-programming-guide.html > is missing Java example in section "Design Patterns for using foreachRDD". > Except this section, the page has Scala, Java and Python examples for all > other sections, so would be good to add for consistency. > I have made required code changes, will raise a pull request against this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18693: -- Issue Type: Improvement (was: Bug) > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18698: -- Assignee: Ilya Matiach > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Bjoern Toldbod >Assignee: Ilya Matiach >Priority: Minor > Fix For: 2.2.0 > > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipelinestages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18698. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16436 [https://github.com/apache/spark/pull/16436] > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Bjoern Toldbod >Priority: Minor > Fix For: 2.2.0 > > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipelinestages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19030) Dropped event errors being reported after SparkContext has been stopped
michael procopio created SPARK-19030: Summary: Dropped event errors being reported after SparkContext has been stopped Key: SPARK-19030 URL: https://issues.apache.org/jira/browse/SPARK-19030 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Environment: Debian 8 using spark-submit with MATLAB integration; Spark code is written using Java. Reporter: michael procopio Priority: Minor After stop has been called on SparkContext, errors are being reported: 6/12/29 15:54:04 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray()) The stack in the heartbeat thread at the point where the error is thrown is: Daemon Thread [heartbeat-receiver-event-loop-thread] (Suspended (breakpoint at line 124 in LiveListenerBus)) LiveListenerBus.post(SparkListenerEvent) line: 124 DAGScheduler.executorHeartbeatReceived(String, Tuple4
[jira] [Commented] (SPARK-16402) JDBC source: Implement save API
[ https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786114#comment-15786114 ] Xiao Li commented on SPARK-16402: - Yes. > JDBC source: Implement save API > --- > > Key: SPARK-16402 > URL: https://issues.apache.org/jira/browse/SPARK-16402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are unable to call the `save` API of `DataFrameWriter` when the > source is JDBC. For example, > {noformat} > df.write > .format("jdbc") > .option("url", url1) > .option("dbtable", "TEST.TRUNCATETEST") > .option("user", "testUser") > .option("password", "testPass") > .save() > {noformat} > The error message users will get is like > {noformat} > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > java.lang.RuntimeException: > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > {noformat} > However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16402) JDBC source: Implement save API
[ https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-16402. --- Resolution: Duplicate > JDBC source: Implement save API > --- > > Key: SPARK-16402 > URL: https://issues.apache.org/jira/browse/SPARK-16402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are unable to call the `save` API of `DataFrameWriter` when the > source is JDBC. For example, > {noformat} > df.write > .format("jdbc") > .option("url", url1) > .option("dbtable", "TEST.TRUNCATETEST") > .option("user", "testUser") > .option("password", "testPass") > .save() > {noformat} > The error message users will get is like > {noformat} > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > java.lang.RuntimeException: > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > {noformat} > However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19029) Remove databaseName from SimpleCatalogRelation
[ https://issues.apache.org/jira/browse/SPARK-19029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786109#comment-15786109 ] Apache Spark commented on SPARK-19029: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16438 > Remove databaseName from SimpleCatalogRelation > --- > > Key: SPARK-19029 > URL: https://issues.apache.org/jira/browse/SPARK-19029 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Remove useless `databaseName ` from `SimpleCatalogRelation`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19029) Remove databaseName from SimpleCatalogRelation
[ https://issues.apache.org/jira/browse/SPARK-19029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19029: Assignee: Xiao Li (was: Apache Spark) > Remove databaseName from SimpleCatalogRelation > --- > > Key: SPARK-19029 > URL: https://issues.apache.org/jira/browse/SPARK-19029 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Remove useless `databaseName ` from `SimpleCatalogRelation`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19029) Remove databaseName from SimpleCatalogRelation
[ https://issues.apache.org/jira/browse/SPARK-19029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19029: Assignee: Apache Spark (was: Xiao Li) > Remove databaseName from SimpleCatalogRelation > --- > > Key: SPARK-19029 > URL: https://issues.apache.org/jira/browse/SPARK-19029 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Remove useless `databaseName ` from `SimpleCatalogRelation`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19029) Remove databaseName from SimpleCatalogRelation
Xiao Li created SPARK-19029: --- Summary: Remove databaseName from SimpleCatalogRelation Key: SPARK-19029 URL: https://issues.apache.org/jira/browse/SPARK-19029 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li Remove useless `databaseName` from `SimpleCatalogRelation`.
[jira] [Updated] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19012: -- Affects Version/s: 2.1.0 > CreateOrReplaceTempView throws > org.apache.spark.sql.catalyst.parser.ParseException when viewName first char > is numerical > > > Key: SPARK-19012 > URL: https://issues.apache.org/jira/browse/SPARK-19012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2, 2.1.0 >Reporter: Jork Zijlstra >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Using a viewName where the the fist char is a numerical value on > dataframe.createOrReplaceTempView(viewName: String) causes: > {code} > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', > 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', > 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 
'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) > == SQL == > 1 > {code} > {code} > val tableOrViewName = "1" //fails > val tableOrViewName = "a" //works > sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786052#comment-15786052 ] Herman van Hovell commented on SPARK-19012: --- Ok, you could also start a table name with {{tbl_}} and that would also make the problem go away. > CreateOrReplaceTempView throws > org.apache.spark.sql.catalyst.parser.ParseException when viewName first char > is numerical > > > Key: SPARK-19012 > URL: https://issues.apache.org/jira/browse/SPARK-19012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2 >Reporter: Jork Zijlstra >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Using a viewName where the the fist char is a numerical value on > dataframe.createOrReplaceTempView(viewName: String) causes: > {code} > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', > 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', > 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) > == SQL == > 1 > {code} > {code} > val tableOrViewName = "1" //fails > val tableOrViewName = "a" //works > sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
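As the comment above suggests, prefixing the view name avoids the parse error. A minimal sketch of that workaround, using a hypothetical `safeViewName` helper (not a Spark API):

```scala
// Hypothetical helper (not part of Spark's API): the SQL parser rejects
// unquoted identifiers that start with a digit, so prefix such names
// before passing them to createOrReplaceTempView.
def safeViewName(name: String): String =
  if (name.nonEmpty && name.head.isDigit) s"tbl_$name" else name

// Usage (requires a SparkSession; shown for illustration only):
// sparkSession.read.orc(path).createOrReplaceTempView(safeViewName("1468079114"))
```

The prefix "tbl_" follows the workaround suggested in the comment; any prefix that starts the identifier with a letter would do.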
[jira] [Resolved] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19012. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.2.0 > CreateOrReplaceTempView throws > org.apache.spark.sql.catalyst.parser.ParseException when viewName first char > is numerical > > > Key: SPARK-19012 > URL: https://issues.apache.org/jira/browse/SPARK-19012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2 >Reporter: Jork Zijlstra >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Using a viewName where the the fist char is a numerical value on > dataframe.createOrReplaceTempView(viewName: String) causes: > {code} > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', > 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', > 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) > == SQL == > 1 > {code} > {code} > val tableOrViewName = "1" //fails > val tableOrViewName = "a" //works > sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18669) Update Apache docs regard watermarking in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18669. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > Update Apache docs regard watermarking in Structured Streaming > -- > > Key: SPARK-18669 > URL: https://issues.apache.org/jira/browse/SPARK-18669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786015#comment-15786015 ] Dongjoon Hyun commented on SPARK-19012: --- Yep. I tried to update the annotation but unfortunately it was reverted that now. (You can see that in my PR.) > Maybe updating the annotation of the method would also be enough. Having an > Exception with a clear reason would definitely already a fix. Changing annotation on `public` API seems to be handled in a different issue with some more discussion because it affects many other codes (e.g. examples). > CreateOrReplaceTempView throws > org.apache.spark.sql.catalyst.parser.ParseException when viewName first char > is numerical > > > Key: SPARK-19012 > URL: https://issues.apache.org/jira/browse/SPARK-19012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2 >Reporter: Jork Zijlstra > > Using a viewName where the the fist char is a numerical value on > dataframe.createOrReplaceTempView(viewName: String) causes: > {code} > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', > 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', > 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 
'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) > == SQL == > 1 > {code} > {code} > val tableOrViewName = "1" //fails > val tableOrViewName = "a" //works > sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786013#comment-15786013 ] Ilya Matiach commented on SPARK-18693: -- Many classifiers in ML don't seem to support weight columns yet, so other JIRAs probably need to be created to add weight columns to them (e.g. DecisionTreeClassifier). Also, it doesn't look like any packages in MLLIB contain weight columns, so I should probably try to limit the changes to ML only, but that is difficult since the ML evaluators are just wrappers around MLLIB. Also, please note that the pull request linked here hasn't been updated in a long time, and it only resolved the issue for RegressionMetrics in MLLIB: "SPARK-11520 RegressionMetrics should support instance weights". I'm still planning out the changes that need to be made; since this one looks nontrivial, any suggestions from Spark folks? > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18698: -- Shepherd: Joseph K. Bradley Affects Version/s: (was: 2.0.2) Target Version/s: 2.2.0 > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Bjoern Toldbod >Priority: Minor > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipeline stages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18698: -- Issue Type: Improvement (was: Wish) > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod >Priority: Minor > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipeline stages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18805) InternalMapWithStateDStream makes java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15786000#comment-15786000 ] Shixiong Zhu commented on SPARK-18805: -- @etienne That should not be an infinite loop: the time is different on each call. Do you have the beginning of the stack trace? SPARK-6847 may be related, but you can still reproduce it in 2.0.2. > InternalMapWithStateDStream makes java.lang.StackOverflowError > -- > > Key: SPARK-18805 > URL: https://issues.apache.org/jira/browse/SPARK-18805 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 > Environment: mesos >Reporter: etienne > > When loading InternalMapWithStateDStream from a checkpoint, if isValidTime is true and there is no generatedRDD at the given time, there is an infinite loop: > 1) compute is called on InternalMapWithStateDStream > 2) InternalMapWithStateDStream tries to generate the previous RDD > 3) The stream looks in generatedRDD to check whether the RDD is already generated for the given > time > 4) It does not find the RDD, so it checks whether the time is valid.
> 5) if the time is valid call compute on InternalMapWithStateDStream > 6) restart from 1) > Here the exception that illustrate this error > {code} > Exception in thread "streaming-start" java.lang.StackOverflowError > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
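The looping steps in the description can be modeled without Spark. The following is a deliberately simplified sketch (plain Scala, not DStream code) of why memoizing each generated value per batch time bounds the mutual recursion between compute and getOrCompute:

```scala
import scala.collection.mutable

// Stand-in for DStream.generatedRDDs: batch time -> generated value.
val generated = mutable.Map[Int, String]()

// Consults the cache first, mirroring step 3 of the description.
def getOrCompute(time: Int): String =
  generated.getOrElseUpdate(time, compute(time))

// Depends on the previous batch, mirroring steps 1-2 of the description.
def compute(time: Int): String =
  if (time <= 0) "initial-state"
  else s"state(${getOrCompute(time - 1)})"
```

Restoring from a checkpoint with an empty cache makes the recursion depth proportional to how far back the times considered valid reach, which matches the deep repeated getOrCompute/compute frames in the stack trace.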
[jira] [Commented] (SPARK-16402) JDBC source: Implement save API
[ https://issues.apache.org/jira/browse/SPARK-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785990#comment-15785990 ] Nicholas Chammas commented on SPARK-16402: -- [~JustinPihony], [~smilegator] - Does the resolution on SPARK-14525 also resolve this issue? > JDBC source: Implement save API > --- > > Key: SPARK-16402 > URL: https://issues.apache.org/jira/browse/SPARK-16402 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are unable to call the `save` API of `DataFrameWriter` when the > source is JDBC. For example, > {noformat} > df.write > .format("jdbc") > .option("url", url1) > .option("dbtable", "TEST.TRUNCATETEST") > .option("user", "testUser") > .option("password", "testPass") > .save() > {noformat} > The error message users will get is like > {noformat} > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > java.lang.RuntimeException: > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does not > allow create table as select. > {noformat} > However, the `save` API is very common for all the data sources, like parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devesh Parekh updated SPARK-18693: -- Description: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights. This breaks model selection using CrossValidator. (was: The LogisticRegression and LinearRegression models support training with a weight column, but the corresponding evaluators do not support computing metrics using those weights.) > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18942) Support output operations for kinesis
[ https://issues.apache.org/jira/browse/SPARK-18942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785953#comment-15785953 ] Shixiong Zhu commented on SPARK-18942: -- Thanks for your prototype. Actually, you can just implement an RDD action, or a DataFrame DataSource, and publish them as a Spark package like [spark-redshift|https://github.com/databricks/spark-redshift]. [Spark Packages|https://spark-packages.org/] is a better place for such third-party data sources. > Support output operations for kinesis > - > > Key: SPARK-18942 > URL: https://issues.apache.org/jira/browse/SPARK-18942 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro >Priority: Trivial > > Spark does not support output operations (e.g. DStream#saveAsTextFile) for > Kinesis, so officially supporting this would be useful for some AWS users. A usage of the output operations is assumed as follows: > {code} > // Import a class that includes an output function > scala> import org.apache.spark.streaming.kinesis.KinesisDStreamFunctions._ > // Create a DStream > scala> val stream: DStream[String] = ... > // Define a handler to convert the DStream type for output > scala> val msgHandler = (s: String) => s.getBytes("UTF-8") > // Define the output operation > scala> stream.count().saveAsKinesisStream(streamName, endpointUrl, > msgHandler) > {code} > A prototype I made is here: > https://github.com/apache/spark/compare/master...maropu:OutputOpForKinesis -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
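A `saveAsKinesisStream` method imported from `KinesisDStreamFunctions._` would be added through Scala's implicit-class enrichment pattern. A self-contained sketch of just that pattern (a plain Seq stands in for a DStream, an in-memory buffer stands in for Kinesis, and all names are illustrative, not the prototype's actual API):

```scala
import scala.collection.mutable.ArrayBuffer

object KinesisSinkSketch {
  // Stand-in for the Kinesis stream: records "sent" so far.
  val sent = ArrayBuffer[Array[Byte]]()

  // Enrichment: adds saveAsKinesisStream to any Seq, the way an imported
  // implicit class would add it to a DStream.
  implicit class SaveOps[A](records: Seq[A]) {
    def saveAsKinesisStream(streamName: String, msgHandler: A => Array[Byte]): Unit =
      records.foreach(r => KinesisSinkSketch.sent += msgHandler(r))
  }
}
```

Importing the object's members brings the extra method into scope, which is why the quoted example begins with an import.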
[jira] [Commented] (SPARK-15493) Allow setting the quoteEscapingEnabled flag when writing CSV
[ https://issues.apache.org/jira/browse/SPARK-15493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785951#comment-15785951 ] Jacob Wellington commented on SPARK-15493: -- I'm running into an issue where this doesn't seem to be working for the SQL interface. I'm connecting to the thrift server using beeline and submitting the following sql: {quote} CREATE TABLE e2 USING csv OPTIONS (path 'test.csv', quote '"', escapeQuotes 'false', quoteEscapingEnabled 'false') AS SELECT '"G"' FROM parquet.`test.parquet`; DROP TABLE e2; {quote} When I look at the test.csv output I get this: {quote} "\"G\"" {quote} I'm using spark 2.0.2 with its version of beeline and its hive server. I've also tried multiple variations of the options. > Allow setting the quoteEscapingEnabled flag when writing CSV > > > Key: SPARK-15493 > URL: https://issues.apache.org/jira/browse/SPARK-15493 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jurriaan Pruis >Assignee: Jurriaan Pruis > Fix For: 2.0.0 > > > See > https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247 > This kind of functionality is needed to be able to write RFC 4180 > (https://tools.ietf.org/html/rfc4180#section-2) / Amazon Redshift compatible > CSV files > (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-csv) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
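For reference, the two escaping styles at issue can be shown without Spark: RFC 4180 escapes an embedded quote by doubling it, while the output reported above uses backslash escaping. The helpers below are illustrative sketches, not Spark APIs:

```scala
// RFC 4180 style: an embedded quote is escaped by doubling it ("" inside
// a quoted field).
def rfc4180Field(s: String): String =
  "\"" + s.replace("\"", "\"\"") + "\""

// Backslash style: produces output shaped like the "\"G\"" reported above.
def backslashField(s: String): String =
  "\"" + s.replace("\"", "\\\"") + "\""
```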
[jira] [Commented] (SPARK-18974) FileInputDStream could not detect files which moved to the directory
[ https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785930#comment-15785930 ] Shixiong Zhu commented on SPARK-18974: -- Do you want to try Structured Streaming? Its FileStreamSource allows files up to 7 days old by default. > FileInputDStream could not detect files which moved to the directory > --- > > Key: SPARK-18974 > URL: https://issues.apache.org/jira/browse/SPARK-18974 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 >Reporter: Adam Wang > > FileInputDStream uses modification time to find new files, but if a file is moved into the directory its modification time does not change, so FileInputDStream cannot detect these files. > I think a way to fix this bug is to get access_time and compare against it, but that needs a Set of files to record all old files, which would be very inefficient for a directory with many files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
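The reporter's concern can be made concrete with a small sketch. This is not Spark's implementation; it just illustrates tracking seen file names instead of modification times, including the memory trade-off the report mentions:

```scala
import scala.collection.mutable

// Remembers every file name ever listed; a file moved into the directory
// is picked up because its name is new, regardless of its mod time.
// The trade-off: `seen` grows without bound as files accumulate.
class SeenFileTracker {
  private val seen = mutable.Set[String]()

  /** Returns the entries in `listing` not seen in any earlier listing. */
  def newFiles(listing: Seq[String]): Seq[String] = {
    val fresh = listing.filterNot(seen)
    seen ++= fresh
    fresh
  }
}
```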
[jira] [Updated] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18359: - Component/s: (was: Spark Core) > Let user specify locale in CSV parsing > -- > > Key: SPARK-18359 > URL: https://issues.apache.org/jira/browse/SPARK-18359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: yannick Radji > > On the DataFrameReader object there is no CSV-specific option to set the decimal separator to a comma rather than a dot, as is customary in France and elsewhere in Europe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18404) RPC call from executor to driver blocks when getting map output locations (Netty Only)
[ https://issues.apache.org/jira/browse/SPARK-18404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785914#comment-15785914 ] Shixiong Zhu commented on SPARK-18404: -- That's pretty weird. It's a blocking call for both netty and akka rpc. > RPC call from executor to driver blocks when getting map output locations > (Netty Only) > -- > > Key: SPARK-18404 > URL: https://issues.apache.org/jira/browse/SPARK-18404 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Jeffrey Shmain > > Compared an identical application run on Spark 1.5 and Spark 1.6 and noticed that > jobs became slower. After looking at it closer, found that 75% of tasks > finished the same or better, and 25% had significant delays (unrelated to data > skew and GC). > After more debugging, noticed that the executors are blocking for a few seconds > (sometimes 25) on this call: > https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L199 >logInfo("Doing the fetch; tracker endpoint = " + trackerEndpoint) > // This try-finally prevents hangs due to timeouts: > try { > val fetchedBytes = > askTracker[Array[Byte]](GetMapOutputStatuses(shuffleId)) > fetchedStatuses = > MapOutputTracker.deserializeMapStatuses(fetchedBytes) > logInfo("Got the output locations") > So the regression seems to be related to changing the default RPC implementation from Akka to > Netty. > This was an application working with RDDs, submitting 10 concurrent queries > at a time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19012: - Component/s: (was: Spark Core) SQL > CreateOrReplaceTempView throws > org.apache.spark.sql.catalyst.parser.ParseException when viewName first char > is numerical > > > Key: SPARK-19012 > URL: https://issues.apache.org/jira/browse/SPARK-19012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.0.2 >Reporter: Jork Zijlstra > > Using a viewName where the the fist char is a numerical value on > dataframe.createOrReplaceTempView(viewName: String) causes: > {code} > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', > 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', > 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 
'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', > 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', > 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', > 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', > 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, > DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', > 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', > 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', > 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', > 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', > IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) > == SQL == > 1 > {code} > {code} > val tableOrViewName = "1" //fails > val tableOrViewName = "a" //works > sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19028) Fixed non-thread-safe functions used in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-19028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19028: Assignee: Xiao Li (was: Apache Spark) > Fixed non-thread-safe functions used in SessionCatalog > -- > > Key: SPARK-19028 > URL: https://issues.apache.org/jira/browse/SPARK-19028 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Fixed non-thread-safe functions used in SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19028) Fixed non-thread-safe functions used in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-19028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19028: Assignee: Apache Spark (was: Xiao Li) > Fixed non-thread-safe functions used in SessionCatalog > -- > > Key: SPARK-19028 > URL: https://issues.apache.org/jira/browse/SPARK-19028 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Fixed non-thread-safe functions used in SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19028) Fixed non-thread-safe functions used in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-19028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785827#comment-15785827 ] Apache Spark commented on SPARK-19028: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16437 > Fixed non-thread-safe functions used in SessionCatalog > -- > > Key: SPARK-19028 > URL: https://issues.apache.org/jira/browse/SPARK-19028 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Fixed non-thread-safe functions used in SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19028) Fixed non-thread-safe functions used in SessionCatalog
Xiao Li created SPARK-19028: --- Summary: Fixed non-thread-safe functions used in SessionCatalog Key: SPARK-19028 URL: https://issues.apache.org/jira/browse/SPARK-19028 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2 Reporter: Xiao Li Assignee: Xiao Li Fixed non-thread-safe functions used in SessionCatalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18737) Serialization setting "spark.serializer" ignored in Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-18737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785715#comment-15785715 ] Josh Bacon commented on SPARK-18737: I think this issue may be related to the following issues: https://issues.apache.org/jira/browse/SPARK-18560 https://issues.apache.org/jira/browse/SPARK-18617 > Serialization setting "spark.serializer" ignored in Spark 2.x > - > > Key: SPARK-18737 > URL: https://issues.apache.org/jira/browse/SPARK-18737 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Dr. Michael Menzel > > The following exception occurs although the JavaSerializer has been activated: > 16/11/22 10:49:24 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID > 77, ip-10-121-14-147.eu-central-1.compute.internal, partition 1, RACK_LOCAL, > 5621 bytes) > 16/11/22 10:49:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching > task 77 on executor id: 2 hostname: > ip-10-121-14-147.eu-central-1.compute.internal. 
> 16/11/22 10:49:24 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory > on ip-10-121-14-147.eu-central-1.compute.internal:45059 (size: 879.0 B, free: > 410.4 MB) > 16/11/22 10:49:24 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 77, > ip-10-121-14-147.eu-central-1.compute.internal): > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 13994 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at org.apache.spark.util.NextIterator.to(NextIterator.scala:21) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21) > at > org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927) > at > org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927) > at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > The code runs perfectly with Spark 1.6.0. Since we moved to 2.0.0 and now > 2.0.1, we see the Kryo deserialization exception, and over time the Spark > streaming job stops processing because too many tasks have failed. > Our action was to use conf.set("spark.serializer", > "org.apache.spark.serializer.JavaSerializer") and to disable Kryo class > registration with conf.set("spark.kryo.registrationRequired", false). We hope > to identify the root cause of the exception. > However, setting the serializer to JavaSerializer is obviously ignored by the > Spark internals. Despite the setting, we still see the exception printed in > the log and tasks fail. The occurrence seems to be non-deterministic, but to > become more frequent over time. > Several questions we could not answer during our troubleshooting: > 1. How can the debug log for Kryo be enabled? -- We tried following the > MinLog documentation, but no output can be found. > 2. Is the serializer setting effective
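For reference, the reporter's workaround boils down to two settings. A minimal sketch collecting them as plain key/value pairs (mimicking a chain of SparkConf.set calls; the keys are the ones quoted in the report):

```python
def apply_settings(conf, settings):
    """Merge settings into a conf dict without mutating the original,
    mimicking a chain of SparkConf.set(key, value) calls."""
    merged = dict(conf)
    merged.update(settings)
    return merged

# The reporter's attempted workaround: force Java serialization and drop
# Kryo's registration requirement. Values are strings, as Spark conf expects.
workaround = {
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
    "spark.kryo.registrationRequired": "false",
}

base = {"spark.app.name": "demo"}
conf = apply_settings(base, workaround)
```

The bug report is precisely that these settings appear to be ignored by some internal code paths, so applying them is not a reliable fix.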
[jira] [Commented] (SPARK-18883) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-18883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785634#comment-15785634 ] Steve Loughran commented on SPARK-18883: thanks, good to know > FileNotFoundException on _temporary directory > -- > > Key: SPARK-18883 > URL: https://issues.apache.org/jira/browse/SPARK-18883 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: We're on a CDH 5.7, Hadoop 2.6. >Reporter: Mathieu D > > I'm experiencing the following exception, usually after some time with heavy > load : > {code} > 16/12/15 11:25:18 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > java.io.FileNotFoundException: File > hdfs://nameservice1/user/xdstore/rfs/rfsDB/_temporary/0 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853) > at > org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:860) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:291) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:361) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334) > at > org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) > at > 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:525) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at > 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488) > at > com.bluedme.woda.ng.indexer.RfsRepository.append(RfsRepository.scala:36) > at > com.bluedme.woda.ng.indexer.RfsRepository.insert(RfsRepository.scala:23) > at > com.bluedme.woda.cmd.ShareDatasetImpl.runImmediate(ShareDatasetImpl.scala:33) > at > com.bluedme.woda.cmd.ShareDatasetImpl.runImmediate(ShareDatasetImpl.scala:13) > at >
[jira] [Assigned] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18698: Assignee: Apache Spark > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod >Assignee: Apache Spark >Priority: Minor > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipelinestages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16494) Upgrade breeze version to 0.12
[ https://issues.apache.org/jira/browse/SPARK-16494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785629#comment-15785629 ] koert kuipers commented on SPARK-16494: --- I just ran into an issue because of this when trying to upgrade to Spark 2.1.0: breeze 0.12 introduces a dependency on shapeless 2.0.0, which is old (April 2014) and not compatible with the version(s) we are using. > Upgrade breeze version to 0.12 > -- > > Key: SPARK-16494 > URL: https://issues.apache.org/jira/browse/SPARK-16494 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > breeze 0.12 has been released for more than half a year, and it brings lots > of new features, performance improvements and bug fixes. > One of the biggest features is LBFGS-B, which is an implementation of LBFGS > with box constraints and is much faster for some special cases. > We would like to implement the Huber loss function for {{LinearRegression}} > (SPARK-3181) and it requires LBFGS-B as the optimization solver. So we should > bump up the dependent breeze version to 0.12. > For more features, improvements and bug fixes of breeze 0.12, you can refer to > the following link: > https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c
[jira] [Assigned] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18698: Assignee: (was: Apache Spark) > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod >Priority: Minor > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipelinestages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785627#comment-15785627 ] Apache Spark commented on SPARK-18698: -- User 'imatiach-msft' has created a pull request for this issue: https://github.com/apache/spark/pull/16436 > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod >Priority: Minor > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipelinestages makes it much easier to work > with complex models) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19027) estimate size of object buffer for object hash aggregate
[ https://issues.apache.org/jira/browse/SPARK-19027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19027: Assignee: Wenchen Fan (was: Apache Spark) > estimate size of object buffer for object hash aggregate > > > Key: SPARK-19027 > URL: https://issues.apache.org/jira/browse/SPARK-19027 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19027) estimate size of object buffer for object hash aggregate
[ https://issues.apache.org/jira/browse/SPARK-19027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785604#comment-15785604 ] Apache Spark commented on SPARK-19027: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16435 > estimate size of object buffer for object hash aggregate > > > Key: SPARK-19027 > URL: https://issues.apache.org/jira/browse/SPARK-19027 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19027) estimate size of object buffer for object hash aggregate
[ https://issues.apache.org/jira/browse/SPARK-19027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19027: Assignee: Apache Spark (was: Wenchen Fan) > estimate size of object buffer for object hash aggregate > > > Key: SPARK-19027 > URL: https://issues.apache.org/jira/browse/SPARK-19027 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19027) estimate size of object buffer for object hash aggregate
Wenchen Fan created SPARK-19027: --- Summary: estimate size of object buffer for object hash aggregate Key: SPARK-19027 URL: https://issues.apache.org/jira/browse/SPARK-19027 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17346) Kafka 0.10 support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785560#comment-15785560 ] koert kuipers commented on SPARK-17346: --- this ticket mentions kafka 0.10-based sinks for structured streaming, but i think only sources are implemented. is there another ticket for sinks? thanks > Kafka 0.10 support in Structured Streaming > -- > > Key: SPARK-17346 > URL: https://issues.apache.org/jira/browse/SPARK-17346 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Frederick Reiss >Assignee: Shixiong Zhu > Fix For: 2.0.2, 2.1.0 > > > Implement Kafka 0.10-based sources and sinks for Structured Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18698) public constructor with uid for IndexToString-class
[ https://issues.apache.org/jira/browse/SPARK-18698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785558#comment-15785558 ] Ilya Matiach commented on SPARK-18698: -- This looks like a minor bug... similar transformers have such a constructor. I can send a pull request for this change. > public constructor with uid for IndexToString-class > --- > > Key: SPARK-18698 > URL: https://issues.apache.org/jira/browse/SPARK-18698 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.0.2 >Reporter: Bjoern Toldbod >Priority: Minor > > The IndexToString class in org.apache.spark.ml.feature does not provide a > public constructor which takes a uid string. > It would be nice to have such a constructor. > (Generally, being able to name pipeline stages makes it much easier to work > with complex models)
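The wish is for a constructor overload that accepts the uid, as sibling transformers already expose. A hypothetical Python analogue of the pattern (names are illustrative, not Spark's actual API):

```python
import uuid

class NamedStage:
    """Sketch of a pipeline stage with a public uid-taking constructor,
    falling back to a generated uid when none is supplied."""

    def __init__(self, uid=None):
        # Allow callers to name the stage; generate a uid otherwise.
        self.uid = uid if uid is not None else "idxToStr_" + uuid.uuid4().hex[:8]

named = NamedStage(uid="labelDecoder")  # caller-chosen name, useful in complex models
auto = NamedStage()                     # generated name, as today
```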
[jira] [Commented] (SPARK-17645) Add feature selector methods based on: False Discovery Rate (FDR) and Family Wise Error rate (FWE)
[ https://issues.apache.org/jira/browse/SPARK-17645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785524#comment-15785524 ] Apache Spark commented on SPARK-17645: -- User 'mpjlu' has created a pull request for this issue: https://github.com/apache/spark/pull/16434 > Add feature selector methods based on: False Discovery Rate (FDR) and Family > Wise Error rate (FWE) > -- > > Key: SPARK-17645 > URL: https://issues.apache.org/jira/browse/SPARK-17645 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Peng Meng >Assignee: Peng Meng >Priority: Minor > Fix For: 2.2.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > Univariate feature selection works by selecting the best features based on > univariate statistical tests. > FDR and FWE are a popular univariate statistical test for feature selection. > In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the > 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg > procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. > In statistics, FWE is the probability of making one or more false > discoveries, or type I errors, among all the hypotheses when performing > multiple hypotheses tests. > https://en.wikipedia.org/wiki/Family-wise_error_rate > We add FDR and FWE methods for ChiSqSelector in this PR, like it is > implemented in scikit-learn. > http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
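The Benjamini-Hochberg procedure behind the FDR option can be sketched in a few lines (a standalone illustration, not the scikit-learn or Spark implementation): rank the p-values, find the largest rank k with p(k) <= (k/m) * alpha, and select the features at ranks 1..k.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a keep/reject flag per feature under the BH step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose p-value sits under the BH line (k/m) * alpha.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k = rank
    selected = set(order[:k])
    return [i in selected for i in range(m)]
```

Note the step-up nature: every feature up to rank k is selected, even when an intermediate p-value sits above its own per-rank threshold.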
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785496#comment-15785496 ] Nattavut Sutyanyong commented on SPARK-19017: - In 3-value logic, true OR unknown = true. Using your formula above, we will have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong updated SPARK-19017: Comment: was deleted (was: In 3-value logic, true OR unknown = true. Using your formula above, we will have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true.) > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785494#comment-15785494 ] Nattavut Sutyanyong commented on SPARK-19017: - In 3-value logic, true OR unknown = true. Using your formula above, we will have (2,1) NOT IN (1,null) evaluated as (2 <> 1) OR (1 <> null) which is true. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785469#comment-15785469 ] Ilya Matiach commented on SPARK-18693: -- I can take a look into fixing this issue. > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
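What the request amounts to: evaluator metrics that fold in a per-row weight. A minimal sketch of a weighted RMSE (illustrative only, not the evaluators' actual code):

```python
import math

def weighted_rmse(predictions, labels, weights):
    """sqrt(sum(w * (p - y)^2) / sum(w)): RMSE honoring a weight column.
    With all weights equal to 1 this reduces to the ordinary RMSE."""
    num = sum(w * (p - y) ** 2 for p, y, w in zip(predictions, labels, weights))
    return math.sqrt(num / sum(weights))
```

Setting a row's weight to zero removes its error contribution entirely, which is the behavior a weight column trained into the model would lead users to expect from the evaluator.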
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785436#comment-15785436 ] Herman van Hovell commented on SPARK-19017: --- Ok, that is fair. Let me correct my mistake. {{NOT IN}} can be rewritten into a sequence of NOT-equal statements. Each statement contains one tuple of the subquery relation. So we would get something like: {noformat} WHERE (NOT (a1 = a2(1) AND b1 = b2(1))) AND (NOT (a1 = a2(2) AND b1 = b2(2))) AND ... AND (NOT (a1 = a2(n) AND b1 = b2(n))) {noformat} Which can be rewritten into: {noformat} WHERE (a1 <> a2(1) OR b1 <> b2(1)) AND (a1 <> a2(2) OR b1 <> b2(2)) AND ... AND (a1 <> a2(n) OR b1 <> b2(n)) {noformat} This would evaluate to null if one of the tuples in the subquery relation contains a null. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code}
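The rewrite above can be executed directly under SQL's three-valued logic. A small Python model (with None standing in for unknown) makes the thread concrete: a null only drives the disjunct to unknown when the other side is not already true, which is the point made in the surrounding comments.

```python
def neq3(x, y):
    # SQL <>: unknown (None) when either side is NULL.
    if x is None or y is None:
        return None
    return x != y

def or3(a, b):
    # Three-valued OR: true dominates, then unknown, then false.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def and3(a, b):
    # Three-valued AND: false dominates, then unknown, then true.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def not_in(row, subquery_rows):
    """(a1,b1) NOT IN (subquery): AND over tuples of (a1<>a2 OR b1<>b2)."""
    result = True
    for sub in subquery_rows:
        pair = or3(neq3(row[0], sub[0]), neq3(row[1], sub[1]))
        result = and3(result, pair)
    return result
```

For the reported data, (2,1) NOT IN ((1,null)) evaluates to true under this model, so the row should be returned; a query that filters it out is dropping a row.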
[jira] [Commented] (SPARK-19017) NOT IN subquery with more than one column may return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785417#comment-15785417 ] Nattavut Sutyanyong commented on SPARK-19017: - Using your interpretation, (2,1) not in (2,0) would be evaluated to false. Spark returns (2,1). So do many other SQL engines. > NOT IN subquery with more than one column may return incorrect results > -- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19026) local directories cannot be cleaned up when creating the "executor-***" directory throws an IOException (e.g. no more free disk space)
[ https://issues.apache.org/jira/browse/SPARK-19026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785220#comment-15785220 ] zuotingbing commented on SPARK-19026: - I will submit the code once this issue is accepted. > local directories cannot be cleaned up when creating the "executor-***" directory > throws an IOException (e.g. no more free disk space) > --- > > Key: SPARK-19026 > URL: https://issues.apache.org/jira/browse/SPARK-19026 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 2.0.2 > Environment: linux >Reporter: zuotingbing > > I set the SPARK_LOCAL_DIRS variable like this: > SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp > When there is no more free disk space on "/data4/spark/tmp", the other local > directories (/data2/spark/tmp, /data3/spark/tmp) cannot be cleaned up when my > application finishes. > We should catch the IOException when creating a local dir throws an exception; > otherwise the variable "appDirectories(appId)" is not set, and the "executor-***" > local directories cannot be deleted for this application. If the > number of "executor-***" folders exceeds 32k we cannot create executors anymore on > this worker node.
[jira] [Updated] (SPARK-19026) local directories cannot be cleaned up when creating the "executor-***" directory throws an IOException (e.g. when there is no more free disk space)
[ https://issues.apache.org/jira/browse/SPARK-19026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-19026: Description: I set the SPARK_LOCAL_DIRS variable like this: SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp When there is no more free disk space on "/data4/spark/tmp", the other local directories (/data2/spark/tmp, /data3/spark/tmp) cannot be cleaned up when my application finishes. We should catch the IOException when creating the local dirs throws an exception; otherwise the variable "appDirectories(appId)" is never set, and the local directories "executor-***" cannot be deleted for this application. If the number of "executor-***" folders exceeds 32k, we cannot create executors anymore on this worker node. was: i set SPARK_LOCAL_DIRS variable like this: SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp when there is no more free disk space on "/data4/spark/tmp" , other local directories (/data2/spark/tmp,/data3/spark/tmp) cannot be cleanuped when my application finished. we should catch the IOExecption when create local dirs throws execption , otherwise the variable "appDirectories(appId)" not be set , then local directories "executor-***" cannot be deleted for this application. If the number of folders "executor-***" > 32k we cannot created executor anymore on this worker node. > local directories cannot be cleaned up when creating the "executor-***" directory > throws an IOException (e.g. when there is no more free disk space)
> --- > > Key: SPARK-19026 > URL: https://issues.apache.org/jira/browse/SPARK-19026 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 2.0.2 > Environment: linux >Reporter: zuotingbing > > I set the SPARK_LOCAL_DIRS variable like this: > SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp > When there is no more free disk space on "/data4/spark/tmp", the other local > directories (/data2/spark/tmp, /data3/spark/tmp) cannot be cleaned up when my > application finishes. > We should catch the IOException when creating the local dirs throws an exception; > otherwise the variable "appDirectories(appId)" is never set, and the local > directories "executor-***" cannot be deleted for this application. If the > number of "executor-***" folders exceeds 32k, we cannot create executors anymore on > this worker node.
[jira] [Created] (SPARK-19026) local directories cannot be cleaned up when creating the "executor-***" directory throws an IOException (e.g. when there is no more free disk space)
zuotingbing created SPARK-19026: --- Summary: local directories cannot be cleaned up when creating the "executor-***" directory throws an IOException (e.g. when there is no more free disk space) Key: SPARK-19026 URL: https://issues.apache.org/jira/browse/SPARK-19026 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 1.5.2 Environment: linux Reporter: zuotingbing I set the SPARK_LOCAL_DIRS variable like this: SPARK_LOCAL_DIRS=/data2/spark/tmp,/data3/spark/tmp,/data4/spark/tmp When there is no more free disk space on "/data4/spark/tmp", the other local directories (/data2/spark/tmp, /data3/spark/tmp) cannot be cleaned up when my application finishes. We should catch the IOException when creating the local dirs throws an exception; otherwise the variable "appDirectories(appId)" is never set, and the local directories "executor-***" cannot be deleted for this application. If the number of "executor-***" folders exceeds 32k, we cannot create executors anymore on this worker node.
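The pattern the reporter is asking for — catch the creation failure so that directories created before it are still registered for cleanup — can be sketched in Python (all names here are hypothetical and only echo the report's "appDirectories"; Spark's actual code is Scala and structured differently):

```python
import os
import shutil

# appId -> list of executor dirs that were actually created (cf. appDirectories)
app_directories = {}

def create_executor_dirs(app_id, local_roots):
    """Create one executor dir per configured local root.
    Each successfully created dir is recorded *before* the next attempt, and
    OSError (e.g. disk full) is caught, so a failure on one root does not
    lose track of dirs already created on the other roots."""
    created = app_directories.setdefault(app_id, [])
    for root in local_roots:
        path = os.path.join(root, "executor-%s" % app_id)
        try:
            os.makedirs(path, exist_ok=True)
            created.append(path)
        except OSError as e:
            # Without this catch, the whole registration would be skipped and
            # none of the dirs would ever be deleted -- the reported leak.
            print("could not create %s: %s" % (path, e))
    return created

def cleanup_app(app_id):
    """Delete every directory that was recorded for this application."""
    for path in app_directories.pop(app_id, []):
        shutil.rmtree(path, ignore_errors=True)
```

With this shape, a full disk on one root degrades that root only; the remaining "executor-***" directories are still deleted when the application finishes.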
[jira] [Created] (SPARK-19025) Remove SQL builder for operators
Jiang Xingbo created SPARK-19025: Summary: Remove SQL builder for operators Key: SPARK-19025 URL: https://issues.apache.org/jira/browse/SPARK-19025 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Jiang Xingbo With the new approach to view resolution, we can get rid of SQL generation on view creation, so let's remove the SQL builder for operators. Note that, since all SQL generation for operators is defined in one file (org.apache.spark.sql.catalyst.SQLBuilder), it would be trivial to recover it in the future.
[jira] [Created] (SPARK-19024) Don't generate SQL query on CREATE/ALTER a view
Jiang Xingbo created SPARK-19024: Summary: Don't generate SQL query on CREATE/ALTER a view Key: SPARK-19024 URL: https://issues.apache.org/jira/browse/SPARK-19024 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Jiang Xingbo On CREATE/ALTER a view, we no longer need to generate a SQL text string from the LogicalPlan; instead we store the SQL query text, the output schema of the LogicalPlan, and the current database in the CatalogTable. The new view resolution approach will be able to resolve the view. The main advantages are: 1. If you update an underlying view, the current view also gets updated; 2. It gives us a chance to get rid of SQL generation for operators. This should bring in the following changes: 1. Add new params to `CatalogTable` that represent the SQL query text, the output schema of the LogicalPlan, and the current database at the time the view is created; 2. Update the commands `CreateViewCommand` and `AlterViewAsCommand` to get rid of SQL generation in them.
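The three pieces of state this ticket proposes to persist, and how a resolver might later use them, can be modeled with a toy Python sketch (every name here is hypothetical and simplified; Spark's CatalogTable is a Scala case class with a different shape):

```python
from dataclasses import dataclass

@dataclass
class CatalogTable:
    """Toy stand-in for the proposal: store the view's query text verbatim,
    plus the schema and current database captured at CREATE VIEW time."""
    name: str
    view_text: str          # original SQL query text (no SQL generation needed)
    view_schema: list       # output column names of the LogicalPlan
    default_database: str   # database that was current when the view was created

def resolve_view(table, parse, current_db_stack):
    """Re-parse the stored query text with the view's own default database
    pushed as current, so unqualified table names resolve as they did at
    creation time; then report the schema captured at creation."""
    current_db_stack.append(table.default_database)
    try:
        plan = parse(table.view_text)   # caller-supplied parser stub
    finally:
        current_db_stack.pop()          # restore the caller's current database
    return {"plan": plan, "output": table.view_schema}
```

The point of the sketch: because the raw text is stored, updating an underlying table or view changes what `parse` sees the next time the view is resolved, which is exactly the "underlying view updated, current view follows" advantage the ticket describes.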
[jira] [Created] (SPARK-19023) Memory leak on GraphX with an iterative algorithm and checkpoint on the graph
Julien MASSIOT created SPARK-19023: -- Summary: Memory leak on GraphX with an iterative algorithm and checkpoint on the graph Key: SPARK-19023 URL: https://issues.apache.org/jira/browse/SPARK-19023 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.0.2 Reporter: Julien MASSIOT I am facing OOM within a Spark Streaming application with GraphX. While trying to reproduce the issue in a simple application, I was able to identify what appear to be 2 kinds of memory leaks.
*Leak 1* It can be reproduced with this simple Scala application (which simulates more or less what I'm doing in my Spark Streaming application, each iteration of the loop simulating one micro-batch).
{code:title=TestGraph.scala|borderStyle=solid}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

object TestGraph {
  case class ImpactingEvent(entityInstance: String)
  case class ImpactedNode(entityIsntance: String)
  case class RelationInstance(relationType: String)

  var impactingGraph: Graph[ImpactedNode, RelationInstance] = null

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TestImpactingPropagation").setMaster("local")
    conf.set("spark.checkpoint.checkpointAllMarkedAncestors", "True")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val vertices: RDD[(VertexId, ImpactedNode)] = sc.parallelize(Array(
      (1L, ImpactedNode("Node1")),
      (2L, ImpactedNode("Node2")),
      (3L, ImpactedNode("Node3"))))
    val edges: RDD[Edge[RelationInstance]] = sc.parallelize(Array(
      Edge(1L, 2L, RelationInstance("Required")),
      Edge(1L, 2L, RelationInstance("Failover"))))
    impactingGraph = Graph(vertices, edges, null)
    var x = 0
    for (x <- 1 to 10) {
      impactingGraph = propagateEvent(impactingGraph, ImpactingEvent("node1"), sc)
      impactingGraph.checkpoint()
      impactingGraph.edges.count()
      impactingGraph.vertices.count()
    }
    println("Hello")
    Thread.sleep(1000)
  }

  private def propagateEvent(impactingGraph: Graph[ImpactedNode, RelationInstance], event: ImpactingEvent, sc: SparkContext): Graph[ImpactedNode, RelationInstance] = {
    var graph = impactingGraph.mapVertices((id, node) => node).cache
    impactingGraph.unpersist(true)
    graph.cache()
  }
}
{code}
In this simple application, I am just applying a mapVertices transformation on the graph and then doing a checkpoint on the graph. I do this operation 10 times. After the application finishes the loop, I take a heap dump. In this heap dump, I can see 11 "live" GraphImpl instances in memory. My expectation is to have only 1 (the one referenced by the global variable impactingGraph). The "leak" comes from the f function in a MapPartitionsRDD (which is referenced by the partitionsRDD variable of my VertexRDD). This f function contains an outer reference to the graph created in the previous iteration. I can see that in the clearDependencies function of MapPartitionsRDD, the f function is not reset to null. While looking for similar issues, I found this pull request: [https://github.com/apache/spark/pull/3545] In this pull request, the f variable is reset to null in the clearDependencies method of ZippedPartitionsRDD. I do not understand why the same is not done in MapPartitionsRDD. I tried patching spark-core by setting f to null in clearDependencies of MapPartitionsRDD, and it solved my leak in this simple use case. Don't you think the f variable should also be reset to null in MapPartitionsRDD? *Leak 2* Now I'll do the same, but in the propagateEvent method, in addition to the mapVertices, I am doing a joinVertices on the graph.
It can be found in the following application: {code:title=TestGraph.scala|borderStyle=solid} import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.graphx.Graph import org.apache.spark.rdd.RDD import org.apache.spark.graphx._ object TestGraph { case class ImpactingEvent(entityInstance: String) case class ImpactedNode(entityIsntance:String) case class RelationInstance(relationType : String) var impactingGraph : Graph[ImpactedNode, RelationInstance] = null; def main(args: Array[String]) { val conf = new SparkConf().setAppName("TestImpactingPropagation").setMaster("local") conf.set("spark.checkpoint.checkpointAllMarkedAncestors", "True") val sc = new SparkContext(conf) sc.setLogLevel("ERROR") val vertices: RDD[(VertexId, ImpactedNode)] = sc.parallelize(Array( (1L, ImpactedNode("Node1")), (2L,
[jira] [Assigned] (SPARK-19022) Fix tests dependent on OS due to different newline characters
[ https://issues.apache.org/jira/browse/SPARK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19022: Assignee: Apache Spark > Fix tests dependent on OS due to different newline characters > - > > Key: SPARK-19022 > URL: https://issues.apache.org/jira/browse/SPARK-19022 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > There are two tests failing on Windows due to the different newlines. > {code} > - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds) >"{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", >"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { >"description" : "sink" > } >}" did not equal "{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", >"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], 
> "sink" : { >"description" : "sink" > } >}" (StreamingQueryStatusAndProgressSuite.scala:36) > {code} > {code} > - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds) >"{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" did not equal "{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" (StreamingQueryStatusAndProgressSuite.scala:115) >org.scalatest.exceptions.TestFailedException: > {code} > The reason is, {{pretty}} in {{org.json4s.pretty}} writes OS-dependent > newlines but the string defined in the tests are {{\n}}. This ends up with > test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19022) Fix tests dependent on OS due to different newline characters
[ https://issues.apache.org/jira/browse/SPARK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19022: Assignee: (was: Apache Spark) > Fix tests dependent on OS due to different newline characters > - > > Key: SPARK-19022 > URL: https://issues.apache.org/jira/browse/SPARK-19022 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Reporter: Hyukjin Kwon >Priority: Minor > > There are two tests failing on Windows due to the different newlines. > {code} > - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds) >"{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", >"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { >"description" : "sink" > } >}" did not equal "{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", >"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { 
>"description" : "sink" > } >}" (StreamingQueryStatusAndProgressSuite.scala:36) > {code} > {code} > - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds) >"{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" did not equal "{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" (StreamingQueryStatusAndProgressSuite.scala:115) >org.scalatest.exceptions.TestFailedException: > {code} > The reason is, {{pretty}} in {{org.json4s.pretty}} writes OS-dependent > newlines but the string defined in the tests are {{\n}}. This ends up with > test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19022) Fix tests dependent on OS due to different newline characters
[ https://issues.apache.org/jira/browse/SPARK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784908#comment-15784908 ] Apache Spark commented on SPARK-19022: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16433 > Fix tests dependent on OS due to different newline characters > - > > Key: SPARK-19022 > URL: https://issues.apache.org/jira/browse/SPARK-19022 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Reporter: Hyukjin Kwon >Priority: Minor > > There are two tests failing on Windows due to the different newlines. > {code} > - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds) >"{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", >"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { >"description" : "sink" > } >}" did not equal "{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", 
>"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { >"description" : "sink" > } >}" (StreamingQueryStatusAndProgressSuite.scala:36) > {code} > {code} > - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds) >"{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" did not equal "{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" (StreamingQueryStatusAndProgressSuite.scala:115) >org.scalatest.exceptions.TestFailedException: > {code} > The reason is, {{pretty}} in {{org.json4s.pretty}} writes OS-dependent > newlines but the string defined in the tests are {{\n}}. This ends up with > test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19022) Fix tests dependent on OS due to different newline characters
[ https://issues.apache.org/jira/browse/SPARK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784894#comment-15784894 ] Hyukjin Kwon commented on SPARK-19022: -- It seems these are (almost) all instances across the tests on Windows. I will double check in the PR again. > Fix tests dependent on OS due to different newline characters > - > > Key: SPARK-19022 > URL: https://issues.apache.org/jira/browse/SPARK-19022 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Reporter: Hyukjin Kwon >Priority: Minor > > There are two tests failing on Windows due to the different newlines. > {code} > - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds) >"{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", >"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { >"description" : "sink" > } >}" did not equal "{ > "id" : "39788670-6722-48b7-a248-df6ba08722ac", > "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", > "name" : "myName", > "timestamp" : "2016-12-05T20:54:20.827Z", > "numInputRows" : 678, > "inputRowsPerSecond" : 10.0, > "durationMs" : { >"total" : 0 > }, > "eventTime" : { >"avg" : "2016-12-05T20:54:20.827Z", >"max" : "2016-12-05T20:54:20.827Z", >"min" : "2016-12-05T20:54:20.827Z", >"watermark" : "2016-12-05T20:54:20.827Z" > }, > "stateOperators" : [ { >"numRowsTotal" : 0, >"numRowsUpdated" : 1 > } ], > "sources" : [ { >"description" : "source", 
>"startOffset" : 123, >"endOffset" : 456, >"numInputRows" : 678, >"inputRowsPerSecond" : 10.0 > } ], > "sink" : { >"description" : "sink" > } >}" (StreamingQueryStatusAndProgressSuite.scala:36) > {code} > {code} > - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds) >"{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" did not equal "{ > "message" : "active", > "isDataAvailable" : true, > "isTriggerActive" : false >}" (StreamingQueryStatusAndProgressSuite.scala:115) >org.scalatest.exceptions.TestFailedException: > {code} > The reason is, {{pretty}} in {{org.json4s.pretty}} writes OS-dependent > newlines but the string defined in the tests are {{\n}}. This ends up with > test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19022) Fix tests dependent on OS due to different newline characters
Hyukjin Kwon created SPARK-19022: Summary: Fix tests dependent on OS due to different newline characters Key: SPARK-19022 URL: https://issues.apache.org/jira/browse/SPARK-19022 Project: Spark Issue Type: Test Components: Structured Streaming, Tests Reporter: Hyukjin Kwon Priority: Minor There are two tests failing on Windows due to the different newlines. {code} - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds) "{ "id" : "39788670-6722-48b7-a248-df6ba08722ac", "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", "name" : "myName", "timestamp" : "2016-12-05T20:54:20.827Z", "numInputRows" : 678, "inputRowsPerSecond" : 10.0, "durationMs" : { "total" : 0 }, "eventTime" : { "avg" : "2016-12-05T20:54:20.827Z", "max" : "2016-12-05T20:54:20.827Z", "min" : "2016-12-05T20:54:20.827Z", "watermark" : "2016-12-05T20:54:20.827Z" }, "stateOperators" : [ { "numRowsTotal" : 0, "numRowsUpdated" : 1 } ], "sources" : [ { "description" : "source", "startOffset" : 123, "endOffset" : 456, "numInputRows" : 678, "inputRowsPerSecond" : 10.0 } ], "sink" : { "description" : "sink" } }" did not equal "{ "id" : "39788670-6722-48b7-a248-df6ba08722ac", "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", "name" : "myName", "timestamp" : "2016-12-05T20:54:20.827Z", "numInputRows" : 678, "inputRowsPerSecond" : 10.0, "durationMs" : { "total" : 0 }, "eventTime" : { "avg" : "2016-12-05T20:54:20.827Z", "max" : "2016-12-05T20:54:20.827Z", "min" : "2016-12-05T20:54:20.827Z", "watermark" : "2016-12-05T20:54:20.827Z" }, "stateOperators" : [ { "numRowsTotal" : 0, "numRowsUpdated" : 1 } ], "sources" : [ { "description" : "source", "startOffset" : 123, "endOffset" : 456, "numInputRows" : 678, "inputRowsPerSecond" : 10.0 } ], "sink" : { "description" : "sink" } }" (StreamingQueryStatusAndProgressSuite.scala:36) {code} {code} - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds) "{ "message" : "active", "isDataAvailable" : true, "isTriggerActive" : false }" did not equal "{ "message" 
: "active", "isDataAvailable" : true, "isTriggerActive" : false }" (StreamingQueryStatusAndProgressSuite.scala:115) org.scalatest.exceptions.TestFailedException: {code} The reason is, {{pretty}} in {{org.json4s.pretty}} writes OS-dependent newlines but the string defined in the tests are {{\n}}. This ends up with test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18801) Support resolve a nested view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-18801: - Summary: Support resolve a nested view (was: Add `View` operator to help resolve a view) > Support resolve a nested view > - > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, which > means: > 1. The new approach should be able to resolve the views that were created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1. Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views; note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with a CTE/Windows query.
[jira] [Updated] (SPARK-18801) Add `View` operator to help resolve a view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-18801: - Description: We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated. The new approach should be compatible with older versions of SPARK/HIVE, that means: 1. The new approach should be able to resolve the views that created by older versions of SPARK/HIVE; 2. The new approach should be able to resolve the views that are currently supported by SPARK SQL. The new approach mainly brings in the following changes: 1. Add a new operator called `View` to keep track of the CatalogTable that describes the view, and the output attributes as well as the child of the view; 2. Update the `ResolveRelations` rule to resolve the relations and views, note that a nested view should be resolved correctly; 3. Add `AnalysisContext` to enable us to still support a view created with CTE/Windows query. was: We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated. The new approach should be compatible with older versions of SPARK/HIVE, that means: 1. The new approach should be able to resolve the views that created by older versions of SPARK/HIVE; 2. The new approach should be able to resolve the views that are currently supported by SPARK SQL. The new approach mainly brings in the following changes: 1. Add a new operator called `View` to keep track of the CatalogTable that descripts the view, and the output attributes as well as the child of the view; 2. Update the `ResolveRelations` rule to resolve the relations and views, note that a nested view should be resolved correctly; 3. Add `AnalysisContext` to enable us to still support a view created with CTE/Windows query. 
> Add `View` operator to help resolve a view > -- > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, which > means: > 1. The new approach should be able to resolve the views that were created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1. Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views; note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with a CTE/Windows query.
[jira] [Updated] (SPARK-18801) Add `View` operator to help resolve a view
[ https://issues.apache.org/jira/browse/SPARK-18801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiang Xingbo updated SPARK-18801: - Description: We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated. The new approach should be compatible with older versions of SPARK/HIVE, which means: 1. The new approach should be able to resolve the views that were created by older versions of SPARK/HIVE; 2. The new approach should be able to resolve the views that are currently supported by SPARK SQL. The new approach mainly brings in the following changes: 1. Add a new operator called `View` to keep track of the CatalogTable that describes the view, and the output attributes as well as the child of the view; 2. Update the `ResolveRelations` rule to resolve the relations and views; note that a nested view should be resolved correctly; 3. Add `AnalysisContext` to enable us to still support a view created with a CTE/Windows query. was: We should add a new operator called `View` to keep track of the database name used on resolving a view. The analysis rule `ResolveRelations` should also be updated. After that change, we should be able to resolve a nested view. > Add `View` operator to help resolve a view > -- > > Key: SPARK-18801 > URL: https://issues.apache.org/jira/browse/SPARK-18801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo > > We should be able to resolve a nested view. The main advantage is that if you > update an underlying view, the current view also gets updated. > The new approach should be compatible with older versions of SPARK/HIVE, which > means: > 1. The new approach should be able to resolve the views that were created by > older versions of SPARK/HIVE; > 2. The new approach should be able to resolve the views that are > currently supported by SPARK SQL. > The new approach mainly brings in the following changes: > 1.
Add a new operator called `View` to keep track of the CatalogTable > that describes the view, and the output attributes as well as the child of > the view; > 2. Update the `ResolveRelations` rule to resolve the relations and > views; note that a nested view should be resolved correctly; > 3. Add `AnalysisContext` to enable us to still support a view created > with a CTE/Windows query.