[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON
[ https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Sarraf updated SPARK-10840: - Description: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } For the above following error is obtained: ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2) scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2) Where as, this works fine because all components are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] was: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } Where as, this works because all are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] > SparkSQL doesn't work well with JSON > > > Key: SPARK-10840 > URL: https://issues.apache.org/jira/browse/SPARK-10840 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ankit Sarraf >Priority: Trivial > Labels: JSON, Scala, SparkSQL > > Well formed JSON doesn't work with the 1.5.1 version while using > sqlContext.read.json(""): > { > "employees": { > "employee": [ > { > "name": "Mia", > "surname": "Radison", > "mobile": "7295913821", > "email": "miaradi...@sparky.com" > }, > { > "name": "Thor", > "surname": "Kovaskz", > "mobile": "8829177193", > "email": "tkova...@sparky.com" > }, > { > "name": "Bindy", > "surname": "Kvuls", > "mobile": "5033828845", > "email": "bind...@sparky.com" > } > ] > } > } > For the above following error is obtained: > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2) > scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2) > Where as, this works fine because all components are in the same line: > [ > {"name": "Mia","surname": "Radison","mobile": "7295913821","email": > "miaradi...@sparky.com"}, > {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": > "tkova...@sparky.com"}, > {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": > "bind...@sparky.com"} > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON
[ https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Sarraf updated SPARK-10840: - Description: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } For the above following error is obtained: ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2) scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2) Where as, this works fine because all components are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] was: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } For the above following error is obtained: ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2) scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2) Where as, this works fine because all components are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] > SparkSQL doesn't work well with JSON > > > Key: SPARK-10840 > URL: https://issues.apache.org/jira/browse/SPARK-10840 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ankit Sarraf >Priority: Trivial > Labels: JSON, Scala, SparkSQL > > Well formed JSON doesn't work with the 1.5.1 version while using > sqlContext.read.json(""): > { > "employees": { > "employee": [ > { > "name": "Mia", > "surname": "Radison", > "mobile": "7295913821", > "email": "miaradi...@sparky.com" > }, > { > "name": "Thor", > "surname": "Kovaskz", > "mobile": "8829177193", > "email": "tkova...@sparky.com" > }, > { > "name": "Bindy", > "surname": "Kvuls", > "mobile": "5033828845", > "email": "bind...@sparky.com" > } > ] > } > } > For the above following error is obtained: > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 2) > scala.MatchError: (VALUE_STRING,StructType()) (of class scala.Tuple2) > Where as, this works fine because all components are in the same line: > [ > {"name": "Mia","surname": "Radison","mobile": "7295913821","email": > "miaradi...@sparky.com"}, > {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": > "tkova...@sparky.com"}, > {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": > "bind...@sparky.com"} > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON
[ https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Sarraf updated SPARK-10840: - Description: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } Where as, this works because all are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] was: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } Where as, this works because all are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] > SparkSQL doesn't work well with JSON > > > Key: SPARK-10840 > URL: https://issues.apache.org/jira/browse/SPARK-10840 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ankit Sarraf >Priority: Trivial > Labels: JSON, Scala, SparkSQL > > Well formed JSON doesn't work with the 1.5.1 version while using > sqlContext.read.json(""): > > { > "employees": { > "employee": [ > { > "name": "Mia", > "surname": "Radison", > "mobile": "7295913821", > "email": "miaradi...@sparky.com" > }, > { > "name": "Thor", > "surname": "Kovaskz", > "mobile": "8829177193", > "email": "tkova...@sparky.com" > }, > { > "name": "Bindy", > "surname": "Kvuls", > "mobile": "5033828845", > "email": "bind...@sparky.com" > } > ] > } > } > > Where as, this works because all are in the same line: > [ > {"name": "Mia","surname": "Radison","mobile": "7295913821","email": > "miaradi...@sparky.com"}, > {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": > "tkova...@sparky.com"}, > {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": > "bind...@sparky.com"} > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10840) SparkSQL doesn't work well with JSON
Ankit Sarraf created SPARK-10840: Summary: SparkSQL doesn't work well with JSON Key: SPARK-10840 URL: https://issues.apache.org/jira/browse/SPARK-10840 Project: Spark Issue Type: Bug Components: SQL Reporter: Ankit Sarraf Priority: Trivial Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } Where as, this works because all are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10840) SparkSQL doesn't work well with JSON
[ https://issues.apache.org/jira/browse/SPARK-10840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Sarraf updated SPARK-10840: - Description: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } Where as, this works because all are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] was: Well formed JSON doesn't work with the 1.5.1 version while using sqlContext.read.json(""): { "employees": { "employee": [ { "name": "Mia", "surname": "Radison", "mobile": "7295913821", "email": "miaradi...@sparky.com" }, { "name": "Thor", "surname": "Kovaskz", "mobile": "8829177193", "email": "tkova...@sparky.com" }, { "name": "Bindy", "surname": "Kvuls", "mobile": "5033828845", "email": "bind...@sparky.com" } ] } } Where as, this works because all are in the same line: [ {"name": "Mia","surname": "Radison","mobile": "7295913821","email": "miaradi...@sparky.com"}, {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": "tkova...@sparky.com"}, {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": "bind...@sparky.com"} ] > SparkSQL doesn't work well with JSON > > > Key: SPARK-10840 > URL: https://issues.apache.org/jira/browse/SPARK-10840 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ankit Sarraf >Priority: Trivial > Labels: JSON, Scala, SparkSQL > > Well formed JSON doesn't work with the 1.5.1 version while using > sqlContext.read.json(""): > { > "employees": { > "employee": [ > { > "name": "Mia", > "surname": "Radison", > "mobile": "7295913821", > "email": "miaradi...@sparky.com" > }, > { > "name": "Thor", > "surname": "Kovaskz", > "mobile": "8829177193", > "email": "tkova...@sparky.com" > }, > { > "name": "Bindy", > "surname": "Kvuls", > "mobile": "5033828845", > "email": "bind...@sparky.com" > } > ] > } > } > Where as, this works because all are in the same line: > [ > {"name": "Mia","surname": "Radison","mobile": "7295913821","email": > "miaradi...@sparky.com"}, > {"name": "Thor","surname": "Kovaskz","mobile": "8829177193","email": > "tkova...@sparky.com"}, > {"name": "Bindy","surname": "Kvuls","mobile": "5033828845","email": > "bind...@sparky.com"} > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
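Spark 1.5's sqlContext.read.json(path) expects JSON Lines input, i.e. one complete JSON document per line, which is why the pretty-printed file above fails with the MatchError while the single-line records parse. A minimal workaround sketch (the file path is a placeholder): read each file whole and pass the collapsed text to read.json as an RDD of strings.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("MultiLineJson"))
val sqlContext = new SQLContext(sc)

// wholeTextFiles yields (path, fullFileContent) pairs, so each pretty-printed
// document becomes a single string regardless of line breaks.
val wholeFiles = sc.wholeTextFiles("employees.json").map { case (_, content) => content }

// read.json also accepts an RDD[String] with one JSON document per element.
val df = sqlContext.read.json(wholeFiles)
df.printSchema()
df.show()
{code}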
[jira] [Updated] (SPARK-10836) SparkR: Add sort function to dataframe
[ https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-10836: -- Summary: SparkR: Add sort function to dataframe (was: Add SparkR sort function to dataframe) > SparkR: Add sort function to dataframe > -- > > Key: SPARK-10836 > URL: https://issues.apache.org/jira/browse/SPARK-10836 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Hi everyone, > the sort function can be used as an alternative to arrange(... ). > As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of > orderings for columns and the list of columns, represented as string names > for example: > sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to > sort some of the columns in the same order > sort(df, decreasing=TRUE, "col1") > sort(df, decreasing=c(TRUE,FALSE), "col1","col2") > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
[ https://issues.apache.org/jira/browse/SPARK-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10839: Assignee: (was: Apache Spark) > SPARK_DAEMON_MEMORY has effect on heap size of thriftserver > --- > > Key: SPARK-10839 > URL: https://issues.apache.org/jira/browse/SPARK-10839 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.1, 1.5.0 >Reporter: Yun Zhao > > When SPARK_DAEMON_MEMORY in spark-env.sh is setted to modify memory of Master > or Worker, there's an effect on heap size of thriftserver, further, this > effect cannot be modified by spark.driver.memory or --driver-memory. Version > 1.3.1 does not have the same problem. > in org.apache.spark.launcher.SparkSubmitCommandBuilder: > {quote} > String tsMemory = > isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : > null; > String memory = firstNonEmpty(tsMemory, > firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props), > System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), > DEFAULT_MEM); > cmd.add("-Xms" + memory); > cmd.add("-Xmx" + memory); > {quote} > SPARK_DAEMON_MEMORY has the highest priority. > It can be modified like this: > {quote} > String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, > conf, props), > System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, > System.getenv("SPARK_MEM"), DEFAULT_MEM); > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
[ https://issues.apache.org/jira/browse/SPARK-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10839: Assignee: Apache Spark > SPARK_DAEMON_MEMORY has effect on heap size of thriftserver > --- > > Key: SPARK-10839 > URL: https://issues.apache.org/jira/browse/SPARK-10839 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.1, 1.5.0 >Reporter: Yun Zhao >Assignee: Apache Spark > > When SPARK_DAEMON_MEMORY in spark-env.sh is setted to modify memory of Master > or Worker, there's an effect on heap size of thriftserver, further, this > effect cannot be modified by spark.driver.memory or --driver-memory. Version > 1.3.1 does not have the same problem. > in org.apache.spark.launcher.SparkSubmitCommandBuilder: > {quote} > String tsMemory = > isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : > null; > String memory = firstNonEmpty(tsMemory, > firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props), > System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), > DEFAULT_MEM); > cmd.add("-Xms" + memory); > cmd.add("-Xmx" + memory); > {quote} > SPARK_DAEMON_MEMORY has the highest priority. > It can be modified like this: > {quote} > String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, > conf, props), > System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, > System.getenv("SPARK_MEM"), DEFAULT_MEM); > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
[ https://issues.apache.org/jira/browse/SPARK-10839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908998#comment-14908998 ] Apache Spark commented on SPARK-10839: -- User 'xiaowen147' has created a pull request for this issue: https://github.com/apache/spark/pull/8921 > SPARK_DAEMON_MEMORY has effect on heap size of thriftserver > --- > > Key: SPARK-10839 > URL: https://issues.apache.org/jira/browse/SPARK-10839 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.4.1, 1.5.0 >Reporter: Yun Zhao > > When SPARK_DAEMON_MEMORY in spark-env.sh is setted to modify memory of Master > or Worker, there's an effect on heap size of thriftserver, further, this > effect cannot be modified by spark.driver.memory or --driver-memory. Version > 1.3.1 does not have the same problem. > in org.apache.spark.launcher.SparkSubmitCommandBuilder: > {quote} > String tsMemory = > isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : > null; > String memory = firstNonEmpty(tsMemory, > firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props), > System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), > DEFAULT_MEM); > cmd.add("-Xms" + memory); > cmd.add("-Xmx" + memory); > {quote} > SPARK_DAEMON_MEMORY has the highest priority. > It can be modified like this: > {quote} > String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, > conf, props), > System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, > System.getenv("SPARK_MEM"), DEFAULT_MEM); > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10839) SPARK_DAEMON_MEMORY has effect on heap size of thriftserver
Yun Zhao created SPARK-10839: Summary: SPARK_DAEMON_MEMORY has effect on heap size of thriftserver Key: SPARK-10839 URL: https://issues.apache.org/jira/browse/SPARK-10839 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.5.0, 1.4.1 Reporter: Yun Zhao When SPARK_DAEMON_MEMORY in spark-env.sh is setted to modify memory of Master or Worker, there's an effect on heap size of thriftserver, further, this effect cannot be modified by spark.driver.memory or --driver-memory. Version 1.3.1 does not have the same problem. in org.apache.spark.launcher.SparkSubmitCommandBuilder: {quote} String tsMemory = isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null; String memory = firstNonEmpty(tsMemory, firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props), System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM); cmd.add("-Xms" + memory); cmd.add("-Xmx" + memory); {quote} SPARK_DAEMON_MEMORY has the highest priority. It can be modified like this: {quote} String memory = firstNonEmpty(firstNonEmptyValue(SparkLauncher.DRIVER_MEMORY, conf, props), System.getenv("SPARK_DRIVER_MEMORY"), tsMemory, System.getenv("SPARK_MEM"), DEFAULT_MEM); {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
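The point of the report is the precedence used when resolving the driver heap size. The sketch below is illustrative only (not the actual launcher code): it mimics the "first non-empty value wins" resolution, with the proposed ordering that consults SPARK_DAEMON_MEMORY only after the explicit driver-memory settings.

{code}
def firstNonEmpty(candidates: Option[String]*): Option[String] =
  candidates.flatten.find(_.nonEmpty)

def resolveDriverMemory(
    driverMemoryConf: Option[String],     // spark.driver.memory / --driver-memory
    sparkDriverMemoryEnv: Option[String], // SPARK_DRIVER_MEMORY
    daemonMemoryEnv: Option[String],      // SPARK_DAEMON_MEMORY (thrift server only)
    sparkMemEnv: Option[String]): String =
  firstNonEmpty(driverMemoryConf, sparkDriverMemoryEnv, daemonMemoryEnv, sparkMemEnv)
    .getOrElse("1g") // stand-in for DEFAULT_MEM

// With the proposed ordering, --driver-memory overrides SPARK_DAEMON_MEMORY:
assert(resolveDriverMemory(Some("4g"), None, Some("2g"), None) == "4g")
{code}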
[jira] [Updated] (SPARK-10838) Repeat to join one DataFrame twice,there will be AnalysisException.
[ https://issues.apache.org/jira/browse/SPARK-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Zhao updated SPARK-10838: - Description: The detail of exception is: {quote} Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator !Join Inner, Some((col_b#2 = col_a#1)); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908) at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554) at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521) {quote} The related codes are: {quote} import org.apache.spark.sql.SQLContext import org.apache.spark.\{SparkContext, SparkConf} object DFJoinTest extends App \{ case class Foo(col_a: String) case class Bar(col_a: String, col_b: String) val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("DFJoinTest")) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => Foo(p(0))).toDF() val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), p(1))).toDF() val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), $"col_b") df3.join(df2, df3("col_b") === df2("col_a")).show() // val df4 = df2.as("df4") // df3.join(df4, df3("col_b") === df4("col_a")).show() // df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() sc.stop() } {quote} When uses {quote} val df4 = df2.as("df4") df3.join(df4, df3("col_b") === df4("col_a")).show() {quote} there's errors,but when uses {quote} df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() {quote} it's normal. 
was: The detail of exception is: {quote} Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator !Join Inner, Some((col_b#2 = col_a#1)); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908) at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554) at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521) {quote} The related codes are: {quote} object DFJoinTest extends App { case class Foo(col_a: String) case class Bar(col_a: String, col_b: String) val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("DFJoinTest")) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => Foo(p(0))).toDF() val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), p(1))).toDF() val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), $"col_b") df3.join(df2, df3("col_b") === df2("col_a")).show() // val df4 = df2.as("df4") // df3.join(df4, df3("col_b") === df4("col_a")).show() // df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() sc.stop() } {quote} When uses {quote} val df4 = df2.as("df4") df3.join(df4, df3("col_b") === d
[jira] [Created] (SPARK-10838) Repeat to join one DataFrame twice,there will be AnalysisException.
Yun Zhao created SPARK-10838: Summary: Repeat to join one DataFrame twice,there will be AnalysisException. Key: SPARK-10838 URL: https://issues.apache.org/jira/browse/SPARK-10838 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Reporter: Yun Zhao The detail of exception is: {quote} Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator !Join Inner, Some((col_b#2 = col_a#1)); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908) at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554) at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521) {quote} The related codes are: {quote} object DFJoinTest extends App { case class Foo(col_a: String) case class Bar(col_a: String, col_b: String) val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("DFJoinTest")) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => Foo(p(0))).toDF() val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), p(1))).toDF() val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), $"col_b") df3.join(df2, df3("col_b") === df2("col_a")).show() // val df4 = df2.as("df4") // df3.join(df4, df3("col_b") === df4("col_a")).show() // df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() sc.stop() } {quote} When uses {quote} val df4 = df2.as("df4") df3.join(df4, df3("col_b") === df4("col_a")).show() {quote} there's errors,but when uses {quote} df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show() {quote} it's normal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
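Condensed spark-shell style sketch of the reporter's code, keeping only the variant the report says works: alias the reused DataFrame and address its columns through the alias.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Foo(col_a: String)
case class Bar(col_a: String, col_b: String)

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("DFJoinTest"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => Foo(p(0))).toDF()
val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), p(1))).toDF()
val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), $"col_b")

// Referencing df2's columns directly in a second join trips the analyzer
// (resolved attribute missing). Aliasing the reused DataFrame and referring to
// its columns through the alias resolves cleanly, as noted in the report:
df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()

sc.stop()
{code}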
[jira] [Created] (SPARK-10837) TimeStamp could not work on sparksql very well
Yun Zhao created SPARK-10837: Summary: TimeStamp could not work on sparksql very well Key: SPARK-10837 URL: https://issues.apache.org/jira/browse/SPARK-10837 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yun Zhao create a file as follows: {quote} 2015-09-02 09:06:00.000 2015-09-02 09:06:00.001 2015-09-02 09:06:00.100 2015-09-02 09:06:01.000 {quote} Then upload it to hdfs, for example,put it to /test/testTable. create table: {quote} CREATE EXTERNAL TABLE `testTable`(`createtime` timestamp) LOCATION '/test/testTable'; {quote} process sqls: {quote} select * from testTable where createtime = "2015-09-02 09:06:00.000"; select * from testTable where createtime > "2015-09-02 09:06:00.000"; select * from testTable where createtime >= "2015-09-02 09:06:00.000"; {quote} The set of ">=" is not union set of "=" and ">". but if process sqls as follows: {quote} select * from testTable where createtime = timestamp("2015-09-02 09:06:00.000"); select * from testTable where createtime > timestamp("2015-09-02 09:06:00.000"); select * from testTable where createtime >= timestamp("2015-09-02 09:06:00.000"); {quote} There's no such former problem. User *explain extended* to find the difference of sqls: When uses "=","2015-09-02 09:06:00.000" is transfered to timestamp. When uses ">" or ">=",createtime is transfered to String. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
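As the report notes, keeping the literal typed as a timestamp on both sides of the comparison avoids the inconsistent string/timestamp coercion. A hedged sketch using an explicit CAST, which should behave like the timestamp(...) form in the report; it assumes the external table described above already exists.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("TimestampPredicates"))
val hiveContext = new HiveContext(sc)

// Casting the literal keeps the comparison in the timestamp domain, so "=", ">"
// and ">=" stay consistent with each other.
hiveContext.sql(
  """SELECT * FROM testTable
    |WHERE createtime >= CAST('2015-09-02 09:06:00.000' AS TIMESTAMP)""".stripMargin).show()
{code}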
[jira] [Assigned] (SPARK-10836) Add SparkR sort function to dataframe
[ https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10836: Assignee: Apache Spark > Add SparkR sort function to dataframe > - > > Key: SPARK-10836 > URL: https://issues.apache.org/jira/browse/SPARK-10836 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Assignee: Apache Spark >Priority: Minor > > Hi everyone, > the sort function can be used as an alternative to arrange(... ). > As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of > orderings for columns and the list of columns, represented as string names > for example: > sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to > sort some of the columns in the same order > sort(df, decreasing=TRUE, "col1") > sort(df, decreasing=c(TRUE,FALSE), "col1","col2") > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10836) Add SparkR sort function to dataframe
[ https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10836: Assignee: (was: Apache Spark) > Add SparkR sort function to dataframe > - > > Key: SPARK-10836 > URL: https://issues.apache.org/jira/browse/SPARK-10836 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Hi everyone, > the sort function can be used as an alternative to arrange(... ). > As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of > orderings for columns and the list of columns, represented as string names > for example: > sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to > sort some of the columns in the same order > sort(df, decreasing=TRUE, "col1") > sort(df, decreasing=c(TRUE,FALSE), "col1","col2") > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10836) Add SparkR sort function to dataframe
[ https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908907#comment-14908907 ] Apache Spark commented on SPARK-10836: -- User 'NarineK' has created a pull request for this issue: https://github.com/apache/spark/pull/8920 > Add SparkR sort function to dataframe > - > > Key: SPARK-10836 > URL: https://issues.apache.org/jira/browse/SPARK-10836 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Hi everyone, > the sort function can be used as an alternative to arrange(... ). > As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of > orderings for columns and the list of columns, represented as string names > for example: > sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to > sort some of the columns in the same order > sort(df, decreasing=TRUE, "col1") > sort(df, decreasing=c(TRUE,FALSE), "col1","col2") > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10836) Add SparkR sort function to dataframe
Narine Kokhlikyan created SPARK-10836: - Summary: Add SparkR sort function to dataframe Key: SPARK-10836 URL: https://issues.apache.org/jira/browse/SPARK-10836 Project: Spark Issue Type: Sub-task Reporter: Narine Kokhlikyan Priority: Minor Hi everyone, the sort function can be used as an alternative to arrange(... ). As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of orderings for columns and the list of columns, represented as string names for example: sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to sort some of the columns in the same order sort(df, decreasing=TRUE, "col1") sort(df, decreasing=c(TRUE,FALSE), "col1","col2") Thanks, Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
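For comparison, the existing Scala DataFrame API expresses the same per-column orderings with orderBy plus asc/desc. A minimal sketch (column names and data are illustrative) of what the proposed sort(df, decreasing=c(TRUE, FALSE), "col1", "col2") corresponds to:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{asc, desc}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("SortExample"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// col1 descending, col2 ascending, mirroring decreasing=c(TRUE, FALSE).
val df = sc.parallelize(Seq(("a", 2), ("b", 1), ("a", 1))).toDF("col1", "col2")
df.orderBy(desc("col1"), asc("col2")).show()
{code}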
[jira] [Commented] (SPARK-9883) Distance to each cluster given a point
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908884#comment-14908884 ] Joseph K. Bradley commented on SPARK-9883: -- OK! I'll be traveling next week, but I'll try to take a look soon. > Distance to each cluster given a point > -- > > Key: SPARK-9883 > URL: https://issues.apache.org/jira/browse/SPARK-9883 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Bertrand Dechoux >Priority: Minor > > Right now KMeansModel provides only a 'predict 'method which returns the > index of the closest cluster. > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector) > It would be nice to have a method giving the distance to all clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
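Until such a method exists, the distances can be computed by hand from KMeansModel.clusterCenters. A small sketch of that workaround (toy data, Euclidean distance via Vectors.sqdist):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("KMeansDistances"))
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val model = KMeans.train(data, k = 2, maxIterations = 10)

// Distance from a point to every cluster centre, not just the closest one
// that predict() returns.
def distancesToCenters(point: Vector): Array[Double] =
  model.clusterCenters.map(center => math.sqrt(Vectors.sqdist(point, center)))

distancesToCenters(Vectors.dense(0.0, 0.1)).foreach(println)
{code}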
[jira] [Updated] (SPARK-10821) RandomForest serialization OOM during findBestSplits
[ https://issues.apache.org/jira/browse/SPARK-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Luan updated SPARK-10821: - Description: I am getting OOM during serialization for a relatively small dataset for a RandomForest. Even with spark.serializer.objectStreamReset at 1, It is still running out of memory when attempting to serialize my data. Stack Trace: Traceback (most recent call last): File "/root/random_forest/random_forest_spark.py", line 198, in main() File "/root/random_forest/random_forest_spark.py", line 166, in main trainModel(dset) File "/root/random_forest/random_forest_spark.py", line 191, in trainModel impurity='gini', maxDepth=4, maxBins=32) File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 352, in trainClassifier File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in _train File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError15/09/25 00:44:41 DEBUG BlockManagerSlaveEndpoint: Done removing RDD 7, response is 0 15/09/25 00:44:41 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to AkkaRpcEndpointRef(Actor[akka://sparkDriver/temp/$Mj]) : An error occurred while calling o89.trainRandomForestModel. : java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) at org.apache.spark.SparkContext.clean(SparkContext.scala:2021) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702) at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625) at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235) at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291) at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) Details: My RDD is type MLLIB LabeledPoint objects, with each holding sparse vectors inside. This RDD has a total size of roughly 45MB. My sparse vector has a total length of ~15 million while only about 3000 or so are non-zeros. Works fine for up to sparse vector size 10 million. My cluster is setup on AWS such that my master is a r3.8xlarge along with two r3.4xlarge workers. Driver has ~190GB allocated to it while my RD
[jira] [Commented] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908738#comment-14908738 ] Ben Duffield commented on SPARK-10635: -- Ok good flag that there are other places this'd need to be considered. How open would you be to a PR which addresses this? I.e. sure it's an assumption now - could we move away from that? > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments. > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908705#comment-14908705 ] Imran Rashid commented on SPARK-5928: - [~ariskk] The workaround is to increase the number of partitions. All of the operations which trigger a shuffle take an optional second argument with the number of partitions, eg., {{reduceByKey( reduceFunc, numPartitions)}}. In general, its best to err on the side of too many partitions, rather than too few. My rule of thumb is to try to size partitions to to have roughly 100 MB of data (I have heard others throw around numbers in roughly the same ballpark). Note that means you use a lot of partitions if you have say 1 TB of data you are shuffling. Its worth noting that if you have very skewed data, just increasing the number of partitions in the function that triggers the shuffle might not help. That controls the number of partitions on the shuffle-read (aka reduce) side, but not the shuffle-write (aka map) side. If one map task writes out 2GB of data for one key, increasing the number of reduce partitions won't help you, since no matter how many reduce partitions, you will still write 2GB into one shuffle block. (A shuffle block corresponds to one map task / reduce task pair.) In that case, you may want to increase the number of partitions for your *map* stage, so that it is writing less data to one particular key. You control the number of partitions for the map-stage either at the previous operation that triggered a shuffle (eg., a preceding {{reduceByKey}}), or the operation that loaded the data (eg, {{sc.textFile}}). Eg: {noformat} val rawData = sc.textFile(..., numPartitionsFirstStage) // control the "map" partitions here val afterShuffle = rawData.map{...}.reduceByKey( ..., numPartitionsSecondStage) // control the "reduce" partitions here {noformat} My general recommendation, if you want to re-use your code, and have it work on a data sets of varying sizes, is to make the number of partitions at *every* stage some easily controllable parameter (eg., via the command line), so you can tweak things without having to recompile your code. > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote fetches. 
I triggered these exceptions running with > {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} > {noformat} > 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, > imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, > imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= > org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds > 2147483647: 3021252889 - discarded > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at > org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) > at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.s
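A runnable version of the rule of thumb in the comment above, with both partition counts exposed as parameters; the input path, key extraction, and the counts themselves are placeholders to adapt per job.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PartitionTuning"))

val numPartitionsFirstStage = 2000   // controls the "map" (shuffle-write) side
val numPartitionsSecondStage = 2000  // controls the "reduce" (shuffle-read) side

val rawData = sc.textFile("hdfs:///path/to/input", numPartitionsFirstStage)
val afterShuffle = rawData
  .map(line => (line.length % 100, 1L))           // stand-in for the real key extraction
  .reduceByKey(_ + _, numPartitionsSecondStage)   // reduce-side partition count set here

println(afterShuffle.count())
{code}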
[jira] [Comment Edited] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908651#comment-14908651 ] Ondřej Smola edited comment on SPARK-8734 at 9/25/15 9:01 PM: -- No it wont as spark config is internally stored in hashmap - i realized this when walking home :). What about this spark.mesos.executor.docker.parameter.abc abc spark.mesos.executor.docker.parameters.envFOO=BAR, ENV1=VAL1 I have simple working solution with tests - i need to write some docs and i will send link to my fork for further discussion Edit: repo https://github.com/ondrej-smola/spark/tree/feature/SPARK-8734 was (Author: ondrej.smola): No it wont as spark config is internally stored in hashmap - i realized this when walking home :). What about this spark.mesos.executor.docker.parameter.abc abc spark.mesos.executor.docker.parameters.envFOO=BAR, ENV1=VAL1 I have simple working solution with tests - i need to write some docs and i will send link to my fork for further discussion > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
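To make the naming scheme concrete: the two parameter keys below are only the proposal from this thread, not released Spark configuration; spark.mesos.executor.docker.image is the existing property they would sit alongside, and the image name is a placeholder.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MesosDockerParams")
  .set("spark.mesos.executor.docker.image", "example/spark-executor:latest") // existing property
  .set("spark.mesos.executor.docker.parameter.abc", "abc")                   // proposed only
  .set("spark.mesos.executor.docker.parameters.env", "FOO=BAR, ENV1=VAL1")   // proposed only
{code}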
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908651#comment-14908651 ] Ondřej Smola commented on SPARK-8734: - No it wont as spark config is internally stored in hashmap - i realized this when walking home :). What about this spark.mesos.executor.docker.parameter.abc abc spark.mesos.executor.docker.parameters.envFOO=BAR, ENV1=VAL1 I have simple working solution with tests - i need to write some docs and i will send link to my fork for further discussion > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908647#comment-14908647 ] Alan Braithwaite commented on SPARK-8734: - Will this work with multiple instances of the same property? My concern is that there are some arguments which can be repeated and this scheme doesn't allow for that. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9103) Tracking spark's memory usage
[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908644#comment-14908644 ] Imran Rashid commented on SPARK-9103: - ah, of course, sorry I made a big mistake. I was thinking that you only need to keep the latest max value per executor. But of course if that max occurred before the latest stage started, then you need to reset your counter. And with concurrent stages, you can't simply reset one global counter, since you need the max within every window. Thanks for explaining it to me again! > Tracking spark's memory usage > - > > Key: SPARK-9103 > URL: https://issues.apache.org/jira/browse/SPARK-9103 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Reporter: Zhang, Liye > Attachments: Tracking Spark Memory Usage - Phase 1.pdf > > > Currently spark only provides little memory usage information (RDD cache on > webUI) for the executors. User have no idea on what is the memory consumption > when they are running spark applications with a lot of memory used in spark > executors. Especially when they encounter the OOM, it’s really hard to know > what is the cause of the problem. So it would be helpful to give out the > detail memory consumption information for each part of spark, so that user > can clearly have a picture of where the memory is exactly used. > The memory usage info to expose should include but not limited to shuffle, > cache, network, serializer, etc. > User can optionally choose to open this functionality since this is mainly > for debugging and tuning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6649) DataFrame created through SQLContext.jdbc() failed if columns table must be quoted
[ https://issues.apache.org/jira/browse/SPARK-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908609#comment-14908609 ] Rick Hillegas commented on SPARK-6649: -- Hi Fred, The backtick syntax seems to be a feature of HiveQL according to this discussion on the developer list: http://apache-spark-developers-list.1001551.n3.nabble.com/column-identifiers-in-Spark-SQL-td14280.html Thanks, -Rick > DataFrame created through SQLContext.jdbc() failed if columns table must be > quoted > -- > > Key: SPARK-6649 > URL: https://issues.apache.org/jira/browse/SPARK-6649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Frédéric Blanc >Priority: Minor > > If I want to import the content a table from oracle, that contains a column > with name COMMENT (a reserved keyword), I cannot use a DataFrame that map all > the columns of this table. > {code:title=ddl.sql|borderStyle=solid} > CREATE TABLE TEST_TABLE ( > "COMMENT" VARCHAR2(10) > ); > {code} > {code:title=test.java|borderStyle=solid} > SQLContext sqlContext = ... > DataFrame df = sqlContext.jdbc(databaseURL, "TEST_TABLE"); > df.rdd(); // => failed if the table contains a column with a reserved > keyword > {code} > The same problem can be encounter if reserved keyword are used on table name. > The JDBCRDD scala class could be improved, if the columnList initializer > append the double-quote for each column. (line : 225) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
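For reference, a minimal sketch of the backtick identifier quoting mentioned in the comment above. It assumes a HiveContext named hiveContext and a registered table test_table; it only illustrates the HiveQL-style quoting and does not change the JDBC column-quoting behaviour that the issue itself is about.
{code}
// Backticks quote identifiers in HiveQL-style SQL, so a reserved word such as
// COMMENT can be used as a column name. hiveContext and test_table are assumed.
val df = hiveContext.sql("SELECT `COMMENT` FROM test_table")

// On the DataFrame API the column can then be referenced by name directly.
df.select("COMMENT").show()
{code}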
[jira] [Commented] (SPARK-9883) Distance to each cluster given a point
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908598#comment-14908598 ] Bertrand Dechoux commented on SPARK-9883: - The patch is now ready for MLlib and is waiting for a technical review. I will look at the Pipelines API as the next step. > Distance to each cluster given a point > -- > > Key: SPARK-9883 > URL: https://issues.apache.org/jira/browse/SPARK-9883 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Bertrand Dechoux >Priority: Minor > > Right now KMeansModel provides only a 'predict' method which returns the > index of the closest cluster. > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector) > It would be nice to have a method giving the distance to all clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
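Until a method like the one proposed here is available, the distances can be computed by hand from the exposed cluster centres. A minimal sketch, using the existing MLlib KMeansModel and Vector types:
{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Distance from one point to every cluster centre, in centre order.
def distancesToClusters(model: KMeansModel, point: Vector): Array[Double] =
  model.clusterCenters.map(center => math.sqrt(Vectors.sqdist(center, point)))
{code}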
[jira] [Updated] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing
[ https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-10760: -- Assignee: Narine Kokhlikyan > SparkR glm: the documentation in examples - family argument is missing > -- > > Key: SPARK-10760 > URL: https://issues.apache.org/jira/browse/SPARK-10760 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Narine Kokhlikyan >Assignee: Narine Kokhlikyan >Priority: Minor > Fix For: 1.6.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Hi everyone, > Since the family argument is required for the glm function, the execution of: > model <- glm(Sepal_Length ~ Sepal_Width, df) > is failing. > I've fixed the documentation by adding the family argument and also added the > summary(model) which will show the coefficients for the model. > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10760) SparkR glm: the documentation in examples - family argument is missing
[ https://issues.apache.org/jira/browse/SPARK-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-10760. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8870 [https://github.com/apache/spark/pull/8870] > SparkR glm: the documentation in examples - family argument is missing > -- > > Key: SPARK-10760 > URL: https://issues.apache.org/jira/browse/SPARK-10760 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > Fix For: 1.6.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Hi everyone, > Since the family argument is required for the glm function, the execution of: > model <- glm(Sepal_Length ~ Sepal_Width, df) > is failing. > I've fixed the documentation by adding the family argument and also added the > summary(model) which will show the coefficients for the model. > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908455#comment-14908455 ] Seth Hendrickson commented on SPARK-7129: - A couple of quick comments I have: * The design doc implies that we will have several different boosting predictors, whereas I initially thought this JIRA called for a single generic boosting predictor. So it seems like we'll have {{AdaBoostClassifier}}, {{LogitBoostClassifier}}, {{GradientBoostClassifier}} all separate boosting implementations instead of a single {{BoostedClassifier}} implementation that has a param like {{setAlgo("AdaBoost")}}. Personally think that a single generic implementation doesn't make as much sense, and so I like the separation of different algorithms better, but I wanted to clarify. * What are the base learners in the design doc? It looks like you propose to create a new {{Learner}} class. How will that interact with existing predictors? * I think {{AdaBoostClassifier}} is better than {{SAMMEClassifier}} since it is the classification analogy of {{AdaBoostRegressor}}, plus we'll keep in line with the sci-kit api. * Is {{setNumberOfBaseLearners}} equivalent to setting the number of boosting iterations? I ask because in R mboost package, they accept a set of P candidate base learners where, at each boosting iteration, they train each one and select only the "best" base learner. If this were the case, we would want to allow the user to specify multiple base learners. It seems as if we will not be doing that under the proposed architecture. Just want to clarify > Add generic boosting algorithm to spark.ml > -- > > Key: SPARK-7129 > URL: https://issues.apache.org/jira/browse/SPARK-7129 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Boosting algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of boosting which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > In particular, it will be important to think about supporting: > * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.) > * multiclass variants > * multilabel variants (which will probably be in a separate class and JIRA) > * For more esoteric variants, we should consider them but not design too much > around them: totally corrective boosting, cascaded models > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
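To make the first point concrete, here is a purely hypothetical sketch of the "separate estimators" style being discussed. None of the boosting classes or params below exist in spark.ml; only DecisionTreeClassifier is real, and trainingData is assumed.
{code}
// Hypothetical API sketch only: AdaBoostClassifier, setBaseLearner and
// setNumBaseLearners are not real spark.ml classes/params.
import org.apache.spark.ml.classification.DecisionTreeClassifier

val base = new DecisionTreeClassifier().setMaxDepth(3)
val ada = new AdaBoostClassifier()      // hypothetical estimator
  .setBaseLearner(base)                 // hypothetical param
  .setNumBaseLearners(50)               // i.e. the number of boosting iterations
val model = ada.fit(trainingData)       // trainingData: DataFrame, assumed
{code}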
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908448#comment-14908448 ] Joseph K. Bradley commented on SPARK-10791: --- Oh, OK, I'll comment there as needed. Thanks > Optimize MLlib LDA topic distribution query performance > --- > > Key: SPARK-10791 > URL: https://issues.apache.org/jira/browse/SPARK-10791 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 > Environment: Ubuntu 13.10, Oracle Java 8 >Reporter: Marko Asplund > > I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size > and ~3.4 M documents using EMLDAOptimizer. > Training the model took ~2.5 hours with MLlib, whereas with Vowpal Wabbit > training with the same data and on the same system set took ~5 minutes. > Loading the persisted model from disk (~2 minutes), as well as querying LDA > model topic distributions (~4 seconds for one document) are also quite slow > operations. > Our application is querying LDA model topic distribution (for one doc at a > time) as part of end-user operation execution flow, so a ~4 second execution > time is very problematic. > The log includes the following message, which AFAIK, should mean that > netlib-java is using machine optimised native implementation: > "com.github.fommil.jni.JniLoader - successfully loaded > /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so" > My test code can be found here: > https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57 > I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable > change in training performance. Model loading time was reduced to ~ 5 seconds > from ~ 2 minutes (now persisted as LocalLDAModel). However, query / > prediction time was unchanged. > Unfortunately, this is the critical performance characteristic in our case. > I did some profiling for my LDA prototype code that requests topic > distributions from a model. According to Java Mission Control more than 80 % > of execution time during sample interval is spent in the following methods: > - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07% > - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91% > - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; > 6.98% > - java.lang.Double.valueOf(double); count: 31; 4.33% > Is there any way of using the API more optimally? > Are there any opportunities for optimising the "topicDistributions" code > path in MLlib? > My query test code looks like this essentially: > // executed once > val model = LocalLDAModel.load(ctx, ModelFileName) > // executed four times > val samples = Transformers.toSparseVectors(vocabularySize, > ctx.parallelize(Seq(input))) // fast > model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this > seems to take about 4 seconds to execute -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10824) DataFrame show method - show(df) should show first N number of rows, similar to R
[ https://issues.apache.org/jira/browse/SPARK-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908431#comment-14908431 ] Narine Kokhlikyan commented on SPARK-10824: --- Thanks Shivaram, I see! I'll look at it and watch the jira. > DataFrame show method - show(df) should show first N number of rows, similar > to R > - > > Key: SPARK-10824 > URL: https://issues.apache.org/jira/browse/SPARK-10824 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Hi everyone, > currently, show(dataframe) method shows some information about the columns > and their datatypes, however R shows the first N number of rows in dataframe. > Basically, the same as showDF. Right now I changed so that show calls showDF. > Also, the default number of rows was hard coded in DataFrame.R, I set it as > environment variable in sparkR.R. We can change it if you have other better > suggestions. > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9103) Tracking spark's memory usage
[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908404#comment-14908404 ] Zhang, Liye edited comment on SPARK-9103 at 9/25/15 5:56 PM: - Hi [~irashid], thanks for reviewing the doc. {quote} 1) Will the proposed design cover SPARK-9111, getting the memory when the executor dies abnormally, (esp when killed by yarn)? It seems to me the answer is "no", which is fine, that can be tackled separately, I just wanted to clarify. {quote} You are right, the answer is "no". This design is for phase 1, we can move it on later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111]. {quote} I see the complexity of having overlapping stages, but I wonder if it could be simplified somewhat. It seems to me you just need to maintain a executorToLatestMetrics: Map[executor, metrics], and then on every stage complete, you just log them all? {quote} Since we want to reduce the number of events to log, I didn't find a way to simplify this for overlapping stages. And in the current implementation, we log all the ExectorMetrics of all the executors when executor complete. I think this can be simplified by only log ExecutorMetrics of executors that is related to the stage instead of all the executors. This will reduce a lot of events to log if there are many stages running on different executors. {quote} but it seems like there is more state & a bit more logging going on {quote} I don't quite understand, what do you mean about "*more state and more logging going on*", can you explain it further? {quote} I don't fully understand why you need to log both "CHB1" and "HB3" in your example. {quote} That is because the "CHB1" is the combined event, and "HB3" is the real event, we have to log "HB3" because there might be no heartbeat received for the stage that after "HB3" (just like stage2 in figure-1 described in the doc). And for that stage, it will use "HB3" instead of "CHB1" because "CHB1" is not the correct event it should refer to. was (Author: liyezhang556520): Hi @Imran Rashid, thanks for reviewing the doc. {quote} 1) Will the proposed design cover SPARK-9111, getting the memory when the executor dies abnormally, (esp when killed by yarn)? It seems to me the answer is "no", which is fine, that can be tackled separately, I just wanted to clarify. {quote} You are right, the answer is "no". This design is for phase 1, we can move it on later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111]. {quote} I see the complexity of having overlapping stages, but I wonder if it could be simplified somewhat. It seems to me you just need to maintain a executorToLatestMetrics: Map[executor, metrics], and then on every stage complete, you just log them all? {quote} Since we want to reduce the number of events to log, I didn't find a way to simplify this for overlapping stages. And in the current implementation, we log all the ExectorMetrics of all the executors when executor complete. I think this can be simplified by only log ExecutorMetrics of executors that is related to the stage instead of all the executors. This will reduce a lot of events to log if there are many stages running on different executors. {quote} but it seems like there is more state & a bit more logging going on {quote} I don't quite understand, what do you mean about "*more state and more logging going on*", can you explain it further? {quote} I don't fully understand why you need to log both "CHB1" and "HB3" in your example. 
{quote} That is because the "CHB1" is the combined event, and "HB3" is the real event, we have to log "HB3" because there might be no heartbeat received for the stage that after "HB3" (just like stage2 in figure-1 described in the doc). And for that stage, it will use "HB3" instead of "CHB1" because "CHB1" is not the correct event it should refer to. > Tracking spark's memory usage > - > > Key: SPARK-9103 > URL: https://issues.apache.org/jira/browse/SPARK-9103 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Reporter: Zhang, Liye > Attachments: Tracking Spark Memory Usage - Phase 1.pdf > > > Currently spark only provides little memory usage information (RDD cache on > webUI) for the executors. User have no idea on what is the memory consumption > when they are running spark applications with a lot of memory used in spark > executors. Especially when they encounter the OOM, it’s really hard to know > what is the cause of the problem. So it would be helpful to give out the > detail memory consumption information for each part of spark, so that user > can clearly have a picture of where the memory is exactly used. > The memory usage info to expose should include but not
[jira] [Commented] (SPARK-9103) Tracking spark's memory usage
[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908404#comment-14908404 ] Zhang, Liye commented on SPARK-9103: Hi @Imran Rashid, thanks for reviewing the doc. {quote} 1) Will the proposed design cover SPARK-9111, getting the memory when the executor dies abnormally, (esp when killed by yarn)? It seems to me the answer is "no", which is fine, that can be tackled separately, I just wanted to clarify. {quote} You are right, the answer is "no". This design is for phase 1, we can move it on later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111]. {quote} I see the complexity of having overlapping stages, but I wonder if it could be simplified somewhat. It seems to me you just need to maintain a executorToLatestMetrics: Map[executor, metrics], and then on every stage complete, you just log them all? {quote} Since we want to reduce the number of events to log, I didn't find a way to simplify this for overlapping stages. And in the current implementation, we log all the ExectorMetrics of all the executors when executor complete. I think this can be simplified by only log ExecutorMetrics of executors that is related to the stage instead of all the executors. This will reduce a lot of events to log if there are many stages running on different executors. {quote} but it seems like there is more state & a bit more logging going on {quote} I don't quite understand, what do you mean about "*more state and more logging going on*", can you explain it further? {quote} I don't fully understand why you need to log both "CHB1" and "HB3" in your example. {quote} That is because the "CHB1" is the combined event, and "HB3" is the real event, we have to log "HB3" because there might be no heartbeat received for the stage that after "HB3" (just like stage2 in figure-1 described in the doc). And for that stage, it will use "HB3" instead of "CHB1" because "CHB1" is not the correct event it should refer to. > Tracking spark's memory usage > - > > Key: SPARK-9103 > URL: https://issues.apache.org/jira/browse/SPARK-9103 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Reporter: Zhang, Liye > Attachments: Tracking Spark Memory Usage - Phase 1.pdf > > > Currently spark only provides little memory usage information (RDD cache on > webUI) for the executors. User have no idea on what is the memory consumption > when they are running spark applications with a lot of memory used in spark > executors. Especially when they encounter the OOM, it’s really hard to know > what is the cause of the problem. So it would be helpful to give out the > detail memory consumption information for each part of spark, so that user > can clearly have a picture of where the memory is exactly used. > The memory usage info to expose should include but not limited to shuffle, > cache, network, serializer, etc. > User can optionally choose to open this functionality since this is mainly > for debugging and tuning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
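To make the executorToLatestMetrics idea above concrete, here is a minimal, self-contained sketch (not Spark's actual listener code) of keeping the latest heartbeat per executor and flushing the map whenever a stage completes. As the reply explains, overlapping stages are what break this simplification, because each stage needs the peak within its own window rather than just the latest sample.
{code}
// Sketch only: illustrative types, not Spark's real ExecutorMetrics.
case class ExecutorMetrics(executorId: String, peakHeapUsedBytes: Long)

val executorToLatestMetrics =
  scala.collection.mutable.Map.empty[String, ExecutorMetrics]

// On every heartbeat, remember the latest sample for that executor.
def onHeartbeat(m: ExecutorMetrics): Unit =
  executorToLatestMetrics(m.executorId) = m

// On stage completion, log everything currently known.
def onStageCompleted(logEvent: ExecutorMetrics => Unit): Unit =
  executorToLatestMetrics.values.foreach(logEvent)
{code}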
[jira] [Commented] (SPARK-10828) Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is there any other way to proceed with it to create RRDD from a source RDD other than text RDD?Or to use any other format of data stored in HDFS in sparkR?
[ https://issues.apache.org/jira/browse/SPARK-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908399#comment-14908399 ] Shivaram Venkataraman commented on SPARK-10828: --- I don't think we want to support new ways to read in HDFS formats into SparkR -- IMHO The DataSource API is the right way to solve this problem as its well established now and works across Python, Scala, R etc. You can check with the Accumulo project to see if they have plans to add a DataSource implementation. Also the DataSource implementation does not need live in the Spark source tree (See http://github.com/databricks/spark-avro for an example), so we don't need a JIRA in Spark to track this. > Can we use the accumulo data RDD created from JAVA in spark, in sparkR?Is > there any other way to proceed with it to create RRDD from a source RDD other > than text RDD?Or to use any other format of data stored in HDFS in sparkR? > -- > > Key: SPARK-10828 > URL: https://issues.apache.org/jira/browse/SPARK-10828 > Project: Spark > Issue Type: Question > Components: R >Affects Versions: 1.5.0 > Environment: ubuntu 12.04,8GB RAM,accumulo 1.6.3,hadoop 2.6 >Reporter: madhvi gupta > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
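For comparison, this is roughly what the suggested Data Source route looks like from Scala, using the spark-avro package mentioned above as the stand-in (an Accumulo data source, if one existed, would be used the same way; the format name and path below are assumptions). SparkR exposes the same mechanism through read.df.
{code}
// Loading through an external data source package; format and path are
// illustrative assumptions.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///data/records.avro")

df.registerTempTable("records")
{code}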
[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908395#comment-14908395 ] Meihua Wu commented on SPARK-7129: -- [~josephkb] [~sethah] I have compile a doc for AdaBoost. https://docs.google.com/document/d/1Neo5_6po9ap7dZuT3fwT6ptJa_XvkUUdRgCqB51lcy4/edit#heading=h.d4mq6f37je6x Thank you very much for reviewing them. I am look forward to your comments. > Add generic boosting algorithm to spark.ml > -- > > Key: SPARK-7129 > URL: https://issues.apache.org/jira/browse/SPARK-7129 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Boosting algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of boosting which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > In particular, it will be important to think about supporting: > * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.) > * multiclass variants > * multilabel variants (which will probably be in a separate class and JIRA) > * For more esoteric variants, we should consider them but not design too much > around them: totally corrective boosting, cascaded models > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908389#comment-14908389 ] Shivaram Venkataraman commented on SPARK-7736: -- [~ztoth] Could you open a new JIRA for the SparkR problem ? > Exception not failing Python applications (in yarn cluster mode) > > > Key: SPARK-7736 > URL: https://issues.apache.org/jira/browse/SPARK-7736 > Project: Spark > Issue Type: Bug > Components: YARN > Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 >Reporter: Shay Rojansky >Assignee: Marcelo Vanzin > Fix For: 1.5.1, 1.6.0 > > > It seems that exceptions thrown in Python spark apps after the SparkContext > is instantiated don't cause the application to fail, at least in Yarn: the > application is marked as SUCCEEDED. > Note that any exception right before the SparkContext correctly places the > application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10824) DataFrame show method - show(df) should show first N number of rows, similar to R
[ https://issues.apache.org/jira/browse/SPARK-10824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908380#comment-14908380 ] Shivaram Venkataraman commented on SPARK-10824: --- [~Narine] We discussed this in https://issues.apache.org/jira/browse/SPARK-9317 and specifically in the github PR https://github.com/apache/spark/pull/8360#issuecomment-133516179 As described in the PR comment this needs a more involved change on the SQL side to see if the data frame is cheap to print or as we don't want to trigger expensive computation in this case. cc [~rxin] > DataFrame show method - show(df) should show first N number of rows, similar > to R > - > > Key: SPARK-10824 > URL: https://issues.apache.org/jira/browse/SPARK-10824 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Hi everyone, > currently, show(dataframe) method shows some information about the columns > and their datatypes, however R shows the first N number of rows in dataframe. > Basically, the same as showDF. Right now I changed so that show calls showDF. > Also, the default number of rows was hard coded in DataFrame.R, I set it as > environment variable in sparkR.R. We can change it if you have other better > suggestions. > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10835) Change Output of NGram to Array(String, True)
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumit Chawla updated SPARK-10835: - Description: Currently output type of NGram is Array(String, false), which is not compatible with LDA since their input type is Array(String, true). was: Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default. I'm also thinking for Nullable columns, maybe tokenizer should return Array(null) for null value in the input. > Change Output of NGram to Array(String, True) > - > > Key: SPARK-10835 > URL: https://issues.apache.org/jira/browse/SPARK-10835 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Sumit Chawla >Assignee: yuhao yang >Priority: Minor > Fix For: 1.5.0 > > > Currently output type of NGram is Array(String, false), which is not > compatible with LDA since their input type is Array(String, true). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
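In Spark SQL terms, the Array(String, false) / Array(String, true) notation above corresponds to ArrayType's containsNull flag, which is what makes the current NGram output schema incompatible with LDA's expected input. A small sketch:
{code}
import org.apache.spark.sql.types.{ArrayType, StringType}

// What NGram currently declares as its output type ...
val ngramOutputType = ArrayType(StringType, containsNull = false)
// ... and what LDA expects as its input type.
val ldaInputType = ArrayType(StringType, containsNull = true)

println(ngramOutputType == ldaInputType)  // false, hence the schema mismatch
{code}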
[jira] [Created] (SPARK-10835) Change Output of NGram to Array(String, True)
Sumit Chawla created SPARK-10835: Summary: Change Output of NGram to Array(String, True) Key: SPARK-10835 URL: https://issues.apache.org/jira/browse/SPARK-10835 Project: Spark Issue Type: Improvement Components: ML Reporter: Sumit Chawla Assignee: yuhao yang Priority: Minor Fix For: 1.5.0 Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default. I'm also thinking for Nullable columns, maybe tokenizer should return Array(null) for null value in the input. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6281) Support incremental updates for Graph
[ https://issues.apache.org/jira/browse/SPARK-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908349#comment-14908349 ] Rohit commented on SPARK-6281: -- Hi, I see the issue is resolved with resolution "Won't Fix". I am currently using GraphX and I also have this requirement of updating the graph incrementally, especially adding new edges and deleting old edges. Is it possible somehow in the current version? As I understand it, currently this could only be done by taking a union of the new edge RDDs with the graph's edge RDD and creating the graph again, which does not seem very efficient since the new edges arrive in a stream at quite frequent intervals. Are there any plans to support this in the future? > Support incremental updates for Graph > - > > Key: SPARK-6281 > URL: https://issues.apache.org/jira/browse/SPARK-6281 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Takeshi Yamamuro >Priority: Minor > > Add api to efficiently append new vertices and edges into existing Graph, > e.g., Graph#append(newVerts: RDD[(VertexId, VD)], newEdges: RDD[Edge[ED]], > defaultVertexAttr: VD) > This is useful for time-evolving graphs; new vertices and edges are built from > streaming data thru Spark Streaming, and then incrementally appended > into an existing graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
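For reference, the union-and-rebuild approach mentioned in the comment looks roughly like the sketch below with the current GraphX API; graph, newVertices and newEdges are assumed to exist, and the whole graph is re-partitioned and re-indexed on every call, which is exactly the inefficiency being pointed out.
{code}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Rebuild the graph from the union of old and new vertices/edges.
def appendToGraph(
    graph: Graph[String, Int],
    newVertices: RDD[(VertexId, String)],
    newEdges: RDD[Edge[Int]]): Graph[String, Int] =
  Graph(
    graph.vertices.union(newVertices),
    graph.edges.union(newEdges),
    defaultVertexAttr = "missing")
{code}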
[jira] [Commented] (SPARK-9103) Tracking spark's memory usage
[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908309#comment-14908309 ] Imran Rashid commented on SPARK-9103: - Hi [~liyezhang556520], thanks for posting the design doc. Looks good, just a couple of questions. 1) Will the proposed design cover SPARK-9111, getting the memory when the executor dies abnormally, (esp when killed by yarn)? It seems to me the answer is "no", which is fine, that can be tackled separately, I just wanted to clarify. 2) I see the complexity of having overlapping stages, but I wonder if it could be simplified somewhat. It seems to me you just need to maintain a {{executorToLatestMetrics: Map[executor, metrics]}}, and then on every stage complete, you just log them all? Maybe this is what you are already describing in the doc, but it seems like there is more state & a bit more logging going on. Eg., I don't fully understand why you need to log both "CHB1" and "HB3" in your example. thanks > Tracking spark's memory usage > - > > Key: SPARK-9103 > URL: https://issues.apache.org/jira/browse/SPARK-9103 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Reporter: Zhang, Liye > Attachments: Tracking Spark Memory Usage - Phase 1.pdf > > > Currently spark only provides little memory usage information (RDD cache on > webUI) for the executors. User have no idea on what is the memory consumption > when they are running spark applications with a lot of memory used in spark > executors. Especially when they encounter the OOM, it’s really hard to know > what is the cause of the problem. So it would be helpful to give out the > detail memory consumption information for each part of spark, so that user > can clearly have a picture of where the memory is exactly used. > The memory usage info to expose should include but not limited to shuffle, > cache, network, serializer, etc. > User can optionally choose to open this functionality since this is mainly > for debugging and tuning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10804) "LOCAL" in LOAD DATA LOCAL INPATH means "remote"
[ https://issues.apache.org/jira/browse/SPARK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908296#comment-14908296 ] Antonio Piccolboni commented on SPARK-10804: Good suggestion SPARK-10834 > "LOCAL" in LOAD DATA LOCAL INPATH means "remote" > > > Key: SPARK-10804 > URL: https://issues.apache.org/jira/browse/SPARK-10804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Antonio Piccolboni > > Connecting with a remote thriftserver with a custom JDBC client or beeline, > load data local inpath fails. Hiveserver2 docs explain in a quick comment > that local now means local to the server. I think this is just a > rationalization for a bug. When a user types "local" > # it needs to be local to him, not some server > # Failing 1., one needs to have a way to determine what local means and > create a "local" item under the new definition. > With the thirftserver, I have a host to connect to, but I don't have any way > to create a file local to that host, at least in spark. It may not be > desirable to create user directories on the thriftserver host or running file > transfer services like scp. Moreover, it appears that this syntax is unique > to Hive and Spark but its origin can be traced to LOAD DATA LOCAL INFILE in > Oracle and was adopted by mysql. In the latter docs we can read "If LOCAL is > specified, the file is read by the client program on the client host and sent > to the server. The file can be given as a full path name to specify its exact > location. If given as a relative path name, the name is interpreted relative > to the directory in which the client program was started". This is not to say > that the spark or hive teams are bound to what Oracle and Mysql do, but to > support the idea that the meaning of LOCAL is settled. For instance, the > Impala documentation says: "Currently, the Impala LOAD DATA statement only > imports files from HDFS, not from the local filesystem. It does not support > the LOCAL keyword of the Hive LOAD DATA statement." I think this is a better > solution. The way things are in thriftserver, I developed a client under the > assumption that I could use LOAD DATA LOCAL INPATH and all tests where > passing in standalone mode, only to find with the first distributed test that > # LOCAL means "local to server", a.k.a. "remote" > # INSERT INTO ... VALUES is not supported > # There is really no workaround unless one assumes access what data store > spark is running against , like HDFS, and that the user can upload data to > it. > In the space of workarounds it is not terrible, but if you are trying to > write a self-contained spark package, that's a defeat and makes writing tests > particularly hard. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
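The workaround hinted at in the last paragraph, assuming the client does have access to the cluster's HDFS, is to skip LOCAL entirely: copy the file up first and then load it by its HDFS path. The paths and table name below are made-up examples, and the same statement can be issued through a HiveContext as shown or over the JDBC connection.
{code}
// Run on the client first (shown as a comment; any HDFS upload mechanism works):
//   hadoop fs -put /tmp/people.json /user/me/people.json

// Then issue a plain (non-LOCAL) LOAD DATA, whose path is resolved on the cluster.
hiveContext.sql(
  "LOAD DATA INPATH '/user/me/people.json' INTO TABLE people_staging")
{code}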
[jira] [Created] (SPARK-10834) SPARK SQL doesn't support INSERT INTO ... VALUES
Antonio Piccolboni created SPARK-10834: -- Summary: SPARK SQL doesn't support INSERT INTO ... VALUES Key: SPARK-10834 URL: https://issues.apache.org/jira/browse/SPARK-10834 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Antonio Piccolboni I guess the Summary has most of it. I am testing from a custom JDBC client, but others have run into this. Because of the way the thrift server is written, I think this happens in a HiveContext. Surprisingly though, Hive server 2 does support this syntax, at least when tested in the HDP2 sandbox at default settings. This issue was created as a fork of the discussion in SPARK-10804. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10561) Provide tooling for auto-generating Spark SQL reference manual
[ https://issues.apache.org/jira/browse/SPARK-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-10561: --- Description: Here is the discussion thread: http://search-hadoop.com/m/q3RTtcD20F1o62xE Richard Hillegas made the following suggestion: A machine-generated BNF, however, is easy to imagine. But perhaps not so easy to implement. Spark's SQL grammar is implemented in Scala, extending the DSL support provided by the Scala language. I am new to programming in Scala, so I don't know whether the Scala ecosystem provides any good tools for reverse-engineering a BNF from a class which extends scala.util.parsing.combinator.syntactical.StandardTokenParsers. was: Here is the discussion thread: http://search-hadoop.com/m/q3RTtcD20F1o62xE Richard Hillegas made the following suggestion: A machine-generated BNF, however, is easy to imagine. But perhaps not so easy to implement. Spark's SQL grammar is implemented in Scala, extending the DSL support provided by the Scala language. I am new to programming in Scala, so I don't know whether the Scala ecosystem provides any good tools for reverse-engineering a BNF from a class which extends scala.util.parsing.combinator.syntactical.StandardTokenParsers. > Provide tooling for auto-generating Spark SQL reference manual > -- > > Key: SPARK-10561 > URL: https://issues.apache.org/jira/browse/SPARK-10561 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Reporter: Ted Yu > > Here is the discussion thread: > http://search-hadoop.com/m/q3RTtcD20F1o62xE > Richard Hillegas made the following suggestion: > A machine-generated BNF, however, is easy to imagine. But perhaps not so easy > to implement. Spark's SQL grammar is implemented in Scala, extending the DSL > support provided by the Scala language. I am new to programming in Scala, so > I don't know whether the Scala ecosystem provides any good tools for > reverse-engineering a BNF from a class which extends > scala.util.parsing.combinator.syntactical.StandardTokenParsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10808) LDA user guide: discuss running time of LDA
[ https://issues.apache.org/jira/browse/SPARK-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908198#comment-14908198 ] Mohamed Baddar commented on SPARK-10808: Thanks [~josephkb] , working on it > LDA user guide: discuss running time of LDA > --- > > Key: SPARK-10808 > URL: https://issues.apache.org/jira/browse/SPARK-10808 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Based on feedback like [SPARK-10791], we should discuss the computational and > communication complexity of LDA and its optimizers in the MLlib Programming > Guide. E.g.: > * Online LDA can be faster than EM. > * To make online LDA run faster, you can use a smaller miniBatchFraction. > * Communication > ** For EM, communication on each iteration is on the order of # topics * > (vocabSize + # docs). > ** For online LDA, communication on each iteration is on the order of # > topics * vocabSize. > * Decreasing vocabSize and # topics can speed things up. It's often fine to > eliminate uncommon words, unless you are trying to create a very large number > of topics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10296) add preservesPartitioning parameter to RDD.map
[ https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908078#comment-14908078 ] Esteban Donato commented on SPARK-10296: Re-establishing activity on this thread: I'm not sure what the verdict is on this proposal. If you prefer, I can create a pull request with this change and we can then decide what to do based on it. > add preservesPartitioning parameter to RDD.map > - > > Key: SPARK-10296 > URL: https://issues.apache.org/jira/browse/SPARK-10296 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Esteban Donato >Priority: Minor > > It would be nice to add the Boolean parameter preservesPartitioning with > default false to the RDD.map method, just as it exists in the RDD.mapPartitions method. > If you agree I can submit a pull request with this enhancement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
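Until such a parameter exists on RDD.map, the effect described above can already be obtained through RDD.mapPartitions, which does take a preservesPartitioning flag. A minimal sketch, assuming an existing SparkContext sc and a map function that leaves the keys untouched:
{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(4))

// Only the values change, so it is safe to keep the existing partitioner.
val mapped = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v.toUpperCase) },
  preservesPartitioning = true)

assert(mapped.partitioner == pairs.partitioner)
{code}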
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908050#comment-14908050 ] Chris Heller commented on SPARK-8734: - I pushed up to my branch some code for the parameters. Though its untested at the moment ... tried to rebase the code to master and am now getting build errors. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE
[ https://issues.apache.org/jira/browse/SPARK-10833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10833: Assignee: Sean Owen (was: Apache Spark) > Inline, organize BSD/MIT licenses in LICENSE > > > Key: SPARK-10833 > URL: https://issues.apache.org/jira/browse/SPARK-10833 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen > > In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to > light that the guidance at > http://www.apache.org/dev/licensing-howto.html#permissive-deps means that > permissively-licensed dependencies has a different interpretation than we > (er, I) had been operating under. "pointer ... to the license within the > source tree" specifically means a copy of the license within Spark's > distribution, whereas at the moment, Spark's LICENSE has a pointer to the > project's license in the *other project's* source tree. > The remedy is simply to inline all such license references (i.e. BSD/MIT > licenses) or include their text in "licenses" subdirectory and point to that. > Along the way, we can also treat other BSD/MIT licenses, whose text has been > inlined into LICENSE, in the same way. > The LICENSE file can continue to provide a helpful list of BSD/MIT licensed > projects and a pointer to their sites. This would be over and above including > license text in the distro, which is the essential thing. > I do not think this blocks a current release, since there's a good-faith > argument that the current practice satisfies the terms of the third-party > licenses as well. (If it didn't, this would be a blocker for any further > release.) However, of course it's better to follow the best practice going > forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE
[ https://issues.apache.org/jira/browse/SPARK-10833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908038#comment-14908038 ] Apache Spark commented on SPARK-10833: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/8919 > Inline, organize BSD/MIT licenses in LICENSE > > > Key: SPARK-10833 > URL: https://issues.apache.org/jira/browse/SPARK-10833 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Sean Owen > > In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to > light that the guidance at > http://www.apache.org/dev/licensing-howto.html#permissive-deps means that > permissively-licensed dependencies has a different interpretation than we > (er, I) had been operating under. "pointer ... to the license within the > source tree" specifically means a copy of the license within Spark's > distribution, whereas at the moment, Spark's LICENSE has a pointer to the > project's license in the *other project's* source tree. > The remedy is simply to inline all such license references (i.e. BSD/MIT > licenses) or include their text in "licenses" subdirectory and point to that. > Along the way, we can also treat other BSD/MIT licenses, whose text has been > inlined into LICENSE, in the same way. > The LICENSE file can continue to provide a helpful list of BSD/MIT licensed > projects and a pointer to their sites. This would be over and above including > license text in the distro, which is the essential thing. > I do not think this blocks a current release, since there's a good-faith > argument that the current practice satisfies the terms of the third-party > licenses as well. (If it didn't, this would be a blocker for any further > release.) However, of course it's better to follow the best practice going > forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE
[ https://issues.apache.org/jira/browse/SPARK-10833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10833: Assignee: Apache Spark (was: Sean Owen) > Inline, organize BSD/MIT licenses in LICENSE > > > Key: SPARK-10833 > URL: https://issues.apache.org/jira/browse/SPARK-10833 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.5.0 >Reporter: Sean Owen >Assignee: Apache Spark > > In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to > light that the guidance at > http://www.apache.org/dev/licensing-howto.html#permissive-deps means that > permissively-licensed dependencies has a different interpretation than we > (er, I) had been operating under. "pointer ... to the license within the > source tree" specifically means a copy of the license within Spark's > distribution, whereas at the moment, Spark's LICENSE has a pointer to the > project's license in the *other project's* source tree. > The remedy is simply to inline all such license references (i.e. BSD/MIT > licenses) or include their text in "licenses" subdirectory and point to that. > Along the way, we can also treat other BSD/MIT licenses, whose text has been > inlined into LICENSE, in the same way. > The LICENSE file can continue to provide a helpful list of BSD/MIT licensed > projects and a pointer to their sites. This would be over and above including > license text in the distro, which is the essential thing. > I do not think this blocks a current release, since there's a good-faith > argument that the current practice satisfies the terms of the third-party > licenses as well. (If it didn't, this would be a blocker for any further > release.) However, of course it's better to follow the best practice going > forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10833) Inline, organize BSD/MIT licenses in LICENSE
Sean Owen created SPARK-10833: - Summary: Inline, organize BSD/MIT licenses in LICENSE Key: SPARK-10833 URL: https://issues.apache.org/jira/browse/SPARK-10833 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.5.0 Reporter: Sean Owen Assignee: Sean Owen In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps means that permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the *other project's* source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. I do not think this blocks a current release, since there's a good-faith argument that the current practice satisfies the terms of the third-party licenses as well. (If it didn't, this would be a blocker for any further release.) However, of course it's better to follow the best practice going forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9941) Try ML pipeline API on Kaggle competitions
[ https://issues.apache.org/jira/browse/SPARK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908036#comment-14908036 ] Kristina Plazonic commented on SPARK-9941: -- I would love to do Avito Context Ad Clicks - https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of feature engineering and preprocessing. I would love to split this with somebody else if anybody is interested on working with this. Thanks! Kristina > Try ML pipeline API on Kaggle competitions > -- > > Key: SPARK-9941 > URL: https://issues.apache.org/jira/browse/SPARK-9941 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This is an umbrella JIRA to track some fun tasks :) > We have built many features under the ML pipeline API, and we want to see how > it works on real-world datasets, e.g., Kaggle competition datasets > (https://www.kaggle.com/competitions). We want to invite community members to > help test. The goal is NOT to win the competitions but to provide code > examples and to find out missing features and other issues to help shape the > roadmap. > For people who are interested, please do the following: > 1. Create a subtask (or leave a comment if you cannot create a subtask) to > claim a Kaggle dataset. > 2. Use the ML pipeline API to build and tune an ML pipeline that works for > the Kaggle dataset. > 3. Paste the code to gist (https://gist.github.com/) and provide the link > here. > 4. Report missing features, issues, running times, and accuracy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908001#comment-14908001 ] Zsolt Tóth commented on SPARK-7736: --- As I see, this is also a problem for SparkR applications in yarn-cluster mode. Is there an open JIRA for that? > Exception not failing Python applications (in yarn cluster mode) > > > Key: SPARK-7736 > URL: https://issues.apache.org/jira/browse/SPARK-7736 > Project: Spark > Issue Type: Bug > Components: YARN > Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 >Reporter: Shay Rojansky >Assignee: Marcelo Vanzin > Fix For: 1.5.1, 1.6.0 > > > It seems that exceptions thrown in Python spark apps after the SparkContext > is instantiated don't cause the application to fail, at least in Yarn: the > application is marked as SUCCEEDED. > Note that any exception right before the SparkContext correctly places the > application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907986#comment-14907986 ] Sean Owen commented on SPARK-10390: --- My guess is it ends up building in a different Guava dependency when built via SBT? I'm still not entirely sure. I do know the dependency resolution rules are different and that's why only the Maven build 'counts'. I'd try Maven, anyway, just to see if it works. If not then we know this guess isn't correct. > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export 
PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
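For reference, the Maven counterpart of the sbt command quoted in the description would be roughly {{build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package}}; if an assembly built that way no longer hits the Guava {{Stopwatch}} error, that would support the theory that the SBT build resolves a different Guava version.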
[jira] [Commented] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example
[ https://issues.apache.org/jira/browse/SPARK-10832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907984#comment-14907984 ] Ricky Yang commented on SPARK-10832:
15/09/25 19:00:09 INFO Master: Registering app JavaSparkSQL
15/09/25 19:00:09 INFO Master: Registered app JavaSparkSQL with ID app-20150925190009-0242
15/09/25 19:00:09 INFO Master: Launching executor app-20150925190009-0242/0 on worker worker-20150923201210-10.27.1.142-8079
15/09/25 19:00:09 INFO Master: Launching executor app-20150925190009-0242/1 on worker worker-20150923201210-10.27.1.138-8079
15/09/25 19:00:11 INFO Master: akka.tcp://driverClient@10.27.1.143:47123 got disassociated, removing it.
15/09/25 19:00:11 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient@10.27.1.143:47123] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/09/25 19:00:11 INFO Master: akka.tcp://driverClient@10.27.1.143:47123 got disassociated, removing it.
15/09/25 19:00:14 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/09/25 19:00:14 INFO Master: Launching driver driver-20150925190014-0205 on worker worker-20150923201210-10.27.1.143-8079
15/09/25 19:00:17 INFO Master: Registering app JavaSparkPi
15/09/25 19:00:17 INFO Master: Registered app JavaSparkPi with ID app-20150925190017-0243
15/09/25 19:00:17 INFO Master: Launching executor app-20150925190017-0243/0 on worker worker-20150923201210-10.27.1.142-8079
15/09/25 19:00:17 INFO Master: Launching executor app-20150925190017-0243/1 on worker worker-20150923201210-10.27.1.138-8079
15/09/25 19:00:20 INFO Master: akka.tcp://driverClient@10.27.1.143:44975 got disassociated, removing it.
15/09/25 19:00:20 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient@10.27.1.143:44975] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/09/25 19:00:20 INFO Master: akka.tcp://driverClient@10.27.1.143:44975 got disassociated, removing it.
15/09/25 19:00:21 INFO Master: Received unregister request from application app-20150925190009-0242
15/09/25 19:00:21 INFO Master: Removing app app-20150925190009-0242
15/09/25 19:00:21 WARN Master: Application JavaSparkSQL is still in progress, it may be terminated abnormally.
15/09/25 19:00:21 WARN Master: No event logs found for application JavaSparkSQL in hdfs://SuningHadoop2/sparklogs/sparklogshistorylog.
15/09/25 19:00:21 INFO Master: akka.tcp://sparkDriver@10.27.1.143:57388 got disassociated, removing it.
15/09/25 19:00:22 WARN Master: Got status update for unknown executor app-20150925190009-0242/1
15/09/25 19:00:22 WARN Master: Got status update for unknown executor app-20150925190009-0242/0
> sometimes No event logs found for application using same JavaSparkSQL example > -- > > Key: SPARK-10832 > URL: https://issues.apache.org/jira/browse/SPARK-10832 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.0 >Reporter: Ricky Yang > Attachments: screenshot-1-1.png, screenshot-1.png, screenshot-2.png > > > hi all, > when using the JavaSparkSQL example, the job was submitted many times as follows: > /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class > org.apache.spark.examples.sql.JavaSparkSQL > hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar > unfortunately, the completed-applications web page sometimes shows "No event logs > found for application", but the majority of runs of the same application are normal. > the wrong log pictures are screenshot-1.png and screenshot-1-1.png > and the master log is shown in screenshot-2.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
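For context, the WARN lines above show where the master is looking for event logs, which suggests event logging on this cluster is configured roughly as follows. The exact spark-defaults.conf entries are an assumption; only the HDFS path is taken from the master log:

{code}
# assumed spark-defaults.conf entries (path copied from the master log above)
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://SuningHadoop2/sparklogs/sparklogshistorylog
{code}

The preceding warning, "Application JavaSparkSQL is still in progress, it may be terminated abnormally", indicates the master removed the application while it was still marked in progress, which is consistent with a completed event log not being available for exactly those runs.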
[jira] [Created] (SPARK-10832) sometimes No event logs found for application using same JavaSparkSQL example
Ricky Yang created SPARK-10832: -- Summary: sometimes No event logs found for application using same JavaSparkSQL example Key: SPARK-10832 URL: https://issues.apache.org/jira/browse/SPARK-10832 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Ricky Yang hi all, when using the JavaSparkSQL example, the job was submitted many times as follows: /home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar unfortunately, the completed-applications web page sometimes shows "No event logs found for application", but the majority of runs of the same application are normal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance
[ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907975#comment-14907975 ] Marko Asplund commented on SPARK-10791: --- This performance issue was actually discussed on the spark mailing list. Please see full discussion here: https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser My tests were performed on a single node. > Optimize MLlib LDA topic distribution query performance > --- > > Key: SPARK-10791 > URL: https://issues.apache.org/jira/browse/SPARK-10791 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 > Environment: Ubuntu 13.10, Oracle Java 8 >Reporter: Marko Asplund > > I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size > and ~3.4 M documents using EMLDAOptimizer. > Training the model took ~2.5 hours with MLlib, whereas with Vowpal Wabbit > training with the same data and on the same system set took ~5 minutes. > Loading the persisted model from disk (~2 minutes), as well as querying LDA > model topic distributions (~4 seconds for one document) are also quite slow > operations. > Our application is querying LDA model topic distribution (for one doc at a > time) as part of end-user operation execution flow, so a ~4 second execution > time is very problematic. > The log includes the following message, which AFAIK, should mean that > netlib-java is using machine optimised native implementation: > "com.github.fommil.jni.JniLoader - successfully loaded > /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so" > My test code can be found here: > https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57 > I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable > change in training performance. Model loading time was reduced to ~ 5 seconds > from ~ 2 minutes (now persisted as LocalLDAModel). However, query / > prediction time was unchanged. > Unfortunately, this is the critical performance characteristic in our case. > I did some profiling for my LDA prototype code that requests topic > distributions from a model. According to Java Mission Control more than 80 % > of execution time during sample interval is spent in the following methods: > - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07% > - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91% > - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; > 6.98% > - java.lang.Double.valueOf(double); count: 31; 4.33% > Is there any way of using the API more optimally? > Are there any opportunities for optimising the "topicDistributions" code > path in MLlib? > My query test code looks like this essentially: > // executed once > val model = LocalLDAModel.load(ctx, ModelFileName) > // executed four times > val samples = Transformers.toSparseVectors(vocabularySize, > ctx.parallelize(Seq(input))) // fast > model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this > seems to take about 4 seconds to execute -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907966#comment-14907966 ] Martin Tapp commented on SPARK-8734: Same here, spark.mesos.executor.docker.parameter. is fine by me. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907963#comment-14907963 ] Ondřej Smola commented on SPARK-8734: - +1 spark.mesos.executor.docker.parameter. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907952#comment-14907952 ] Chris Heller commented on SPARK-8734: - Responsibilities have sort of pulled me away from focusing on this. I did manage to get the network code into my branch. I was thinking about parameters, and considered a scheme such as: spark.mesos.executor.docker.parameter.<name> = <value>. This follows from how you set environment variables on the executor. Would this scheme be reasonable? > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
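To make the proposal concrete, configuration under the suggested (not yet implemented) scheme might look like the lines below. These illustrate the proposed key format only; they are not existing Spark settings, and the docker parameter names are just examples:

{code}
# proposed scheme: one Spark conf entry per arbitrary docker long option
spark.mesos.executor.docker.parameter.memory-swap  2g
spark.mesos.executor.docker.parameter.log-driver   syslog
{code}

This would mirror how {{spark.executorEnv.[EnvironmentVariableName]}} already maps one conf entry to one executor environment variable.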
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907949#comment-14907949 ] Ondřej Smola commented on SPARK-8734: - I think parameters can be supported as comma-separated key=value pairs under spark.mesos.executor.docker.parameters; from what I can see in the Mesos source code, only long parameter names are supported. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907948#comment-14907948 ] Zoltán Zvara commented on SPARK-10390: -- What could be the problem that causes SBT to pack a wrong Guava version? Thanks! > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built 
against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907935#comment-14907935 ] Sean Owen commented on SPARK-10390: --- Yes, but as I say, it appears to work with the build of reference. Use Maven. > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against 
{{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907913#comment-14907913 ] Zoltán Zvara edited comment on SPARK-10390 at 9/25/15 11:05 AM: Problem still exists. To reproduce, simply clone the latest snapshot from GitHub, build and setup as I've wrote in the description, open iPython and issue {{sc.textFile(...).collect()}}. (Start iPython with {{sudo bin/pyspark}}) was (Author: ehnalis): Problem still exists. To reproduce, simply clone the latest snapshot from GitHub, build and setup as I've wrote in the description, open iPython and issue {{sc.textFile(...).collect()}}. > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at 
py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907913#comment-14907913 ] Zoltán Zvara edited comment on SPARK-10390 at 9/25/15 10:59 AM: Problem still exists. To reproduce, simply clone the latest snapshot from GitHub, build and setup as I've wrote in the description, open iPython and issue {{sc.textFile(...).collect()}}. was (Author: ehnalis): Problem still exists. To reproduce, simply clone the latest snapshot from GitHub, build and setup as I've wrote in the description, open iPython and issue {{sc.textFile("random.text.file").collect()}}. > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at 
java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10390) Py4JJavaError java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J
[ https://issues.apache.org/jira/browse/SPARK-10390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907913#comment-14907913 ] Zoltán Zvara commented on SPARK-10390: -- Problem still exists. To reproduce, simply clone the latest snapshot from GitHub, build and setup as I've wrote in the description, open iPython and issue {{sc.textFile("random.text.file").collect()}}. > Py4JJavaError java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > > > Key: SPARK-10390 > URL: https://issues.apache.org/jira/browse/SPARK-10390 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Zoltán Zvara > > While running PySpark through iPython. > {code} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.elapsedMillis()J > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:245) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373) > at > org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > {{spark-env.sh}} > {code} > export IPYTHON=1 > export PYSPARK_PYTHON=/usr/bin/python3 > export PYSPARK_DRIVER_PYTHON=ipython3 > export PYSPARK_DRIVER_PYTHON_OPTS="notebook" > {code} > 
Spark built with: > {{build/sbt -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 assembly --error}} > Not a problem, when built against {{Hadoop 2.4}}! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
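The {{NoSuchMethodError}} on {{Stopwatch.elapsedMillis()}} typically indicates a Guava version conflict: Hadoop 2.6's {{FileInputFormat}} still calls {{elapsedMillis()}}, which later Guava releases removed. A minimal diagnostic sketch, not part of the ticket, assuming a Scala {{spark-shell}} on the affected build; it prints which jar actually supplies the {{Stopwatch}} class at runtime:

{code}
// Diagnostic sketch: print which jar provides Guava's Stopwatch in the
// running JVM. If it is not the Guava shipped with the Spark assembly,
// a conflicting Guava version is shadowing it on the classpath.
val stopwatchSource = classOf[com.google.common.base.Stopwatch]
  .getProtectionDomain.getCodeSource
println(Option(stopwatchSource).map(_.getLocation).getOrElse("unknown (no code source)"))
{code}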
[jira] [Closed] (SPARK-10755) Set the driver also update the token for long-running application
[ https://issues.apache.org/jira/browse/SPARK-10755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus closed SPARK-10755. Resolution: Not A Problem It's not a good idea to do so, since Hadoop RPC can automatically renew the token using the keytab, even though that renewal currently doesn't succeed. This solution does not address the root cause of the problem. > Set the driver also update the token for long-running application > - > > Key: SPARK-10755 > URL: https://issues.apache.org/jira/browse/SPARK-10755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: SaintBacchus > > In yarn-client mode, the driver will write the event logs into HDFS and get > the partition information from HDFS, so it's necessary to update the token from > the *AMDelegationTokenRenewer*. > In yarn-cluster mode, the driver runs alongside the AM and the token will be updated by > the AM. But it's still better to update the token for the client process, since the > client wants to delete the staging dir with an expired token. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10766) Add some configurations for the client process in yarn-cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-10766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10766: Assignee: (was: Apache Spark) > Add some configurations for the client process in yarn-cluster mode. > - > > Key: SPARK-10766 > URL: https://issues.apache.org/jira/browse/SPARK-10766 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus > > In yarn-cluster mode, it's hard to find the correct configuration for the > client process. > But it is sometimes necessary, for example for the client process's classpath: if I want to > use HBase on Spark, I have to include the HBase jars in the client's classpath. > But *spark.driver.extraClassPath* doesn't take effect there. The only workaround I have is to set > the HBase jars through the SPARK_CLASSPATH environment variable. > That isn't a good approach, so I want to add some configuration for this client > process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10766) Add some configurations for the client process in yarn-cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-10766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10766: Assignee: Apache Spark > Add some configurations for the client process in yarn-cluster mode. > - > > Key: SPARK-10766 > URL: https://issues.apache.org/jira/browse/SPARK-10766 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus >Assignee: Apache Spark > > In yarn-cluster mode, it's hard to find the correct configuration for the > client process. > But it is sometimes necessary, for example for the client process's classpath: if I want to > use HBase on Spark, I have to include the HBase jars in the client's classpath. > But *spark.driver.extraClassPath* doesn't take effect there. The only workaround I have is to set > the HBase jars through the SPARK_CLASSPATH environment variable. > That isn't a good approach, so I want to add some configuration for this client > process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10766) Add some configurations for the client process in yarn-cluster mode.
[ https://issues.apache.org/jira/browse/SPARK-10766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907857#comment-14907857 ] Apache Spark commented on SPARK-10766: -- User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/8918 > Add some configurations for the client process in yarn-cluster mode. > - > > Key: SPARK-10766 > URL: https://issues.apache.org/jira/browse/SPARK-10766 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus > > In yarn-cluster mode, it's hard to find the correct configuration for the > client process. > But it is sometimes necessary, for example for the client process's classpath: if I want to > use HBase on Spark, I have to include the HBase jars in the client's classpath. > But *spark.driver.extraClassPath* doesn't take effect there. The only workaround I have is to set > the HBase jars through the SPARK_CLASSPATH environment variable. > That isn't a good approach, so I want to add some configuration for this client > process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
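For context on why the existing setting does not help here: in yarn-cluster mode the driver runs inside the YARN ApplicationMaster container, so driver-side options never reach the local JVM that spark-submit starts on the submitting machine. A hedged sketch follows; the property name is real, while the application name and jar path are made-up placeholders:

{code}
import org.apache.spark.SparkConf

// spark.driver.extraClassPath is applied to the driver JVM. In yarn-cluster
// mode that JVM lives in the AM container on the cluster, which is why the
// reporter sees no effect on the local client process and falls back to the
// SPARK_CLASSPATH environment variable instead.
val conf = new SparkConf()
  .setAppName("hbase-on-spark-example")                   // placeholder name
  .set("spark.driver.extraClassPath", "/opt/hbase/lib/*") // placeholder path
{code}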
[jira] [Comment Edited] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907843#comment-14907843 ] Alan Braithwaite edited comment on SPARK-8734 at 9/25/15 9:21 AM: -- Actually, I've attached a rudimentary diff for network support. As a user though, I'd personally like this done right and have support for arbitrary docker parameters to be passed in. It also looks like there's room to clean up this section of code a bit. Edit: to add it looks like they closed a different issue specific to networking, which is why I didn't bother to submit this back yet. was (Author: abraithwaite): Actually, I've attached a rudimentary diff for network support. As a user though, I'd personally like this done right and have support for arbitrary docker parameters to be passed in. It also looks like there's room to clean up this section of code a bit. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Braithwaite updated SPARK-8734: Attachment: network.diff Actually, I've attached a rudimentary diff for network support. As a user though, I'd personally like this done right and have support for arbitrary docker parameters to be passed in. It also looks like there's room to clean up this section of code a bit. > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > Attachments: network.diff > > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
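For reference, the sketch below lists the DockerInfo-related settings that SPARK-2691 did expose, as documented for Spark 1.5; the image, volume, and port values are placeholders. This is what makes the missing network mode and arbitrary docker parameters stand out:

{code}
import org.apache.spark.SparkConf

// Settings already exposed by SPARK-2691; the ticket asks for the remaining
// DockerInfo fields (network mode, arbitrary docker parameters, etc.).
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "example/spark-executor:1.5.0")
  .set("spark.mesos.executor.docker.volumes", "/host/data:/container/data:ro")
  .set("spark.mesos.executor.docker.portmaps", "8080:80:tcp")
{code}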
[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
[ https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907841#comment-14907841 ] Pete Robbins commented on SPARK-6873: - I no longer see these errors in my 1.5 branch Java 8 build. Did someone fix them, remove the tests or is it just chance? > Some Hive-Catalyst comparison tests fail due to unimportant order of some > printed elements > -- > > Key: SPARK-6873 > URL: https://issues.apache.org/jira/browse/SPARK-6873 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 1.3.1 >Reporter: Sean Owen >Assignee: Cheng Lian >Priority: Minor > > As I mentioned, I've been seeing 4 test failures in Hive tests for a while, > and actually it still affects master. I think it's a superficial problem that > only turns up when running on Java 8, but still, would probably be an easy > fix and good to fix. > Specifically, here are four tests and the bit that fails the comparison, > below. I tried to diagnose this but had trouble even finding where some of > this occurs, like the list of synonyms? > {code} > - show_tblproperties *** FAILED *** > Results do not match for show_tblproperties: > ... > !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == > !tmptruebar bar value > !barbar value tmp true (HiveComparisonTest.scala:391) > {code} > {code} > - show_create_table_serde *** FAILED *** > Results do not match for show_create_table_serde: > ... >WITH SERDEPROPERTIES ( WITH > SERDEPROPERTIES ( > ! 'serialization.format'='$', > 'field.delim'=',', > ! 'field.delim'=',') > 'serialization.format'='$') > {code} > {code} > - udf_std *** FAILED *** > Results do not match for udf_std: > ... > !== HIVE - 2 row(s) == == CATALYST > - 2 row(s) == >std(x) - Returns the standard deviation of a set of numbers std(x) - > Returns the standard deviation of a set of numbers > !Synonyms: stddev_pop, stddev Synonyms: > stddev, stddev_pop (HiveComparisonTest.scala:391) > {code} > {code} > - udf_stddev *** FAILED *** > Results do not match for udf_stddev: > ... > !== HIVE - 2 row(s) ==== > CATALYST - 2 row(s) == >stddev(x) - Returns the standard deviation of a set of numbers stddev(x) > - Returns the standard deviation of a set of numbers > !Synonyms: stddev_pop, stdSynonyms: > std, stddev_pop (HiveComparisonTest.scala:391) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
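The failures above all reduce to comparing strings whose element order is not guaranteed (synonym lists, SERDEPROPERTIES, table properties), which plausibly explains why they only show up under Java 8's different hash iteration order. A small illustrative sketch, not the HiveComparisonTest code, of the kind of order-insensitive comparison that would make such golden-file checks stable:

{code}
// Illustrative only: compare a "Synonyms: ..." line as a set of names so the
// assertion no longer depends on hash iteration order.
def sameSynonyms(expected: String, actual: String): Boolean = {
  def parse(s: String): Set[String] =
    s.stripPrefix("Synonyms:").split(",").map(_.trim).filter(_.nonEmpty).toSet
  parse(expected) == parse(actual)
}

assert(sameSynonyms("Synonyms: stddev_pop, stddev", "Synonyms: stddev, stddev_pop"))
{code}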
[jira] [Updated] (SPARK-10797) RDD's coalesce should not write out the temporary key
[ https://issues.apache.org/jira/browse/SPARK-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltán Zvara updated SPARK-10797: - Description: It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) temporary keys used on the shuffle code path. Consider the following code: {code:title=RDD.scala|borderStyle=solid} if (shuffle) { /** Distributes elements evenly across output partitions, starting from a random partition. */ val distributePartition = (index: Int, items: Iterator[T]) => { var position = (new Random(index)).nextInt(numPartitions) items.map { t => // Note that the hash code of the key will just be the key itself. The HashPartitioner // will mod it with the number of total partitions. position = position + 1 (position, t) } } : Iterator[(Int, T)] // include a shuffle step so that our upstream tasks are still distributed new CoalescedRDD( new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition), new HashPartitioner(numPartitions)), numPartitions).values } else { {code} {{ShuffledRDD}} will hash using {{position}} as keys as in the {{distributePartition}} function. After the bucket has been chosen by the sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the {{DiskBlockObjectWriter}} writes out both the (temporary) key and value to the specified partition. On the next stage, after reading we take only the values with {{PairRDDFunctions}}. This certainly has a performance impact, as we unnecessarily write/read {{Int}} and transform the data. was: It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle files) temporary keys used on the shuffle code path. Consider the following code: {code:title=RDD.scala|borderStyle=solid} if (shuffle) { /** Distributes elements evenly across output partitions, starting from a random partition. */ val distributePartition = (index: Int, items: Iterator[T]) => { var position = (new Random(index)).nextInt(numPartitions) items.map { t => // Note that the hash code of the key will just be the key itself. The HashPartitioner // will mod it with the number of total partitions. position = position + 1 (position, t) } } : Iterator[(Int, T)] // include a shuffle step so that our upstream tasks are still distributed new CoalescedRDD( new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition), new HashPartitioner(numPartitions)), numPartitions).values } else { {code} {{ShuffledRDD}} will hash using {{position}} as keys as in the {{distributePartition}} function. After the bucket has been chosen by the sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the {{DiskBlockObjectWriter}} writes out both the (temporary) key and value to the specified partition. On the next stage, after reading we take only the values with {{PairRDDFunctions}}. This certainly has a performance impact, as we unnecessarily write/read {{Int}}s and transform the data. > RDD's coalesce should not write out the temporary key > - > > Key: SPARK-10797 > URL: https://issues.apache.org/jira/browse/SPARK-10797 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Zoltán Zvara > > It seems that {{RDD.coalesce}} will unnecessarily write out (to shuffle > files) temporary keys used on the shuffle code path. Consider the following > code: > {code:title=RDD.scala|borderStyle=solid} > if (shuffle) { > /** Distributes elements evenly across output partitions, starting from > a random partition. 
*/ > val distributePartition = (index: Int, items: Iterator[T]) => { > var position = (new Random(index)).nextInt(numPartitions) > items.map { t => > // Note that the hash code of the key will just be the key itself. > The HashPartitioner > // will mod it with the number of total partitions. > position = position + 1 > (position, t) > } > } : Iterator[(Int, T)] > // include a shuffle step so that our upstream tasks are still > distributed > new CoalescedRDD( > new ShuffledRDD[Int, T, > T](mapPartitionsWithIndex(distributePartition), > new HashPartitioner(numPartitions)), > numPartitions).values > } else { > {code} > {{ShuffledRDD}} will hash using {{position}} as keys as in the > {{distributePartition}} function. After the bucket has been chosen by the > sorter {{ExternalSorter}} or {{BypassMergeSortShuffleWriter}}, the > {{DiskBlockObjectWriter}} writes out both the (temporary) key and value to > the specified partition. On the next stage, after reading we take only the values with > {{PairRDDFunctions}}. This certainly has a performance impact, as we unnecessarily > write/read {{Int}} and transform the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
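A runnable sketch of the behaviour being described, built from the public API rather than Spark internals (the data set and partition counts are arbitrary): the temporary {{Int}} position key exists only to spread records across partitions, yet it travels through the shuffle alongside each value and is only dropped afterwards by {{values}}.

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.util.Random

// Arbitrary example data and partition counts.
val sc = new SparkContext(new SparkConf().setAppName("coalesce-key-overhead").setMaster("local[4]"))
val data = sc.parallelize(1 to 1000, 100)
val numPartitions = 10

// What coalesce(shuffle = true) does in spirit: attach a temporary Int key,
// hash-partition on it, then immediately discard it. The (Int, value) pair is
// what gets written to and read back from the shuffle files.
val coalesced = data
  .mapPartitionsWithIndex { (index, items) =>
    var position = new Random(index).nextInt(numPartitions)
    items.map { t => position += 1; (position, t) }
  }
  .partitionBy(new HashPartitioner(numPartitions))
  .values

println(coalesced.partitions.length) // 10
sc.stop()
{code}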
[jira] [Comment Edited] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907817#comment-14907817 ] Ondřej Smola edited comment on SPARK-8734 at 9/25/15 8:53 AM: -- I am going to work on this - I need at least the ability to set the Docker network type. Any suggestions on what else can be useful? was (Author: ondrej.smola): I am going to work on this - I need at least the ability to set the Docker network type. Any suggestions on what else will be useful? > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark
[ https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907817#comment-14907817 ] Ondřej Smola commented on SPARK-8734: - I am going to work on this - I need at least the ability to set the Docker network type. Any suggestions on what else will be useful? > Expose all Mesos DockerInfo options to Spark > > > Key: SPARK-8734 > URL: https://issues.apache.org/jira/browse/SPARK-8734 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Chris Heller >Priority: Minor > > SPARK-2691 only exposed a few options from the DockerInfo message. It would > be reasonable to expose them all, especially given one can now specify > arbitrary parameters to docker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10831) Spark SQL Configuration missing in the doc
[ https://issues.apache.org/jira/browse/SPARK-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10831: Assignee: (was: Apache Spark) > Spark SQL Configuration missing in the doc > -- > > Key: SPARK-10831 > URL: https://issues.apache.org/jira/browse/SPARK-10831 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Cheng Hao > > E.g. > spark.sql.codegen > spark.sql.planner.sortMergeJoin > spark.sql.dialect > spark.sql.caseSensitive -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10831) Spark SQL Configuration missing in the doc
[ https://issues.apache.org/jira/browse/SPARK-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10831: Assignee: Apache Spark > Spark SQL Configuration missing in the doc > -- > > Key: SPARK-10831 > URL: https://issues.apache.org/jira/browse/SPARK-10831 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Cheng Hao >Assignee: Apache Spark > > E.g. > spark.sql.codegen > spark.sql.planner.sortMergeJoin > spark.sql.dialect > spark.sql.caseSensitive -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10831) Spark SQL Configuration missing in the doc
[ https://issues.apache.org/jira/browse/SPARK-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907796#comment-14907796 ] Apache Spark commented on SPARK-10831: -- User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/8917 > Spark SQL Configuration missing in the doc > -- > > Key: SPARK-10831 > URL: https://issues.apache.org/jira/browse/SPARK-10831 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Cheng Hao > > E.g. > spark.sql.codegen > spark.sql.planner.sortMergeJoin > spark.sql.dialect > spark.sql.caseSensitive -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10831) Spark SQL Configuration missing in the doc
Cheng Hao created SPARK-10831: - Summary: Spark SQL Configuration missing in the doc Key: SPARK-10831 URL: https://issues.apache.org/jira/browse/SPARK-10831 Project: Spark Issue Type: Documentation Components: SQL Reporter: Cheng Hao E.g. spark.sql.codegen spark.sql.planner.sortMergeJoin spark.sql.dialect spark.sql.caseSensitive -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
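Although the documentation omits them, these settings can already be set and inspected at runtime; a short sketch against the 1.5-era {{SQLContext}} API, with arbitrary example values:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-conf-example").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// The properties named in the ticket; the values here are example choices.
sqlContext.setConf("spark.sql.caseSensitive", "false")
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
println(sqlContext.getConf("spark.sql.dialect", "sql")) // default shown if unset

sc.stop()
{code}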