[jira] [Comment Edited] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001136#comment-14001136 ] Xiangrui Meng edited comment on SPARK-1859 at 5/18/14 5:27 PM: --- The step size should be smaller than 1 over the Lipschitz constant L. Your example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To make it converge, you need to set the step size smaller than (1.0/1500/1500). Yes, it looks like a simple problem, but it is actually ill-conditioned. scikit-learn may use line search or directly solve the least squares problem, while we didn't implement line search in LinearRegressionWithSGD. You can try LBFGS in the current master, which should work for your example. was (Author: mengxr): The step size should be smaller than the Lipschitz constant L. Your example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To make it converge, you need to set the step size smaller than (1.0/1500/1500). Yes, it looks like a simple problem, but it is actually ill-conditioned. scikit-learn may use line search or directly solve the least squares problem, while we didn't implement line search in LinearRegressionWithSGD. You can try LBFGS in the current master, which should work for your example. Linear, Ridge and Lasso Regressions with SGD yield unexpected results - Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Labels: algorithm, machine_learning, regression Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok.
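Xiangrui's bound is easy to verify outside of Spark. Below is a minimal sketch (plain Python, not MLlib code; the function name and iteration counts are illustrative) of fixed-step gradient descent on the single term 0.5 * (1500 * w - 2400)^2, whose curvature is L = 1500 * 1500: a step of 1/L converges to the true slope 1.6, while a step of 1.0 overflows and yields exactly the NaN coefficients reported in the issue.

```python
# Plain gradient descent on f(w) = 0.5 * (1500*w - 2400)^2.
# The second derivative (Lipschitz constant of the gradient) is L = 1500^2.

def gradient_descent(step, iters=200):
    w = 0.0
    for _ in range(iters):
        grad = 1500.0 * (1500.0 * w - 2400.0)  # f'(w)
        w -= step * grad
    return w

L = 1500.0 * 1500.0
print(gradient_descent(step=1.0 / L))  # converges to 1.6
print(gradient_descent(step=1.0))      # error grows each step; overflows to NaN
```

With step <= 1/L each update contracts the error toward the minimizer; with step > 2/L each update multiplies the error by |1 - step * L| > 1, which matches the -1.55e+296-style coefficients seen when records are removed.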
Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
{code}
So we have a line ({{f(X) = 1.6 * X}}) here. (Fortunately, {{f(X) = X}} works! :)) The resulting model has NaN coeffs: {{array([ nan])}}. Furthermore, if you comment out the records one by one, you get:
* [-1.55897475e+296] coeff (the first record commented out),
* [-8.62115396e+104] coeff (the first two records commented out),
* etc.
It looks like the implemented regression algorithms diverge somehow. I get almost the same results with Ridge and Lasso. I've also tested these inputs in scikit-learn, and it works as expected there. However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001136#comment-14001136 ] Xiangrui Meng commented on SPARK-1859: -- The step size should be smaller than the Lipschitz constant L. Your example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To make it converge, you need to set the step size smaller than (1.0/1500/1500). Yes, it looks like a simple problem, but it is actually ill-conditioned. scikit-learn may use line search or directly solve the least squares problem, while we didn't implement line search in LinearRegressionWithSGD. You can try LBFGS in the current master, which should work for your example. Linear, Ridge and Lasso Regressions with SGD yield unexpected results - Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Labels: algorithm, machine_learning, regression Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok.
Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
{code}
So we have a line ({{f(X) = 1.6 * X}}) here. (Fortunately, {{f(X) = X}} works! :)) The resulting model has NaN coeffs: {{array([ nan])}}. Furthermore, if you comment out the records one by one, you get:
* [-1.55897475e+296] coeff (the first record commented out),
* [-8.62115396e+104] coeff (the first two records commented out),
* etc.
It looks like the implemented regression algorithms diverge somehow. I get almost the same results with Ridge and Lasso. I've also tested these inputs in scikit-learn, and it works as expected there. However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001151#comment-14001151 ] Andrew Ash commented on SPARK-1860: --- [~mkim] is going to take a look at this after discussion at https://issues.apache.org/jira/browse/SPARK-1154 I think the correct fix, as Patrick outlines, would be:
{code}
// pseudocode
for folder in onDiskFolders:
    if folder is owned by a running application:
        continue
    if folder contains any folder/file (recursively) that is more recently touched (mtime) than the TTL:
        continue
    cleanUp(folder)
{code}
Schedule that to run periodically (interval configured by a setting) and this should be all fixed up. Is that right? An alternative approach could be to have the executor clean up the application's work directory when the application terminates, but an unclean executor shutdown could still leave work directories around, so a TTL approach still needs to be included as well. Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code clean up all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
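The pseudocode above can be sketched concretely. This is a hedged illustration in plain Python (a hypothetical helper, not the actual Worker code; the parameter names and flat per-application directory layout are assumptions) showing the two guards: skip folders owned by running applications, and skip folders containing anything touched more recently than the TTL.

```python
import os
import shutil
import time

def cleanup_work_dirs(work_dir, running_app_ids, ttl_seconds):
    """Remove per-application folders that are neither running nor recently touched."""
    now = time.time()
    for name in os.listdir(work_dir):
        folder = os.path.join(work_dir, name)
        if not os.path.isdir(folder):
            continue
        if name in running_app_ids:
            # Owned by a running application: never clean up.
            continue
        # Most recent mtime of any file inside the folder (the folder's own
        # mtime if it is empty).
        newest = max(
            (os.path.getmtime(os.path.join(root, f))
             for root, _, files in os.walk(folder) for f in files),
            default=os.path.getmtime(folder),
        )
        if now - newest < ttl_seconds:
            # Something was touched more recently than the TTL: keep it.
            continue
        shutil.rmtree(folder)
```

Scheduling this on a periodic timer (with the interval taken from a setting) would match the behavior Andrew describes.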
[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001161#comment-14001161 ] Patrick Wendell commented on SPARK-1870: The jars may not be present on the classpath because we add them through dynamic classloading rather than by modifying the system classpath. What happens if you also call sc.addJar(X) with the filename of the jar inside your application? In the future it might be nice to automatically call this for you, but I think for now you need to do it yourself in YARN mode. Here are the relevant docs: http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-yarn.html#adding-additional-jars These were written by [~sandyr]. Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Critical With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However,
{code}
sc.parallelize(0 to 10, 10).map { i =>
  System.getProperty("java.class.path")
}.collect().foreach(println)
{code}
shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1783) Title contains html code in MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001183#comment-14001183 ] Xiangrui Meng commented on SPARK-1783: -- Added `displayTitle` variable to the global layout. If this is defined, use it instead of `title` for the page title in the `h1`. Title contains html code in MLlib guide --- Key: SPARK-1783 URL: https://issues.apache.org/jira/browse/SPARK-1783 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor We use {{--- layout: global title: <a href="mllib-guide.html">MLlib</a> - Clustering ---}} to create a link in the title to the main page of MLlib's guide. However, the generated title contains raw HTML code, which shows up in the tab or title bar of the browser. -- This message was sent by Atlassian JIRA (v6.2#6252)
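A sketch of how a page's front matter might look after the change (assuming the global layout prefers `displayTitle` for the rendered heading and falls back to `title`; the exact variable handling lives in the layout template):

```yaml
---
layout: global
title: MLlib - Clustering
displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
---
```

The browser tab then shows the plain `title`, while the rendered `h1` keeps the HTML link.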
[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0
[ https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1871: - Summary: Improve MLlib guide for v1.0 (was: Improve MLlib guide) Improve MLlib guide for v1.0 Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1872) Update api links for unidoc
Xiangrui Meng created SPARK-1872: Summary: Update api links for unidoc Key: SPARK-1872 URL: https://issues.apache.org/jira/browse/SPARK-1872 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Should use unidoc for API links. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1783) Title contains html code in MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1783: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-1871 Title contains html code in MLlib guide --- Key: SPARK-1783 URL: https://issues.apache.org/jira/browse/SPARK-1783 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor We use {{--- layout: global title: <a href="mllib-guide.html">MLlib</a> - Clustering ---}} to create a link in the title to the main page of MLlib's guide. However, the generated title contains raw HTML code, which shows up in the tab or title bar of the browser. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0
[ https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1871: - Description: More improvements to MLlib guide. Improve MLlib guide for v1.0 Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng More improvements to MLlib guide. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1869) `spark-shell --help` fails if called from outside spark home
[ https://issues.apache.org/jira/browse/SPARK-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1869. Resolution: Fixed Fixed by: https://github.com/apache/spark/pull/812 `spark-shell --help` fails if called from outside spark home Key: SPARK-1869 URL: https://issues.apache.org/jira/browse/SPARK-1869 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell Priority: Critical Fix For: 1.1.0, 1.0.1 When a user runs the shell with `--help` from outside of the Spark directory, it doesn't call spark-submit at the correct location:
{code}
$ /home/patrick/Documents/spark/bin/spark-shell --help
Usage: ./bin/spark-shell [options]
/home/patrick/Documents/spark/bin/spark-shell: line 33: ./bin/spark-submit: No such file or directory
{code}
The fix is simple: we should just use the full path, as in other places where we invoke the shell. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001195#comment-14001195 ] Xiangrui Meng commented on SPARK-1870: -- I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the Web UI, I see:
{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar   System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar   Added By User
{code}
So it is in the distributed cache and is also served by the master via HTTP. However, I still got a ClassNotFoundException. Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Critical With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However,
{code}
sc.parallelize(0 to 10, 10).map { i =>
  System.getProperty("java.class.path")
}.collect().foreach(println)
{code}
shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001195#comment-14001195 ] Xiangrui Meng edited comment on SPARK-1870 at 5/18/14 8:36 PM: --- I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the Web UI, I see:
{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar   System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar   Added By User
{code}
So it is in the distributed cache and is also served by the master via HTTP. However, I still got a ClassNotFoundException. was (Author: mengxr): I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the Web UI, I see:
{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar   System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar   Added By User
{code}
So it is in the distributed cache and is also served by the master via HTTP. However, I still got a ClassNotFoundException. Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Critical With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However,
{code}
sc.parallelize(0 to 10, 10).map { i =>
  System.getProperty("java.class.path")
}.collect().foreach(println)
{code}
shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1871) Improve MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1871: - Component/s: Documentation Improve MLlib guide --- Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1873) Add README.md file when making distributions
Patrick Wendell created SPARK-1873: -- Summary: Add README.md file when making distributions Key: SPARK-1873 URL: https://issues.apache.org/jira/browse/SPARK-1873 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1874) Clean up MLlib sample data
Matei Zaharia created SPARK-1874: Summary: Clean up MLlib sample data Key: SPARK-1874 URL: https://issues.apache.org/jira/browse/SPARK-1874 Project: Spark Issue Type: Bug Components: MLlib Reporter: Matei Zaharia Fix For: 1.0.0 - Replace logistic regression example data with linear to make mllib.LinearRegression example easier to run - Move files from mllib/data into data/mllib to make them easier to find - Add a simple MovieLens data file -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1871) Improve MLlib guide
Xiangrui Meng created SPARK-1871: Summary: Improve MLlib guide Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1874) Clean up MLlib sample data
[ https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001254#comment-14001254 ] Xiangrui Meng commented on SPARK-1874: -- Is `data/mllib` a better place than `mllib/data`? Clean up MLlib sample data -- Key: SPARK-1874 URL: https://issues.apache.org/jira/browse/SPARK-1874 Project: Spark Issue Type: Bug Components: MLlib Reporter: Matei Zaharia Fix For: 1.0.0 - Replace logistic regression example data with linear to make mllib.LinearRegression example easier to run - Move files from mllib/data into data/mllib to make them easier to find - Add a simple MovieLens data file -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1873) Add README.md file when making distributions
Patrick Wendell created SPARK-1873: -- Summary: Add README.md file when making distributions Key: SPARK-1873 URL: https://issues.apache.org/jira/browse/SPARK-1873 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1875: - Fix Version/s: 1.0.0 NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Priority: Critical Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this:
{code}
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
  at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
  at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
  at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1875: - Priority: Blocker (was: Critical) NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this:
{code}
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
  at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
  at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
  at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001297#comment-14001297 ] Matei Zaharia commented on SPARK-1875: -- This may have been broken by https://issues.apache.org/jira/browse/SPARK-1629 / https://github.com/apache/spark/pull/569, which added an explicit dependency on commons-lang, though it's not clear. NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this:
{code}
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
  at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
  at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
  at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
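One hedged workaround sketch (not the fix that was eventually merged): declare commons-lang directly in the affected pom so the hive module's exclusion cannot drop it from the Hadoop 1 assembly. The version below is an assumption based on the commons-lang 2.x line that Hadoop 1.x depends on.

```xml
<!-- Hypothetical: pin commons-lang explicitly so the Hadoop 1 assembly
     keeps org/apache/commons/lang/StringUtils on the classpath. -->
<dependency>
  <groupId>commons-lang</groupId>
  <artifactId>commons-lang</artifactId>
  <version>2.4</version>
</dependency>
```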
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001336#comment-14001336 ] Patrick Wendell commented on SPARK-1875: The issue was caused by this patch: https://github.com/apache/spark/pull/754. I need to look further to figure out what was going on. NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this:
{code}
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
  at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
  at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
  at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001358#comment-14001358 ] Patrick Wendell commented on SPARK-1875: [~witgo]. Here is how I reproduced it:
{code}
./make-distribution.sh --with-hive --tgz
{code}
Then run spark-shell from the distribution. This is mostly equivalent to running
{code}
mvn package -Phive
{code}
NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this:
{code}
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
  at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
  at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
  at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001358#comment-14001358 ] Patrick Wendell edited comment on SPARK-1875 at 5/19/14 3:36 AM: - [~witgo]. Here is how I reproduced it:
{code}
./make-distribution.sh --with-hive --tgz
{code}
Then run spark-shell from the distribution. This is mostly equivalent to running
{code}
mvn package -Phive -DskipTests
{code}
was (Author: pwendell): [~witgo]. Here is how I reproduced it:
{code}
./make-distribution.sh --with-hive --tgz
{code}
Then run spark-shell from the distribution. This is mostly equivalent to running
{code}
mvn package -Phive
{code}
NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this:
{code}
java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
  at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
  at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
  at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
  at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
  at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
  at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001366#comment-14001366 ]

Patrick Wendell commented on SPARK-1875:
----------------------------------------

The issue here is that somehow the commons-lang exclusion from the hive project is being respected when building an assembly for Hadoop 1. So it's excluded from hadoop-client even though hadoop-client 1.0.4 depends on it.
{code}
mvn -Phive install
mvn -pl assembly -Phive dependency:tree

[INFO] Scanning for projects...
[INFO]
[INFO] Building Spark Project Assembly 1.0.1-SNAPSHOT
[INFO]
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-assembly_2.10 ---
[INFO] org.apache.spark:spark-assembly_2.10:pom:1.0.1-SNAPSHOT
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.0.1-SNAPSHOT:compile
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:1.0.4:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-core:jar:1.0.4:compile
[INFO] |  |     +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  |     +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] |  |     +- commons-el:commons-el:jar:1.0:compile
[INFO] |  |     +- hsqldb:hsqldb:jar:1.8.0.10:compile
[INFO] |  |     \- oro:oro:jar:2.0.8:compile
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:runtime
[INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:runtime
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
{code}
If you run
{code}
mvn -pl assembly dependency:tree
{code}
it includes commons-lang correctly.
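The failure mode Patrick describes, an exclusion declared for the Hive dependency leaking into the Hadoop 1 assembly graph, can be illustrated with a toy resolver. This is purely illustrative and is not Maven's actual resolution algorithm; all module names mirror the thread but the `DEPS` graph, `EDGE_EXCLUSIONS` table, and `resolve` function are hypothetical. The point: an exclusion meant for one edge (the Hive subtree), if accidentally applied globally, also strips commons-lang from hadoop-core's subtree, which is exactly the NoClassDefFoundError above.

```python
# Toy dependency resolver (illustrative only; NOT how Maven works).
# An exclusion is declared on the spark-hive -> hive-exec edge only;
# applying it globally removes commons-lang from the whole graph.

DEPS = {
    "spark-assembly": ["spark-core", "spark-hive"],
    "spark-core": ["hadoop-client"],
    "hadoop-client": ["hadoop-core"],
    "hadoop-core": ["commons-lang", "xmlenc"],
    "spark-hive": ["hive-exec"],
    "hive-exec": ["commons-lang"],
}

# Exclusion scoped to a single dependency edge, as Maven intends.
EDGE_EXCLUSIONS = {("spark-hive", "hive-exec"): {"commons-lang"}}

def resolve(root, global_exclusions=frozenset()):
    """Collect the transitive closure, honoring per-edge exclusions."""
    seen = set()

    def walk(node, inherited):
        for dep in DEPS.get(node, []):
            # Exclusions inherited down this path plus any on this edge.
            excl = inherited | EDGE_EXCLUSIONS.get((node, dep), set())
            if dep in excl:
                continue
            if dep not in seen:
                seen.add(dep)
                walk(dep, excl)

    walk(root, set(global_exclusions))
    return seen

correct = resolve("spark-assembly")
buggy = resolve("spark-assembly", global_exclusions={"commons-lang"})
print("commons-lang" in correct)  # True: still reachable via hadoop-core
print("commons-lang" in buggy)    # False: the leaked exclusion strips it
```

With the exclusion correctly scoped, commons-lang survives through the hadoop-core path even though the Hive subtree drops it; once it leaks globally, the class disappears from the assembly.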
[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
[ https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001377#comment-14001377 ]

Mridul Muralidharan commented on SPARK-1855:
--------------------------------------------

Did not realize that mail replies to JIRA mails did not get mirrored to JIRA! Replicating my mail here:

-- cut and paste --

We don't have 3x replication in Spark :-) And while using a replicated storage level decreases the odds of failure, it does not eliminate it (since we are not doing a great job with replication anyway, from a fault-tolerance point of view). Also, it takes a nontrivial performance hit with replicated levels.

Regards,
Mridul

Provide memory-and-local-disk RDD checkpointing
-----------------------------------------------

                 Key: SPARK-1855
                 URL: https://issues.apache.org/jira/browse/SPARK-1855
             Project: Spark
          Issue Type: New Feature
          Components: MLlib, Spark Core
    Affects Versions: 1.0.0
            Reporter: Xiangrui Meng

Checkpointing is used to cut a long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using the BlockRDD, we can create in-memory-and-local-disk (with replication) checkpoints that are not as reliable as the HDFS-based solution but faster. It can help applications that require many iterations.
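Mridul's point, that replication shrinks the odds of losing a checkpointed block but never eliminates them, can be sketched numerically. This is an illustrative back-of-the-envelope model (not Spark code), assuming independent node failures with probability p and a hypothetical replication factor r:

```python
# Illustrative sketch (not Spark code): a block is lost only if all r
# replicas fail. Assuming independent node failures with probability p,
# replication shrinks the loss probability geometrically, but it never
# reaches zero -- the trade-off discussed in the comment above.
def block_loss_probability(p: float, r: int) -> float:
    """Probability that all r replicas of a block are lost."""
    return p ** r

for r in (1, 2, 3):
    print(f"r={r}: loss probability {block_loss_probability(0.01, r):.0e}")
```

So 2x in-memory/local-disk replication already makes loss rare enough to be useful for iterative jobs, while an HDFS checkpoint (typically 3x replicated, durable storage) remains strictly more reliable.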
[jira] [Comment Edited] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001379#comment-14001379 ]

Guoqiang Li edited comment on SPARK-1875 at 5/19/14 4:23 AM:
------------------------------------------------------------

[~pwendell], [~matei] Do you have time to review the code? https://github.com/apache/spark/pull/820

was (Author: gq):
[~ pwendell], [~ matei] Do you have time to review the code? https://github.com/apache/spark/pull/820
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001379#comment-14001379 ]

Guoqiang Li commented on SPARK-1875:
------------------------------------

[~ pwendell], [~ matei] Do you have time to review the code? https://github.com/apache/spark/pull/820