[jira] [Comment Edited] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001136#comment-14001136
 ] 

Xiangrui Meng edited comment on SPARK-1859 at 5/18/14 5:27 PM:
---

The step size should be smaller than 1 over the Lipschitz constant L. Your 
example contains the term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 
1500, so to make it converge you need a step size smaller than 
1.0 / (1500 * 1500). Yes, it looks like a simple problem, but it is actually 
ill-conditioned.

scikit-learn may use line search or solve the least squares problem directly, 
while we didn't implement line search in LinearRegressionWithSGD. You can try 
LBFGS in the current master, which should work for your example.
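
For illustration, a minimal sketch against the 1.0 Scala API (the data is the 
reporter's, rewritten as LabeledPoints; the iteration count is an arbitrary 
choice, not a tested recipe):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data = sc.parallelize(Seq(
  LabeledPoint(2400.0, Vectors.dense(1500.0)),
  LabeledPoint(240.0, Vectors.dense(150.0)),
  LabeledPoint(24.0, Vectors.dense(15.0)),
  LabeledPoint(2.4, Vectors.dense(1.5)),
  LabeledPoint(0.24, Vectors.dense(0.15))))

// The Lipschitz constant is dominated by the largest squared feature
// (1500 * 1500), so keep the step size below its inverse.
val model = LinearRegressionWithSGD.train(data, 1000, 1.0 / (1500.0 * 1500.0))
println(model.weights)  // should move toward the true coefficient 1.6
{code}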


was (Author: mengxr):
The step size should be smaller than the Lipschitz constant L. Your example 
contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To 
make it converge, you need to set step size smaller than (1.0/1500/1500). Yes, 
it looks like a simple problem, but it is actually ill-conditioned.

scikit-learn may use line search or directly solve the least square problem, 
while we didn't implement line search in LinearRegressionWithSGD. You can try 
LBFGS in the current master, which should work for your example.

 Linear, Ridge and Lasso Regressions with SGD yield unexpected results
 -

 Key: SPARK-1859
 URL: https://issues.apache.org/jira/browse/SPARK-1859
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.1
 Environment: OS: Ubuntu Server 12.04 x64
 PySpark
Reporter: Vlad Frolov
  Labels: algorithm, machine_learning, regression

 Issue:
 Linear Regression with SGD doesn't work as expected on any data except 
 lpsa.dat (the bundled example).
 Ridge Regression with SGD *sometimes* works ok.
 Lasso Regression with SGD *sometimes* works ok.
 Code example (PySpark) based on 
 http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
 {code:title=regression_example.py}
 from numpy import array
 from pyspark.mllib.regression import LinearRegressionWithSGD

 # Each record is [label, feature], i.e. y = 1.6 * x
 parsedData = sc.parallelize([
     array([2400., 1500.]),
     array([240., 150.]),
     array([24., 15.]),
     array([2.4, 1.5]),
     array([0.24, 0.15])
 ])
 # Build the model
 model = LinearRegressionWithSGD.train(parsedData)
 print model._coeffs
 {code}
 So we have the line {{f(X) = 1.6 * X}} here. Fortunately, {{f(X) = X}} works! 
 :)
 The resulting model has nan coeffs: {{array([ nan])}}.
 Furthermore, if you comment out the records one by one you will get:
 * [-1.55897475e+296] coeff (the first record is commented out), 
 * [-8.62115396e+104] coeff (the first two records are commented out),
 * etc
 It looks like the implemented regression algorithms diverge somehow.
 I get almost the same results with Ridge and Lasso.
 I've also tested these inputs in scikit-learn and it works as expected there.
 However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I 
 preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001136#comment-14001136
 ] 

Xiangrui Meng commented on SPARK-1859:
--

The step size should be smaller than the Lipschitz constant L. Your example 
contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To 
make it converge, you need to set step size smaller than (1.0/1500/1500). Yes, 
it looks like a simple problem, but it is actually ill-conditioned.

scikit-learn may use line search or directly solve the least square problem, 
while we didn't implement line search in LinearRegressionWithSGD. You can try 
LBFGS in the current master, which should work for your example.

 Linear, Ridge and Lasso Regressions with SGD yield unexpected results
 -

 Key: SPARK-1859
 URL: https://issues.apache.org/jira/browse/SPARK-1859
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.1
 Environment: OS: Ubuntu Server 12.04 x64
 PySpark
Reporter: Vlad Frolov
  Labels: algorithm, machine_learning, regression

 Issue:
 Linear Regression with SGD doesn't work as expected on any data except 
 lpsa.dat (the bundled example).
 Ridge Regression with SGD *sometimes* works ok.
 Lasso Regression with SGD *sometimes* works ok.
 Code example (PySpark) based on 
 http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
 {code:title=regression_example.py}
 from numpy import array
 from pyspark.mllib.regression import LinearRegressionWithSGD

 # Each record is [label, feature], i.e. y = 1.6 * x
 parsedData = sc.parallelize([
     array([2400., 1500.]),
     array([240., 150.]),
     array([24., 15.]),
     array([2.4, 1.5]),
     array([0.24, 0.15])
 ])
 # Build the model
 model = LinearRegressionWithSGD.train(parsedData)
 print model._coeffs
 {code}
 So we have the line {{f(X) = 1.6 * X}} here. Fortunately, {{f(X) = X}} works! 
 :)
 The resulting model has nan coeffs: {{array([ nan])}}.
 Furthermore, if you comment out the records one by one you will get:
 * [-1.55897475e+296] coeff (the first record is commented out), 
 * [-8.62115396e+104] coeff (the first two records are commented out),
 * etc
 It looks like the implemented regression algorithms diverge somehow.
 I get almost the same results with Ridge and Lasso.
 I've also tested these inputs in scikit-learn and it works as expected there.
 However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I 
 preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications

2014-05-18 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001151#comment-14001151
 ] 

Andrew Ash commented on SPARK-1860:
---

[~mkim] is going to take a look at this after discussion at 
https://issues.apache.org/jira/browse/SPARK-1154

I think the correct fix as Patrick outlines would be:

{code}
// pseudocode
for folder in onDiskFolders:
    if folder is owned by a running application:
        continue
    if folder contains any folder/file (recursively) touched (mtime) more recently than the TTL:
        continue
    cleanUp(folder)
{code}

Schedule that to run periodically (with the interval configured by a setting) 
and this should be all fixed up.

Is that right?

An alternative approach could be to have the executor clean up the application's 
work directory when the application terminates, but an unclean executor shutdown 
could still leave work directories around, so a TTL approach would still be 
needed as well.
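
A minimal Scala sketch of that check ({{newestMtime}}, {{cleanupWorkerDirs}}, and 
the parameters are hypothetical names for illustration, not the actual Worker 
code):

{code}
import java.io.File

// Newest modification time of anything under dir, the directory itself included.
def newestMtime(dir: File): Long = {
  val children = Option(dir.listFiles).getOrElse(Array.empty[File])
  (dir.lastModified +: children.map(newestMtime)).max
}

def cleanupWorkerDirs(workDir: File, ttlMs: Long, runningAppIds: Set[String]): Unit = {
  val now = System.currentTimeMillis
  for (folder <- Option(workDir.listFiles).getOrElse(Array.empty[File]) if folder.isDirectory) {
    val ownedByRunningApp = runningAppIds.contains(folder.getName)
    val touchedWithinTtl = now - newestMtime(folder) < ttlMs
    if (!ownedByRunningApp && !touchedWithinTtl) {
      // cleanUp(folder): delete recursively, e.g. commons-io FileUtils.deleteDirectory
    }
  }
}
{code}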

 Standalone Worker cleanup should not clean up running applications
 --

 Key: SPARK-1860
 URL: https://issues.apache.org/jira/browse/SPARK-1860
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Critical
 Fix For: 1.1.0


 The default values of the standalone worker cleanup code cleanup all 
 application data every 7 days. This includes jars that were added to any 
 applications that happen to be running for longer than 7 days, hitting 
 streaming jobs especially hard.
 Applications should not be cleaned up if they're still running. Until then, 
 this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-18 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001161#comment-14001161
 ] 

Patrick Wendell commented on SPARK-1870:


The jars may not be present on the classpath because we add them through 
dynamic classloading and not by modifying the system classpath.

What happens if you also call sc.addJar(X) with the filename of the jar inside 
your application? In the future it might be nice to automatically call this for 
you, but I think for now you need to do it yourself in YARN mode. Here are the 
relevant docs:

http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-yarn.html#adding-additional-jars

These were written by [~sandyr].
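
As a sketch, the workaround looks like this (the app name is made up; the jar 
must be the same one passed to --jars):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hello"))
// Register the jar from inside the application so executors can fetch it
// through Spark's dynamic classloader.
sc.addJar("hello_2.10.jar")
{code}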

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Critical

 With `spark-submit`, jars specified via `--jars` are added to the distributed 
 cache in `yarn-cluster` mode. The executor should add cached jars to the 
 classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in 
 the distributed cache.
 The workaround is to use an assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1783) Title contains html code in MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001183#comment-14001183
 ] 

Xiangrui Meng commented on SPARK-1783:
--

Added a `displayTitle` variable to the global layout. If it is defined, it is 
used instead of `title` for the page title in the `h1`.
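
A sketch of how a page would use it (assuming the layout falls back to `title` 
when `displayTitle` is absent):

---
layout: global
displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
title: MLlib - Clustering
---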

 Title contains html code in MLlib guide
 ---

 Key: SPARK-1783
 URL: https://issues.apache.org/jira/browse/SPARK-1783
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 We use 
 ---
 layout: global
 title: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 to create a link in the title to the main page of MLlib's guide. However, the 
 generated title contains raw html code, which shows up in the tab or title 
 bar of the browser.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1871:
-

Summary: Improve MLlib guide for v1.0  (was: Improve MLlib guide)

 Improve MLlib guide for v1.0
 

 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1872) Update api links for unidoc

2014-05-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1872:


 Summary: Update api links for unidoc
 Key: SPARK-1872
 URL: https://issues.apache.org/jira/browse/SPARK-1872
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng


Should use unidoc for API links.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1783) Title contains html code in MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1783:
-

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-1871

 Title contains html code in MLlib guide
 ---

 Key: SPARK-1783
 URL: https://issues.apache.org/jira/browse/SPARK-1783
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 We use 
 ---
 layout: global
 title: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 to create a link in the title to the main page of MLlib's guide. However, the 
 generated title contains raw html code, which shows up in the tab or title 
 bar of the browser.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1871:
-

Description: More improvements to MLlib guide.

 Improve MLlib guide for v1.0
 

 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Xiangrui Meng

 More improvements to MLlib guide.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1869) `spark-shell --help` fails if called from outside spark home

2014-05-18 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1869.


Resolution: Fixed

Fixed by: https://github.com/apache/spark/pull/812

 `spark-shell --help` fails if called from outside spark home
 

 Key: SPARK-1869
 URL: https://issues.apache.org/jira/browse/SPARK-1869
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Priority: Critical
 Fix For: 1.1.0, 1.0.1


 When a user runs the shell with `--help` from outside of the Spark directory, 
 it doesn't invoke spark-submit by its full path:
 {code}
 $ /home/patrick/Documents/spark/bin/spark-shell --help
 Usage: ./bin/spark-shell [options]
 /home/patrick/Documents/spark/bin/spark-shell: line 33: ./bin/spark-submit: No such file or directory
 {code}
 The fix is simple: we should just use the full path, as in the other places 
 where we invoke the shell.
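
 A sketch of that kind of fix (the variable name is illustrative, not 
 necessarily what the script uses):
 {code}
 # Resolve the Spark home from the script's own location instead of the
 # caller's working directory.
 FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
 exec "$FWDIR"/bin/spark-submit spark-shell "$@"
 {code}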



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001195#comment-14001195
 ] 

Xiangrui Meng commented on SPARK-1870:
--

I specified the jar via `--jars` and add it with `sc.addJar` explicitly. In the 
Web UI, I see:

{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar    System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar    Added By User
{code}

So it is in distributed cache as well as served by master via http. However, I 
still got ClassNotFoundException.

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Critical

 With `spark-submit`, jars specified via `--jars` are added to the distributed 
 cache in `yarn-cluster` mode. The executor should add cached jars to the 
 classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in 
 the distributed cache.
 The workaround is to use an assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001195#comment-14001195
 ] 

Xiangrui Meng edited comment on SPARK-1870 at 5/18/14 8:36 PM:
---

I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In 
the Web UI, I see:

{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar    System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar    Added By User
{code}

So it is in the distributed cache as well as served by the master via http. 
However, I still got a ClassNotFoundException.


was (Author: mengxr):
I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In 
the Web UI, I see:

{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar    System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar    Added By User
{code}

So it is in distributed cache as well as served by master via http. However, I 
still got ClassNotFoundException.

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Critical

 With `spark-submit`, jars specified via `--jars` are added to the distributed 
 cache in `yarn-cluster` mode. The executor should add cached jars to the 
 classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in 
 the distributed cache.
 The workaround is to use an assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1871) Improve MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1871:
-

Component/s: Documentation

 Improve MLlib guide
 ---

 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1873) Add README.md file when making distributions

2014-05-18 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1873:
--

 Summary: Add README.md file when making distributions
 Key: SPARK-1873
 URL: https://issues.apache.org/jira/browse/SPARK-1873
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1874) Clean up MLlib sample data

2014-05-18 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1874:


 Summary: Clean up MLlib sample data
 Key: SPARK-1874
 URL: https://issues.apache.org/jira/browse/SPARK-1874
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Matei Zaharia
 Fix For: 1.0.0


- Replace logistic regression example data with linear to make 
mllib.LinearRegression example easier to run
- Move files from mllib/data into data/mllib to make them easier to find
- Add a simple MovieLens data file



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1871) Improve MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1871:


 Summary: Improve MLlib guide
 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1874) Clean up MLlib sample data

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001254#comment-14001254
 ] 

Xiangrui Meng commented on SPARK-1874:
--

Is `data/mllib` a better place than `mllib/data`?

 Clean up MLlib sample data
 --

 Key: SPARK-1874
 URL: https://issues.apache.org/jira/browse/SPARK-1874
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Matei Zaharia
 Fix For: 1.0.0


 - Replace logistic regression example data with linear to make 
 mllib.LinearRegression example easier to run
 - Move files from mllib/data into data/mllib to make them easier to find
 - Add a simple MovieLens data file



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1873) Add README.md file when making distributions

2014-05-18 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1873:
--

 Summary: Add README.md file when making distributions
 Key: SPARK-1873
 URL: https://issues.apache.org/jira/browse/SPARK-1873
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1875:
-

Fix Version/s: 1.0.0

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Priority: Critical
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1875:
-

Priority: Blocker  (was: Critical)

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001297#comment-14001297
 ] 

Matei Zaharia commented on SPARK-1875:
--

This may have been broken by https://issues.apache.org/jira/browse/SPARK-1629 / 
https://github.com/apache/spark/pull/569, which added an explicit dependency on 
commons-lang, though it's not clear.

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001336#comment-14001336
 ] 

Patrick Wendell commented on SPARK-1875:


The issue was caused by this patch. I need to look further to figure out what 
was going on.

https://github.com/apache/spark/pull/754

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001358#comment-14001358
 ] 

Patrick Wendell commented on SPARK-1875:


[~witgo]. Here is how I reproduced it:

{code}
./make-distribution.sh --with-hive --tgz
{code}

Then run spark-shell from the distribution. This is mostly equivalent to running

{code}
mvn package -Phive
{code}

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Guoqiang Li
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001358#comment-14001358
 ] 

Patrick Wendell edited comment on SPARK-1875 at 5/19/14 3:36 AM:
-

[~witgo]. Here is how I reproduced it:

{code}
./make-distribution.sh --with-hive --tgz
{code}

Then run spark-shell from the distribution. This is mostly equivalent to running

{code}
mvn package -Phive -DskipTests
{code}


was (Author: pwendell):
[~witgo]. Here is how I reproduced it:

{code}
./make-distribution.sh --with-hive --tgz
{code}

Then run spark-shell from the distribution. This is mostly equivalent to running

{code}
mvn package -Phive
{code}

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Guoqiang Li
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001366#comment-14001366
 ] 

Patrick Wendell commented on SPARK-1875:


The issue here is that somehow the commons-lang exclusion from the hive project 
is being respected when building an assembly for Hadoop 1. So it's excluded 
from hadoop-client even though hadoop-client 1.0.4 depends on it.

{code}
mvn -Phive install
mvn -pl assembly -Phive dependency:tree
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Assembly 1.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-assembly_2.10 ---
[INFO] org.apache.spark:spark-assembly_2.10:pom:1.0.1-SNAPSHOT
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.0.1-SNAPSHOT:compile
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:1.0.4:compile
[INFO] |  |  \- org.apache.hadoop:hadoop-core:jar:1.0.4:compile
[INFO] |  |     +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  |     +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] |  |     +- commons-el:commons-el:jar:1.0:compile
[INFO] |  |     +- hsqldb:hsqldb:jar:1.8.0.10:compile
[INFO] |  |     \- oro:oro:jar:2.0.8:compile
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:runtime
[INFO] |  |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO] |  |  \- commons-httpclient:commons-httpclient:jar:3.1:runtime
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO] |  |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO] |  |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO] |  |  \- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO] |  +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO] |  |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO] |  |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO] |  |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO] |  +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO] |  +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO] |  |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO] |  |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO] |  |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO] |  |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
{code}

If you run
{code}
mvn -pl assembly dependency:tree 
{code}
it includes commons-lang correctly.
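
To see exactly where commons-lang enters or drops out of the tree, the 
dependency plugin's standard include filter helps:
{code}
mvn -pl assembly -Phive dependency:tree -Dincludes=commons-lang
{code}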

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Guoqiang Li
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}

[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-18 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001377#comment-14001377
 ] 

Mridul Muralidharan commented on SPARK-1855:


Did not realize that mail replies to JIRA mails did not get mirrored to JIRA! 
Replicating my mail here:

-- cut and paste --

We don't have 3x replication in Spark :-)
And while a replicated storage level decreases the odds of failure, it does not 
eliminate them (since we are not doing a great job with replication from a 
fault-tolerance point of view anyway).
Replicated levels also take a nontrivial performance hit.

Regards,
Mridul

 Provide memory-and-local-disk RDD checkpointing
 ---

 Key: SPARK-1855
 URL: https://issues.apache.org/jira/browse/SPARK-1855
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 Checkpointing is used to cut long lineage while maintaining fault tolerance. 
 The current implementation is HDFS-based. Using the BlockRDD we can create 
 in-memory-and-local-disk (with replication) checkpoints that are not as 
 reliable as the HDFS-based solution but faster.
 It can help applications that require many iterations.
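
 For contrast, a sketch of the two existing options being weighed (assuming an 
 existing {{rdd}}; the calls are standard Spark 1.0 API):
 {code}
 import org.apache.spark.storage.StorageLevel

 // Fast but only as durable as the executors: replicated memory/disk persistence.
 val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

 // Reliable but slower: a checkpoint written to HDFS, which also cuts lineage.
 sc.setCheckpointDir("hdfs:///tmp/checkpoints")
 rdd.checkpoint()
 {code}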



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001379#comment-14001379
 ] 

Guoqiang Li edited comment on SPARK-1875 at 5/19/14 4:23 AM:
-

[~pwendell], [~matei]
Do you have time to review the code?
https://github.com/apache/spark/pull/820


was (Author: gq):
[~ pwendell], [~ matei]
Do you have time to review the code?
https://github.com/apache/spark/pull/820

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Guoqiang Li
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1

2014-05-18 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001379#comment-14001379
 ] 

Guoqiang Li commented on SPARK-1875:


[~ pwendell], [~ matei]
Do you have time to review the code?
https://github.com/apache/spark/pull/820

 NoClassDefFoundError: StringUtils when building against Hadoop 1
 

 Key: SPARK-1875
 URL: https://issues.apache.org/jira/browse/SPARK-1875
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Guoqiang Li
Priority: Blocker
 Fix For: 1.0.0


 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 
 and Hive enabled, if I go into it and run spark-shell, I get this:
 {code}
 java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
   at org.apache.hadoop.metrics2.lib.MetricMutableStat.<init>(MetricMutableStat.java:59)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:75)
   at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:120)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
   at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
   at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216)
   at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
   at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
   at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:79)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)