Re: Google Summer of Code - ideas

2015-02-24 Thread Xiangrui Meng
Would you be interested in working on MLlib's Python API during the
summer? We want everything implemented in Scala to be usable from both
Java and Python, but we are not there yet. It would be great if
someone is willing to help. -Xiangrui

On Sat, Feb 21, 2015 at 11:24 AM, Manoj Kumar
manojkumarsivaraj...@gmail.com wrote:
 Hello,

 I've been working on the Spark codebase for quite some time now,
 especially on issues related to MLlib, and a very small amount of PySpark
 and SparkSQL (https://github.com/apache/spark/pulls/MechCoder).

 I would like to extend my work with Spark as a Google Summer of Code
 project.
 I want to know if there are specific projects related to MLlib that people
 would like to see. (I notice there is no ideas page for GSoC yet.) There
 are a number of issues related to DecisionTrees, Ensembles, and LDA (in the
 issue tracker) that I find really interesting and that could probably be
 clubbed together into a project, but if the Spark community has anything
 else in mind, I could work on the other issues pre-GSoC and try out
 something new during GSoC.

 Looking forward!
 --
 Godspeed,
 Manoj Kumar,
 http://manojbits.wordpress.com
 http://github.com/MechCoder




RE: spark slave cannot execute without admin permission on windows

2015-02-24 Thread Judy Nash
Update to the thread.

Upon investigation, this is a bug on Windows: Windows does not grant the user
read permission on jar files by default.
I have created a pull request for SPARK-5914
(https://issues.apache.org/jira/browse/SPARK-5914) to grant read permission to
the jar owner (the slave service account in this case). With this fix, the
slave will be able to run without admin permission.
FYI: the master & thrift server work fine with only user permission, so there is
no issue there.
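
As a rough illustration of the kind of change described above (this is a
hypothetical sketch, not the actual SPARK-5914 patch, and the path below is
made up), the owner's read and execute bits can be granted through the
standard java.io.File API:

import java.io.File

object GrantJarPermissions {
  // Grant the file owner (here, the slave service account) read and execute
  // permission on a jar copied into the worker's \work directory, so the
  // executor can open it without admin rights.
  def grantOwnerReadExecute(jar: File): Unit = {
    jar.setReadable(true, true)   // second argument: ownerOnly = true
    jar.setExecutable(true, true)
  }

  def main(args: Array[String]): Unit = {
    // Made-up path for illustration only.
    grantOwnerReadExecute(new File("""C:\spark\work\app-20150224\0\example.jar"""))
  }
}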

From: Judy Nash [mailto:judyn...@exchange.microsoft.com]
Sent: Thursday, February 19, 2015 12:26 AM
To: Akhil Das; dev@spark.apache.org
Cc: u...@spark.apache.org
Subject: RE: spark slave cannot execute without admin permission on windows

+ dev mailing list

If this is supposed to work, is there a regression then?

The Spark core code shows that the permission on the file copied to \work is
set to a+x at line 442 of Utils.scala
(https://github.com/apache/spark/blob/b271c265b742fa6947522eda4592e9e6a7fd1f3a/core/src/main/scala/org/apache/spark/util/Utils.scala).
The example jar I used had all permissions, including Read & Execute, prior to
spark-submit:
[screenshot: jar file permissions before spark-submit]
However, after being copied to the worker node's \work folder, only limited
permissions are left on the jar, with no execute right:
[screenshot: jar file permissions after copy to the worker's \work folder]

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, February 18, 2015 10:40 PM
To: Judy Nash
Cc: u...@spark.apache.org
Subject: Re: spark slave cannot execute without admin permission on windows

You don't need admin permission; just make sure all those jars have execute
permission (read/write access).

Thanks
Best Regards

On Thu, Feb 19, 2015 at 11:30 AM, Judy Nash 
judyn...@exchange.microsoft.com wrote:
Hi,

Is it possible to configure Spark to run without admin permission on Windows?

My current setup runs master & slave successfully with admin permission.
However, if I downgrade the permission level from admin to user, SparkPi fails
with the following exception on the slave node:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to
stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task
0.3 in stage 0.0 (TID 9, workernode0.jnashsparkcurr2.d10.internal.cloudapp.net):
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi$$anonfun$1

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)

Upon investigation, it appears that the SparkPi jar under
spark_home\worker\appname\*.jar does not have execute permission set, causing
Spark to be unable to find the class.

Advice would be very much appreciated.

Thanks,
Judy




Does Spark delete shuffle files of lost executor in running system(on YARN)?

2015-02-24 Thread nitin
Hi All,

I noticed that Spark doesn't delete the local shuffle files of a lost executor
in a running system (running in yarn-client mode). For a long-running system,
this might fill up disk space in case of frequent executor failures. Can we
delete these files when the executor loss is reported to the driver?

Thanks
-Nitin







Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Mike Hynes
I don't see any version flag for /usr/bin/jar, but I think I see the
problem now; the openjdk version is 7, but javac -version gives
1.6.0_34; so spark was compiled with java 6 despite the system using
jre 1.7.
Thanks for the sanity check! Now I just need to find out why javac is
downgraded on the system..

On 2/24/15, Sean Owen so...@cloudera.com wrote:
 So you mean that the script is checking for this error, and takes it
 as a sign that you compiled with java 6.

 Your command seems to confirm that reading the assembly jar does fail
 on your system though. What version does the jar command show? Are you
 sure you don't have JRE 7 but JDK 6 installed?

 On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote:
 ./bin/compute-classpath.sh fails with error:

 $ jar -tf
 assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar
 nonexistent/class/path
 java.util.zip.ZipException: invalid CEN header (bad signature)
 at java.util.zip.ZipFile.open(Native Method)
 at java.util.zip.ZipFile.<init>(ZipFile.java:132)
 at java.util.zip.ZipFile.<init>(ZipFile.java:93)
 at sun.tools.jar.Main.list(Main.java:997)
 at sun.tools.jar.Main.run(Main.java:242)
 at sun.tools.jar.Main.main(Main.java:1167)

 However, I both compiled the distribution and am running spark with Java
 1.7;
 $ java -version
 java version 1.7.0_75
 OpenJDK Runtime Environment (IcedTea 2.5.4)
 (7u75-2.5.4-1~trusty1)
 OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
 on a system running Ubuntu:
 $ uname -srpov
 Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
 x86_64 GNU/Linux
 $ uname -srpo
 Linux 3.13.0-44-generic x86_64 GNU/Linux

 This problem was reproduced on Arch Linux:

 $ uname -srpo
 Linux 3.18.5-1-ARCH x86_64 GNU/Linux
 with
 $ java -version
 java version 1.7.0_75
 OpenJDK Runtime Environment (IcedTea 2.5.4) (Arch Linux build
 7.u75_2.5.4-1-x86_64)
 OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

 In both of these cases, the problem is not the java versioning;
 neither system even has a java 6 installation. This seems like a false
 positive to me in compute-classpath.sh.

 When I comment out the relevant lines in compute-classpath.sh, the
 scripts start-{master,slaves,...}.sh all run fine, and I have no
 problem launching applications.

 Could someone please offer some insight into this issue?

 Thanks,
 Mike





-- 
Thanks,
Mike




Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread shane knapp
it's not downgraded, it's your /etc/alternatives setup that's causing this.

you can update all of those entries by executing the following commands (as
root):

update-alternatives --install /usr/bin/java java /usr/java/latest/bin/java 1
update-alternatives --install /usr/bin/javah javah /usr/java/latest/bin/javah 1
update-alternatives --install /usr/bin/javac javac /usr/java/latest/bin/javac 1
update-alternatives --install /usr/bin/jar jar /usr/java/latest/bin/jar 1

(i have the latest jdk installed in /usr/java/ with a /usr/java/latest/
symlink pointing to said jdk's dir)

On Tue, Feb 24, 2015 at 3:32 PM, Mike Hynes 91m...@gmail.com wrote:

 I don't see any version flag for /usr/bin/jar, but I think I see the
 problem now; the openjdk version is 7, but javac -version gives
 1.6.0_34; so spark was compiled with java 6 despite the system using
 jre 1.7.
 Thanks for the sanity check! Now I just need to find out why javac is
 downgraded on the system..

 On 2/24/15, Sean Owen so...@cloudera.com wrote:
  So you mean that the script is checking for this error, and takes it
  as a sign that you compiled with java 6.
 
  Your command seems to confirm that reading the assembly jar does fail
  on your system though. What version does the jar command show? Are you
  sure you don't have JRE 7 but JDK 6 installed?
 
  On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote:
  ./bin/compute-classpath.sh fails with error:
 
  $ jar -tf
  assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar
  nonexistent/class/path
  java.util.zip.ZipException: invalid CEN header (bad signature)
  at java.util.zip.ZipFile.open(Native Method)
  at java.util.zip.ZipFile.<init>(ZipFile.java:132)
  at java.util.zip.ZipFile.<init>(ZipFile.java:93)
  at sun.tools.jar.Main.list(Main.java:997)
  at sun.tools.jar.Main.run(Main.java:242)
  at sun.tools.jar.Main.main(Main.java:1167)
 
  However, I both compiled the distribution and am running spark with Java 1.7;
  $ java -version
  java version 1.7.0_75
  OpenJDK Runtime Environment (IcedTea 2.5.4)
  (7u75-2.5.4-1~trusty1)
  OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
  on a system running Ubuntu:
  $ uname -srpov
  Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
  x86_64 GNU/Linux
  $ uname -srpo
  Linux 3.13.0-44-generic x86_64 GNU/Linux
 
  This problem was reproduced on Arch Linux:
 
  $ uname -srpo
  Linux 3.18.5-1-ARCH x86_64 GNU/Linux
  with
  $ java -version
  java version 1.7.0_75
  OpenJDK Runtime Environment (IcedTea 2.5.4) (Arch Linux build
  7.u75_2.5.4-1-x86_64)
  OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
 
  In both of these cases, the problem is not the java versioning;
  neither system even has a java 6 installation. This seems like a false
  positive to me in compute-classpath.sh.
 
  When I comment out the relevant lines in compute-classpath.sh, the
  scripts start-{master,slaves,...}.sh all run fine, and I have no
  problem launching applications.
 
  Could someone please offer some insight into this issue?
 
  Thanks,
  Mike
 
 
 


 --
 Thanks,
 Mike




Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all,

The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could help vote for Spark talks
so that Spark has a good showing at this event. You can make three votes on
each track. Below I've listed 3 talks that are important to Spark's
roadmap. Please give 3 votes to each of the following talks.

Committer Track: Lessons from Running Ultra Large Scale Spark Workloads on
Hadoop
https://hadoopsummit.uservoice.com/forums/283260-committer-track/suggestions/7074016

Data Science track: DataFrames: large-scale data science on Hadoop data
with Spark
https://hadoopsummit.uservoice.com/forums/283261-data-science-and-hadoop/suggestions/7074147

Future of Hadoop track: Online Approximate OLAP in SparkSQL
https://hadoopsummit.uservoice.com/forums/283266-the-future-of-apache-hadoop/suggestions/7074424


Thanks!


Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-24 Thread Andrew Ash
I think a cheap way to repartition to a higher partition count without
shuffle would be valuable too.  Right now you can choose whether to execute
a shuffle when going down in partition count, but going up in partition
count always requires a shuffle.  For the purpose of making smaller partitions
so that .toLocalIterator is more efficient, no shuffle should be needed when
increasing the partition count.

Filed as https://issues.apache.org/jira/browse/SPARK-5997
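
For readers following the thread, here is a minimal sketch of the current RDD
API behavior being discussed (Spark 1.x; the data, partition counts, and app
name are arbitrary placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch"))
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

    // Going down in partition count can skip the shuffle...
    val fewer = rdd.coalesce(10, shuffle = false)

    // ...but going up always forces one today.
    val more = rdd.repartition(1000)

    // toLocalIterator pulls one partition at a time to the driver, so a
    // single oversized partition can still exhaust driver memory.
    more.toLocalIterator.foreach(_ => ())

    sc.stop()
  }
}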

On Wed, Feb 18, 2015 at 3:21 PM, Mingyu Kim m...@palantir.com wrote:

 Another alternative would be to compress the partition in memory in a
 streaming fashion instead of calling .toArray on the iterator. Would it be
 an easier mitigation to the problem? Or, is it hard to compress the rows
 one by one without materializing the full partition in memory using the
 compression algo Spark uses currently?

 Mingyu





 On 2/18/15, 1:01 PM, Imran Rashid iras...@cloudera.com wrote:

 This would be pretty tricky to do -- the issue is that right now
 sparkContext.runJob has you pass in a function from a partition to *one*
 result object that gets serialized and sent back: Iterator[T] => U, and
 that idea is baked pretty deep into a lot of the internals, DAGScheduler,
 Task, Executors, etc.
 
 Maybe another possibility worth considering: should we make it easy to go
 from N partitions to 2N partitions (or any other multiple obviously)
 without requiring a shuffle?  For that matter, you should also be able to
 go from 2N to N without a shuffle as well.  That change is also somewhat
 involved, though.
 
 Both are in theory possible, but I imagine they'd need really compelling
 use cases.
 
 An alternative would be to write your RDD to some other data store (eg,
 hdfs) which has better support for reading data in a streaming fashion,
 though you would probably be unhappy with the overhead.
 
 
 
 On Wed, Feb 18, 2015 at 9:09 AM, Andrew Ash and...@andrewash.com wrote:
 
  Hi Spark devs,
 
  I'm creating a streaming export functionality for RDDs and am having some
  trouble with large partitions.  The RDD.toLocalIterator() call pulls over a
  partition at a time to the driver, and then streams the RDD out from that
  partition before pulling in the next one.  When you have large partitions
  though, you can OOM the driver, especially when multiple of these exports
  are happening in the same SparkContext.
 
  One idea I had was to repartition the RDD so partitions are smaller, but
  it's hard to know a priori what the partition count should be, and I'd like
  to avoid paying the shuffle cost if possible -- I think repartition to a
  higher partition count forces a shuffle.
 
  Is it feasible to rework this so the executor -> driver transfer in
  .toLocalIterator is a steady stream rather than a partition at a time?
 
  Thanks!
  Andrew
 




PySpark SPARK_CLASSPATH doesn't distribute jars to executors

2015-02-24 Thread Michael Nazario
Has anyone experienced a problem with the SPARK_CLASSPATH not distributing jars 
for PySpark? I have a detailed description of what I tried in the ticket below, 
and this seems like a problem that is not a configuration problem. The only 
other case I can think of is that configuration changed between Spark 1.1.1 and 
Spark 1.2.1 about distributing jars for PySpark.

https://issues.apache.org/jira/browse/SPARK-5977

Thanks,
Michael


Re: Have Friedman's glmnet algo running in Spark

2015-02-24 Thread Joseph Bradley
Hi Mike,

I'm not aware of a standard big dataset, but there are a number available:
* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
instances but not # features):
www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram
features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best
option I know of.  The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM, m...@mbowles.com wrote:

 Joseph,
 Thanks for your reply.  We'll take the steps you suggest - generate some
 timing comparisons and post them in the GLMNET JIRA with a link from the
 OWLQN JIRA.

 We've got the regression version of GLMNET programmed.  The regression
 version only requires a pass through the data each time the active set of
 coefficients changes.  That's usually less than or equal to the number of
 decrements in the penalty coefficient (typical default = 100).  The
 intermediate iterations can be done using results of previous passes
 through the full data set.  We're expecting the number of data passes will
 be independent of either number of rows or columns in the data set.  We're
 eager to demonstrate this scaling.  Do you have any suggestions regarding
 data sets for large scale regression problems?  It would be nice to
 demonstrate scaling for both number of rows and number of columns.

 Thanks for your help.
 Mike

 -Original Message-
 *From:* Joseph Bradley [mailto:jos...@databricks.com]
 *Sent:* Sunday, February 22, 2015 06:48 PM
 *To:* m...@mbowles.com
 *Cc:* dev@spark.apache.org
 *Subject:* Re: Have Friedman's glmnet algo running in Spark

 Hi Mike, glmnet has definitely been very successful, and it would be great
 to see how we can improve optimization in MLlib! There is some related work
 ongoing; here are the JIRAs:
 GLMNET implementation in Spark
 LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
 The GLMNET JIRA has actually been closed in favor of the latter JIRA.
 However, if you're getting good results in your experiments, could you
 please post them on the GLMNET JIRA and link them from the other JIRA? If
 it's faster and more scalable, that would be great to find out. As far as
 where the code should go and the APIs, that can be discussed on the JIRA. I
 hope this helps, and I'll keep an eye out for updates on the JIRAs!
 Joseph

 On Thu, Feb 19, 2015 at 10:59 AM,  wrote:
  Dev List,
  A couple of colleagues and I have gotten several versions of glmnet algo
  coded and running on Spark RDD. glmnet algo
  (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
  generating coefficient paths solving penalized regression with elastic net
  penalties. The algorithm runs fast by taking an approach that generates
  solutions for a wide variety of penalty parameters. We're able to integrate
  into Mllib class structure a couple of different ways. The algorithm may
  fit better into the new pipeline structure since it naturally returns a
  multitude of models (corresponding to different values of penalty
  parameters). That appears to fit better into pipeline than Mllib linear
  regression (for example).
  We've got regression running with the speed optimizations that Friedman
  recommends. We'll start working on the logistic regression version next.
  We're eager to make the code available as open source and would like to
  get some feedback about how best to do that. Any thoughts?
  Mike Bowles.




Re: Have Friedman's glmnet algo running in Spark

2015-02-24 Thread mike
 Joseph,
Thanks for your reply. We'll take the steps you suggest - generate some timing 
comparisons and post them in the GLMNET JIRA with a link from the OWLQN JIRA.

We've got the regression version of GLMNET programmed. The regression version 
only requires a pass through the data each time the active set of coefficients 
changes. That's usually less than or equal to the number of decrements in the 
penalty coefficient (typical default = 100). The intermediate iterations can be 
done using results of previous passes through the full data set. We're 
expecting the number of data passes will be independent of either number of 
rows or columns in the data set. We're eager to demonstrate this scaling. Do 
you have any suggestions regarding data sets for large scale regression 
problems? It would be nice to demonstrate scaling for both number of rows and 
number of columns.

Thanks for your help.
Mike
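
For readers unfamiliar with the algorithm being described, below is a rough,
local-only illustration of the core glmnet-style update (cyclical coordinate
descent with soft-thresholding over a decreasing penalty sequence, assuming
standardized features). It sketches the method from Friedman's paper, not the
Spark code discussed in this thread:

object CoordinateDescentSketch {
  // Soft-threshold operator: S(z, g) = sign(z) * max(|z| - g, 0)
  def softThreshold(z: Double, g: Double): Double =
    math.signum(z) * math.max(math.abs(z) - g, 0.0)

  // Lasso coefficient path via cyclical coordinate descent on a small local
  // data set; assumes each feature column has mean 0 and unit variance.
  def lassoPath(x: Array[Array[Double]], y: Array[Double],
                lambdas: Seq[Double], sweeps: Int = 50): Seq[Array[Double]] = {
    val n = x.length
    val p = x.head.length
    val beta = Array.fill(p)(0.0)
    lambdas.map { lambda =>
      for (_ <- 0 until sweeps; j <- 0 until p) {
        // Correlation of feature j with the partial residual that excludes j.
        val zj = (0 until n).map { i =>
          val partial = (0 until p).filter(_ != j).map(k => x(i)(k) * beta(k)).sum
          x(i)(j) * (y(i) - partial)
        }.sum / n
        beta(j) = softThreshold(zj, lambda)
      }
      // Warm start: the next (smaller) lambda reuses these coefficients.
      beta.clone()
    }
  }
}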

-Original Message-
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Sunday, February 22, 2015 06:48 PM
To: m...@mbowles.com
Cc: dev@spark.apache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

Hi Mike,
glmnet has definitely been very successful, and it would be great to see
how we can improve optimization in MLlib! There is some related work
ongoing; here are the JIRAs:
GLMNET implementation in Spark
LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
The GLMNET JIRA has actually been closed in favor of the latter JIRA.
However, if you're getting good results in your experiments, could you
please post them on the GLMNET JIRA and link them from the other JIRA? If
it's faster and more scalable, that would be great to find out. As far as
where the code should go and the APIs, that can be discussed on the JIRA.
I hope this helps, and I'll keep an eye out for updates on the JIRAs!
Joseph

On Thu, Feb 19, 2015 at 10:59 AM,  wrote:
 Dev List,
 A couple of colleagues and I have gotten several versions of glmnet algo
 coded and running on Spark RDD. glmnet algo
 (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
 generating coefficient paths solving penalized regression with elastic net
 penalties. The algorithm runs fast by taking an approach that generates
 solutions for a wide variety of penalty parameters. We're able to integrate
 into Mllib class structure a couple of different ways. The algorithm may
 fit better into the new pipeline structure since it naturally returns a
 multitude of models (corresponding to different values of penalty
 parameters). That appears to fit better into pipeline than Mllib linear
 regression (for example).
 We've got regression running with the speed optimizations that Friedman
 recommends. We'll start working on the logistic regression version next.
 We're eager to make the code available as open source and would like to
 get some feedback about how best to do that. Any thoughts?
 Mike Bowles.


[ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Mike Hynes
./bin/compute-classpath.sh fails with error:

$ jar -tf 
assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar
nonexistent/class/path
java.util.zip.ZipException: invalid CEN header (bad signature)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:132)
at java.util.zip.ZipFile.<init>(ZipFile.java:93)
at sun.tools.jar.Main.list(Main.java:997)
at sun.tools.jar.Main.run(Main.java:242)
at sun.tools.jar.Main.main(Main.java:1167)

However, I both compiled the distribution and am running spark with Java 1.7;
$ java -version
java version 1.7.0_75
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
on a system running Ubuntu:
$ uname -srpov
Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
x86_64 GNU/Linux
$ uname -srpo
Linux 3.13.0-44-generic x86_64 GNU/Linux

This problem was reproduced on Arch Linux:

$ uname -srpo
Linux 3.18.5-1-ARCH x86_64 GNU/Linux
with
$ java -version
java version 1.7.0_75
OpenJDK Runtime Environment (IcedTea 2.5.4) (Arch Linux build
7.u75_2.5.4-1-x86_64)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

In both of these cases, the problem is not the java versioning;
neither system even has a java 6 installation. This seems like a false
positive to me in compute-classpath.sh.

When I comment out the relevant lines in compute-classpath.sh, the
scripts start-{master,slaves,...}.sh all run fine, and I have no
problem launching applications.

Could someone please offer some insight into this issue?

Thanks,
Mike




Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Sean Owen
So you mean that the script is checking for this error, and takes it
as a sign that you compiled with java 6.

Your command seems to confirm that reading the assembly jar does fail
on your system though. What version does the jar command show? Are you
sure you don't have JRE 7 but JDK 6 installed?

On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote:
 ./bin/compute-classpath.sh fails with error:

 $ jar -tf 
 assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar
 nonexistent/class/path
 java.util.zip.ZipException: invalid CEN header (bad signature)
 at java.util.zip.ZipFile.open(Native Method)
 at java.util.zip.ZipFile.<init>(ZipFile.java:132)
 at java.util.zip.ZipFile.<init>(ZipFile.java:93)
 at sun.tools.jar.Main.list(Main.java:997)
 at sun.tools.jar.Main.run(Main.java:242)
 at sun.tools.jar.Main.main(Main.java:1167)

 However, I both compiled the distribution and am running spark with Java 1.7;
 $ java -version
 java version 1.7.0_75
 OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
 OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
 on a system running Ubuntu:
 $ uname -srpov
 Linux 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
 x86_64 GNU/Linux
 $ uname -srpo
 Linux 3.13.0-44-generic x86_64 GNU/Linux

 This problem was reproduced on Arch Linux:

 $ uname -srpo
 Linux 3.18.5-1-ARCH x86_64 GNU/Linux
 with
 $ java -version
 java version 1.7.0_75
 OpenJDK Runtime Environment (IcedTea 2.5.4) (Arch Linux build
 7.u75_2.5.4-1-x86_64)
 OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

 In both of these cases, the problem is not the java versioning;
 neither system even has a java 6 installation. This seems like a false
 positive to me in compute-classpath.sh.

 When I comment out the relevant lines in compute-classpath.sh, the
 scripts start-{master,slaves,...}.sh all run fine, and I have no
 problem launching applications.

 Could someone please offer some insight into this issue?

 Thanks,
 Mike






Re: PySpark SPARK_CLASSPATH doesn't distribute jars to executors

2015-02-24 Thread Denny Lee
Can you try extraClassPath or driver-class-path and see if that helps with
the distribution?
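
For reference, the configuration keys behind that suggestion can also be set
programmatically or in spark-defaults.conf; a minimal sketch follows (the jar
path is a placeholder, and note that these keys only prepend paths that must
already exist on each node; they do not distribute jars by themselves, so
whether they address SPARK-5977 is exactly what this thread is asking):

import org.apache.spark.{SparkConf, SparkContext}

object ExtraClassPathSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("extra-classpath-sketch")
      // Placeholder path; must be present on the driver and on every worker.
      .set("spark.driver.extraClassPath", "/opt/libs/custom-dep.jar")
      .set("spark.executor.extraClassPath", "/opt/libs/custom-dep.jar")
    val sc = new SparkContext(conf)
    sc.stop()
  }
}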
On Tue, Feb 24, 2015 at 14:54 Michael Nazario mnaza...@palantir.com wrote:

 Has anyone experienced a problem with the SPARK_CLASSPATH not distributing
 jars for PySpark? I have a detailed description of what I tried in the
 ticket below, and this seems like a problem that is not a configuration
 problem. The only other case I can think of is that configuration changed
 between Spark 1.1.1 and Spark 1.2.1 about distributing jars for PySpark.

 https://issues.apache.org/jira/browse/SPARK-5977

 Thanks,
 Michael