Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-23 Thread Williams, Ken


> From: Williams, Ken <ken.willi...@windlogics.com>
> Date: Thursday, March 19, 2015 at 10:59 AM
> To: Spark list <user@spark.apache.org>
> Subject: JAVA_HOME problem with upgrade to 1.3.0
>
> […]
> Finally, I go and check the YARN app master’s web interface (since the job is 
> shown, I know it at least made it that far), and the
> only logs it shows are these:
>
> Log Type: stderr
> Log Length: 61
> /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory
>
> Log Type: stdout
> Log Length: 0

I’m still interested in a solution to this issue if anyone has comments.  I 
also posted to SO if that’s more convenient:


http://stackoverflow.com/questions/29170280/java-home-error-with-upgrade-to-spark-1-3-0

Thanks,

  -Ken






Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Williams, Ken

> From: Ted Yu <yuzhih...@gmail.com>
> Date: Thursday, March 19, 2015 at 11:05 AM
>
> JAVA_HOME, an environment variable, should be defined on the node where 
> appattempt_1420225286501_4699_02 ran.

Has this behavior changed in 1.3.0 since 1.2.1 though?  Using 1.2.1 and making 
no other changes, the job completes fine.

I do have JAVA_HOME set in the hadoop config files on all the nodes of the 
cluster:

% grep JAVA_HOME /etc/hadoop/conf/*.sh
/etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
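
One experiment I can still run under 1.2.1 (where the containers do launch) is to
ask the executors what JAVA_HOME they actually see. A rough spark-shell sketch
(nothing in it is specific to our setup):

  // Report the JAVA_HOME value each executor host actually sees
  // (String.valueOf turns an unset variable into the string "null").
  sc.parallelize(1 to 1000, 100)
    .map(_ => (java.net.InetAddress.getLocalHost.getHostName,
               String.valueOf(System.getenv("JAVA_HOME"))))
    .distinct()
    .collect()
    .foreach(println)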

 -Ken






JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Williams, Ken
I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 
1.3.0, so I changed my `build.sbt` like so:

   -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % 
“provided"
   +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % 
"provided"

then rebuild the `assembly` jar and submit it:

   HADOOP_CONF_DIR=/etc/hadoop/conf \
   spark-submit \
     --driver-class-path=/etc/hbase/conf \
     --conf spark.hadoop.validateOutputSpecs=false \
     --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --deploy-mode=cluster \
     --master=yarn \
     --class=TestObject \
     --num-executors=54 \
     target/scala-2.11/myapp-assembly-1.2.jar

The job fails to submit, with the following exception in the terminal:

15/03/19 10:30:07 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1420225286501_4699 failed 2 times due to AM Container for appattempt_1420225286501_4699_02 exited with exitCode: 127 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Finally, I go and check the YARN app master’s web interface (since the job is 
shown, I know it at least made it that far), and the only logs it shows are 
these:

Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

Log Type: stdout
Log Length: 0

I’m not sure how to interpret that – is '{{JAVA_HOME}}' a literal (including 
the brackets) that’s somehow making it into a script?  Is this coming from the 
worker nodes or the driver?  Anything I can do to experiment & troubleshoot?

  -Ken







RE: Build times for Spark

2014-04-25 Thread Williams, Ken
I am indeed, but it's a pretty fast NFS.  I don't have any SSD I can use, but I 
could try to use local disk to see what happens.

For me, a large portion of the time seems to be spent on lines like "Resolving 
org.fusesource.jansi#jansi;1.4 ..." or similar.  Is this going out to find
Maven resources?  Any way to tell it to just use my local ~/.m2 repository 
instead when the resource already exists there?  Sometimes I even get sporadic 
errors like this:

  [info] Resolving org.apache.hadoop#hadoop-yarn;2.2.0 ...
  [error] SERVER ERROR: Bad Gateway url=http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-server/2.2.0/hadoop-yarn-server-2.2.0.jar
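
In the meantime, one thing I may try is pointing resolution at my local Maven
repository. A rough sketch of what I'd add to the build definition (whether the
Spark build picks up extra settings from a local *.sbt file or needs them in
SparkBuild.scala is my guess, not something I've verified):

  // Let Ivy check the local Maven repository (~/.m2/repository) before going
  // out to the remote repos, and prefer cached artifacts where possible.
  resolvers += Resolver.mavenLocal
  offline := true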


-Ken

From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Friday, April 25, 2014 4:31 PM
To: user@spark.apache.org
Subject: Re: Build times for Spark

Are you by any chance building this on NFS? As far as I know, the build is
severely bottlenecked by filesystem calls during assembly (each class file in
each dependency gets an fstat call or something like that).  That is partly why
building from, say, a local ext4 filesystem or an SSD is much faster irrespective
of memory / CPU.

Thanks
Shivaram

On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
You can always increase the sbt memory by setting

export JAVA_OPTS="-Xmx10g"



Thanks
Best Regards

On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken <ken.willi...@windlogics.com> wrote:
No, I haven't done any config for SBT.  Is there somewhere you might be able to 
point me toward for how to do that?

-Ken

From: Josh Rosen [mailto:rosenvi...@gmail.com]
Sent: Friday, April 25, 2014 3:27 PM
To: user@spark.apache.org
Subject: Re: Build times for Spark

Did you configure SBT to use the extra memory?

On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken <ken.willi...@windlogics.com> wrote:
I've cloned the github repo and I'm building Spark on a pretty beefy machine 
(24 CPUs, 78GB of RAM) and it takes a pretty long time.

For instance, today I did a 'git pull' for the first time in a week or two, and 
then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of 
CPU time).  After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true 
sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU.

Is that typical?  Or does that indicate some setup problem in my environment?

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com









RE: Build times for Spark

2014-04-25 Thread Williams, Ken
No, I haven’t done any config for SBT.  Is there somewhere you might be able to 
point me toward for how to do that?

-Ken

From: Josh Rosen [mailto:rosenvi...@gmail.com]
Sent: Friday, April 25, 2014 3:27 PM
To: user@spark.apache.org
Subject: Re: Build times for Spark

Did you configure SBT to use the extra memory?

On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken <ken.willi...@windlogics.com> wrote:
I’ve cloned the github repo and I’m building Spark on a pretty beefy machine 
(24 CPUs, 78GB of RAM) and it takes a pretty long time.

For instance, today I did a ‘git pull’ for the first time in a week or two, and 
then doing ‘sbt/sbt assembly’ took 43 minutes of wallclock time (88 minutes of 
CPU time).  After that, I did ‘SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true 
sbt/sbt assembly’ and that took 25 minutes wallclock, 73 minutes CPU.

Is that typical?  Or does that indicate some setup problem in my environment?

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com







Build times for Spark

2014-04-25 Thread Williams, Ken
I've cloned the github repo and I'm building Spark on a pretty beefy machine 
(24 CPUs, 78GB of RAM) and it takes a pretty long time.

For instance, today I did a 'git pull' for the first time in a week or two, and 
then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of 
CPU time).  After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true 
sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU.

Is that typical?  Or does that indicate some setup problem in my environment?

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com






RE: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
> -Original Message-
> From: Marcelo Vanzin [mailto:van...@cloudera.com]
> Hi Ken,
>
> On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken <ken.willi...@windlogics.com> wrote:
> > I haven't figured out how to let the hostname default to the host
> > mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop
> > command-line tools do, but that's not so important.
>
> Try adding "/etc/hadoop/conf" to SPARK_CLASSPATH.

It looks like I already had my config set up properly, but I didn't understand 
the URL syntax - the following works:

  sc.textFile("hdfs:///user/kwilliams/dat/part-m-0")

In other words, just omit the hostname between the 2nd and 3rd slash of the URL.
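
In spark-shell terms, a minimal sketch of the two equivalent forms (the explicit
hostname below is just a placeholder, not our real namenode):

  // Host omitted: Spark falls back to the default filesystem named by the
  // Hadoop config on the classpath (fs.defaultFS).
  val noHost   = sc.textFile("hdfs:///user/kwilliams/dat/part-m-0")
  // Placeholder hostname, shown only for comparison.
  val withHost = sc.textFile("hdfs://namenode.example.com/user/kwilliams/dat/part-m-0")
  noHost.count()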

 -Ken






RE: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
I figured it out - I should be using textFile(...), not hadoopFile(...).  And 
my HDFS URL should include the host:

  hdfs://host/user/kwilliams/corTable2/part-m-0

I haven't figured out how to let the hostname default to the host mentioned in 
our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but 
that's not so important.
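
If I do want to see where that default would come from, I gather it can be read
straight off the Hadoop configuration the shell loaded; a one-line sketch (using
the Hadoop 2 property name):

  // Which filesystem does "hdfs:///..." fall back to?
  sc.hadoopConfiguration.get("fs.defaultFS")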

 -Ken


> -Original Message-
> From: Williams, Ken [mailto:ken.willi...@windlogics.com]
> Sent: Monday, April 21, 2014 2:04 PM
> To: Spark list
> Subject: Problem connecting to HDFS in Spark shell
> 
> I'm trying to get my feet wet with Spark.  I've done some simple stuff in the
> shell in standalone mode, and now I'm trying to connect to HDFS resources,
> but I'm running into a problem.
> 
> I synced to git's master branch (c399baa - "SPARK-1456 Remove view bounds on
> Ordered in favor of a context bound on Ordering. (3 days ago) <Michael Armbrust>")
> and built like so:
> 
> SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
> 
> This created various jars in various places, including these (I think):
> 
>    ./examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar
>    ./tools/target/scala-2.10/spark-tools-assembly-1.0.0-SNAPSHOT.jar
>    ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar
> 
> In `conf/spark-env.sh`, I added this (actually before I did the assembly):
> 
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> 
> Now I fire up the shell (bin/spark-shell) and try to grab data from HDFS, and
> get the following exception:
> 
> scala> var hdf = sc.hadoopFile("hdfs:///user/kwilliams/dat/part-m-0")
> hdf: org.apache.spark.rdd.RDD[(Nothing, Nothing)] = HadoopRDD[0] at hadoopFile at <console>:12
> 
> scala> hdf.count()
> java.lang.RuntimeException: java.lang.InstantiationException
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
> at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:209)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:207)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1064)
> at org.apache.spark.rdd.RDD.count(RDD.scala:806)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
> at $iwC$$iwC$$iwC.<init>(<console>:20)
> at $iwC$$iwC.<init>(<console>:22)
> at $iwC.<init>(<console>:24)
> at <init>(<console>:26)
> at .<init>(<console>:30)
> at .<clinit>()
> at .<init>(<console>:7)
> at .<clinit>()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
> at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
> at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
> at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)

Problem connecting to HDFS in Spark shell

2014-04-21 Thread Williams, Ken
I'm trying to get my feet wet with Spark.  I've done some simple stuff in the 
shell in standalone mode, and now I'm trying to connect to HDFS resources, but 
I'm running into a problem.

I synced to git's master branch (c399baa - "SPARK-1456 Remove view bounds on
Ordered in favor of a context bound on Ordering. (3 days ago) <Michael Armbrust>")
and built like so:

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

This created various jars in various places, including these (I think):

   ./examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar
   ./tools/target/scala-2.10/spark-tools-assembly-1.0.0-SNAPSHOT.jar
   ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar

In `conf/spark-env.sh`, I added this (actually before I did the assembly):

export HADOOP_CONF_DIR=/etc/hadoop/conf

Now I fire up the shell (bin/spark-shell) and try to grab data from HDFS, and
get the following exception:

scala> var hdf = sc.hadoopFile("hdfs:///user/kwilliams/dat/part-m-0")
hdf: org.apache.spark.rdd.RDD[(Nothing, Nothing)] = HadoopRDD[0] at hadoopFile at <console>:12

scala> hdf.count()
java.lang.RuntimeException: java.lang.InstantiationException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:209)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:207)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1064)
at org.apache.spark.rdd.RDD.count(RDD.scala:806)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>()
at .<init>(<console>:7)
at .<clinit>()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
Caused by: java.lang.InstantiationException
at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
... 41 more


Is this recognizable to anyone as a build problem, or a config problem, or 
anything?  Failing that, any way to get more information about where in the 
process it's failing?
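
For what it's worth, here's a rough sketch of the two things I plan to try next,
on the guess that the InstantiationException just means hadoopFile was never
given a concrete InputFormat class to instantiate:

  // Either let textFile pick TextInputFormat for me, or spell out the
  // key/value/InputFormat types so there is a real class to instantiate.
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  val viaTextFile   = sc.textFile("hdfs:///user/kwilliams/dat/part-m-0")
  val viaHadoopFile = sc.hadoopFile[LongWritable, Text, TextInputFormat](
    "hdfs:///user/kwilliams/dat/part-m-0")
  viaTextFile.count()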

Thanks.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com




