Re: how to save RDD partitions in different folders?

2014-04-07 Thread dmpour23
Can you provide an example?
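For reference, one commonly used approach (a sketch, not from this thread; paths, class and app names are placeholders) is a custom MultipleTextOutputFormat that routes each key's records into its own sub-directory:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

// Writes records for key "a" under out/a/, key "b" under out/b/, and so on.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name          // e.g. a/part-00000
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()                 // keep only the value in the output files
}

object SaveByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "save-by-key")
    val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2"), ("a", "3")))
    pairs
      .partitionBy(new HashPartitioner(2))   // group each key's records together
      .saveAsHadoopFile("out", classOf[String], classOf[String], classOf[KeyBasedOutput])
    sc.stop()
  }
}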



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754p3823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


hang on sorting operation

2014-04-07 Thread Stuart Zakon
I am seeing a small standalone cluster (master, slave) hang when I reach a 
certain memory threshold, but I cannot detect how to configure memory to avoid 
this.
I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this 
allocated, but it does not help.
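For reference, SPARK_DAEMON_MEMORY only sizes the standalone master/worker daemons themselves; the executor JVMs that actually run tasks are sized by spark.executor.memory, which can also be set from PySpark's SparkConf. A minimal Scala sketch (not from the thread; the master URL and sizes are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory controls the heap of each executor that runs tasks;
// SPARK_DAEMON_MEMORY only affects the master and worker daemon processes.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("phrase-counts")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)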

The reduce is by key to get the counts by key:
        rdd = sc.parallelize(self.phrases)

        # do a distributed count using reduceByKey
        counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

        # reverse the (key, count) pairs into (count, key) and then sort in 
descending order
        sorted = counts.map(lambda (key, count): (count, key)).sortByKey(False)

Below is the log to the point of hanging:

14/04/06 19:39:15 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 
(PairwiseRDD[2] at reduceByKey)
14/04/06 19:39:15 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/04/06 19:39:15 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkExecutor@localhost:64370/user/Executor#-2031138316] with 
ID 0
14/04/06 19:39:15 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 
0: localhost (PROCESS_LOCAL)
14/04/06 19:39:15 INFO TaskSetManager: Serialized task 1.0:0 as 10417848 bytes 
in 18 ms
14/04/06 19:39:15 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor 
0: localhost (PROCESS_LOCAL)
14/04/06 19:39:15 INFO TaskSetManager: Serialized task 1.0:1 as 10571697 bytes 
in 13 ms
14/04/06 19:39:15 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
block manager localhost:64375 with 294.9 MB RAM
14/04/06 19:39:16 INFO TaskSetManager: Finished TID 0 in 1397 ms on localhost 
(progress: 0/2)
14/04/06 19:39:16 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)


When I interrupt the running program, here is the stack trace, which appears 
stuck after the reduce in the sorting by count in descending order:

    sorted = counts.map(lambda (key, count): (count, key)).sortByKey(False)
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 361, in sortByKey
    rddSize = self.count()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 542, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 533, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 499, in reduce
    vals = self.mapPartitions(func).collect()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", line 463, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 535, in __call__
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 363, in send_command
  File "/Users/zakons/Spark/spark-0.9.0-incubating-bin-hadoop1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 472, in send_command
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 430, in readline
    data = recv(1)
KeyboardInterrupt

Is there a reason why the sorting gets stuck?  I can easily remove the problem 
by reducing the size of the RDD to below the threshold of about 800,000 items 
before the reduce is run.
It would help to see where resources like memory are depleted, but this does 
not show up in the console.

Many thanks,
Stuart Zakon

Recommended way to develop spark application with both java and python

2014-04-07 Thread Wush Wu
Dear all,

We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are
familiar with Python, but some features are developed in Java. I am
looking for a way to integrate Java and Python on Spark.

I notice that the initialization of PySpark does not include a field to
distribute jar files to the slaves. After exploring the source code and doing
some hacking, I could control the Java SparkContext object through Py4J, but
the jar files are not delivered to the slaves. Moreover, it seems that Spark
launches the executors through the Spark home when using PySpark, but through
spark.executor.uri when using Scala.

Is there a recommended way to develop a Spark application with both
Java/Scala and Python? Should I suggest that my team unify on one language?

Thanks!


Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi,

Any thoughts on this? Thanks.

-Suren



On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Hi,

 I know if we call persist with the right options, we can have Spark
 persist an RDD's data on disk.

 I am wondering what happens in intermediate operations that could
 conceivably create large collections/Sequences, like GroupBy and shuffling.

 Basically, one part of the question is when is disk used internally?

 And is calling persist() on the RDD returned by such transformations what
 lets it know to use disk in those situations? Trying to understand if
 persist() is applied during the transformation or after it.

 Thank you.


 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@velos.io
 W: www.velos.io




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark,

Should I assume that Shark users should not use the Shark APIs since there
is no documentation for them? If there is documentation, can you point it
out?

Best Regards,

Jerry


On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hello everyone,

 I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode
 in a cluster of 6 nodes for testing purposes.

 I would like to use Shark API in Spark programs. So far I could only find
 the following:

 $ ./bin/shark-shell
 scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
 scala> println(youngUsers.count)
 ...
 scala> val featureMatrix = youngUsers.map(extractFeatures(_))
 scala> kmeans(featureMatrix)

 Is there a more complete sample code to start a program using Shark API in
 Spark?

 Thanks!

 Jerry



Require some clarity on partitioning

2014-04-07 Thread Sanjay Awatramani
Hi,

I was going through Matei's Advanced Spark presentation at 
https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions.
The slides for the video are at 
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf

The PageRank example introduces partitioning in the below way:
val ranks = // RDD of (url, rank) pairs
val links = sc.textFile(...).map(...).partitionBy(new HashPartitioner(8))

However later on, it is said that
1) Any shuffle operation on two RDDs will take on the Partitioner of one of 
them, if one is set
Question1: Could we have applied partitionBy on the ranks RDD and have the same 
result/performance ?
2) Otherwise, by default use HashPartitioner
Question2: If partitionBy applies a HashPartitioner in this example, could we 
simply not set any partitioner and rely on the default HashPartitioner to 
achieve the same result/performance?
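For illustration, a Scala sketch of the quoted rule (not from the presentation; the file name and partition count are placeholders, and each input line is assumed to be a "url dest" pair): once one side of a join carries a partitioner, operations such as mapValues preserve it, and joining two co-partitioned RDDs does not re-shuffle the already partitioned side.

import org.apache.spark.HashPartitioner

val links = sc.textFile("links.txt")
  .map { line => val parts = line.split("\\s+", 2); (parts(0), parts(1)) }
  .partitionBy(new HashPartitioner(8))   // hash-partition by URL
  .cache()

val ranks = links.mapValues(_ => 1.0)    // inherits links' HashPartitioner(8)
val contribs = links.join(ranks)         // co-partitioned: links is not re-shuffled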

I had another question unrelated to this presentation.
Question3: If my processing is something like this:
rdd3 = rdd1.join(rdd2)
rdd4 = rdd3.map((k, (v1, v2)) => (v1, k))
rdd6 = rdd4.join(rdd5)
rdd6.saveAsTextFile("out.txt")

Would I benefit from partitioning? Unlike the PageRank example, I do not have to 
join/shuffle the same RDD or key more than once.

Regards,
Sanjay

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi All,

I wanted to get Spark on YARN up and running.

I did *SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly*

Then I ran
*SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.1-incubating.jar
MASTER=yarn-client ./spark-shell*

I have SPARK_HOME and YARN_CONF_DIR/HADOOP_CONF_DIR set. Still I get the
following error.

Any clues ??

*ERROR:*
...
Using Scala version 2.9.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Initializing interpreter...
Creating SparkContext...
java.lang.NullPointerException
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:115)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
    at org.apache.spark.deploy.yarn.Client$.populateHadoopClasspath(Client.scala:489)
    at org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:510)
    at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:327)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:90)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:71)
    at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
    at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:862)
    at <init>(<console>:10)
    at <init>(<console>:22)
    at <init>(<console>:24)
    at .<init>(<console>:28)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $export(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
    at org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:897)
    at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
    at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
    at java.lang.Thread.run(Thread.java:744)
...


Re: How to create a RPM package

2014-04-07 Thread Will Benton
 For issue #2 I was concerned that the build and packaging had to be
 internal. So I am using the already packaged make-distribution.sh
 (modified to use a Maven build) to create a tarball, which I then package
 using an RPM spec file.

Hi Rahul, so the issue for downstream operating system distributions is that 
they basically need to be able to build everything from source.  So while 
you're building Spark itself locally, a package that appears in Fedora or 
Debian will go a bit further and also build all of the library dependencies, 
compilers, etc., from sources (possibly with some exceptions for bootstrapping 
compilers that can't be built without themselves) rather than pulling down 
binaries from a public Maven or Ivy repository.

 Although on a side note, it would be interesting to learn how the list of
 files does not need to be maintained in the spec file (the spec file that
 Christophe attached was using an explicit list).

If you take a look at the spec that is in Fedora 
(http://pkgs.fedoraproject.org/cgit/spark.git/tree/spark.spec, starting at line 
257 in the current revision), you'll see the following %files section:

%files -f .mfiles
%dir %{_javadir}/%{name}

%doc LICENSE README.md

%files javadoc
%{_javadocdir}/%{name}
%doc LICENSE

As you can see, this list is pretty generic (and obviously doesn't include all 
the files in the package).  The list of JAR and POM files is provided in a file 
called .mfiles (which is automatically generated by macros) and we just have to 
specify the directories that the package owns (which aren't picked up by the 
macros at this time), the license, and the README.



best,
wb


PySpark SocketConnect Issue in Cluster

2014-04-07 Thread Surendranauth Hiraman
Hi,

We have a situation where a PySpark script works fine as a local process
(local URL) on the master and the worker nodes, which would indicate that
all Python dependencies are set up properly on each machine.

But when we try to run the script at the cluster level (using the master's
URL), it fails partway through the flow on a GroupBy with a SocketConnect
error and Python crashes.

This is on ec2 using the AMI. This doesn't seem to be an issue of the
master not seeing the workers, since they show up in the web ui.

Also, we can see the job running on the cluster until it reaches the
GroupBy transform step, which is when we get the SocketConnect error.

Any ideas?

-Suren


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-)

1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're trying to get a sense of how
the processing is handled behind the scenes with respect to disk.

2. When else is disk used internally?

Any pointers are appreciated.
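For reference, a minimal Scala sketch (not from the thread; names are placeholders) illustrating that persist() is lazy: it only records the storage level, and partitions are stored as they are computed by the first action.

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i))
// persist() only marks the RDD; nothing is written until an action runs.
val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)
grouped.count()   // first action: partitions are computed, cached, and spilled to disk if needed
grouped.take(5)   // later actions reuse the persisted partitions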

-Suren




On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman 
suren.hira...@velos.io wrote:

 Hi,

 Any thoughts on this? Thanks.

 -Suren



 On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman 
 suren.hira...@velos.io wrote:

 Hi,

 I know if we call persist with the right options, we can have Spark
 persist an RDD's data on disk.

 I am wondering what happens in intermediate operations that could
 conceivably create large collections/Sequences, like GroupBy and shuffling.

 Basically, one part of the question is when is disk used internally?

  And is calling persist() on the RDD returned by such transformations what
  lets it know to use disk in those situations? Trying to understand if
  persist() is applied during the transformation or after it.

 Thank you.


 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
  E: suren.hiraman@velos.io
 W: www.velos.io




 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
  E: suren.hiraman@velos.io
 W: www.velos.io




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: reduceByKeyAndWindow Java

2014-04-07 Thread Eduardo Costa Alfaia

Hi TD
Could you explain this part of the code to me?

.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
  );
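For context, a rough Scala equivalent (not from the thread; host, port and paths are placeholders): the first function adds counts for batches entering the 5-minute window, the second subtracts counts for batches leaving it, so the window is updated incrementally every second instead of being recomputed from scratch. The inverse form needs a checkpoint directory.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext("local[2]", "windowed-counts", Seconds(1))
ssc.checkpoint("checkpoints")            // required when an inverse function is used
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val windowedCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,             // counts entering the window
  (a: Int, b: Int) => a - b,             // counts leaving the window
  Seconds(60 * 5),                       // window length: 5 minutes
  Seconds(1))                            // slide interval: 1 second
windowedCounts.print()
ssc.start()
ssc.awaitTermination()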

Thanks

On 4/4/14, 22:56, Tathagata Das wrote:
I haven't really compiled the code, but it looks good to me. Why? Is 
there any problem you are facing?


TD


On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia 
e.costaalf...@unibs.it wrote:



Hi guys,

I would like to know if this part of the code is right for using a window.

JavaPairDStream<String, Integer> wordCounts = words.map(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).reduceByKeyAndWindow(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 + i2; }
      },
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 - i2; }
      },
      new Duration(60 * 5 * 1000),
      new Duration(1 * 1000)
    );



Thanks

-- 
Privacy notice: http://www.unibs.it/node/8155






--
Privacy notice: http://www.unibs.it/node/8155


AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi all,
On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
Also, it is the default user for the Spark-EC2 script.

Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
instead of 'root'.

I can see that the Spark-EC2 script allows you to specify which user to log
in with, but even when I change this, the script fails for various reasons.
And the output SEEMS to show that the script is still based on the specified
user's home directory being '/root'.

Am I using this script wrong?
Has anyone had success with this 'ec2-user' user?
Any ideas?

Please and thank you,
Marco.


SparkContext.addFile() and FileNotFoundException

2014-04-07 Thread Thierry Herrmann
Hi,
I'm trying to use SparkContext.addFile() to propagate a file to worker
nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the
master). I don't have HDFS or any distributed file system. Just playing with
basic stuff.
Here's the code in my driver (actually spark-shell running on the master
node). In the current directory I have file spam.data
The following commands are taken from the book
http://www.packtpub.com/fast-data-processing-with-spark/book , page 44


*scala> sc.addFile("spam.data")*

14/04/07 14:03:48 INFO Utils: Copying
/home/thierry/dev/spark-samples/packt-book/LoadSaveExample/spam.data to
/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data
14/04/07 14:03:49 INFO SparkContext: Added file spam.data at
http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972

*scala> import org.apache.spark.SparkFiles*
import org.apache.spark.SparkFiles

*scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))*

14/04/07 14:05:00 INFO MemoryStore: ensureFreeSpace(138763) called with
curMem=0, maxMem=311387750
14/04/07 14:05:00 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 135.5 KB, free 296.8 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
console:13


Now trigger some action to make the worker work.

*scala> inFile.count()*


In the stderr.log of the app on the worker :

14/04/07 14:05:33 INFO Executor: Fetching
http://192.168.1.51:59008/files/spam.data with timestamp 1396893828972
14/04/07 14:05:33 INFO Utils: Fetching
http://192.168.1.51:59008/files/spam.data to
/tmp/fetchFileTemp435286457200696761.tmp

So apparently the file was successfully downloaded from the driver to the
worker. The jar of the application is also successfully downloaded.
But a bit later, in the same stderr.log:

14/04/07 14:05:34 INFO HttpBroadcast: Reading broadcast variable 0 took
0.352334273 s
14/04/07 14:05:34 INFO HadoopRDD: Input split:
file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data:0+349170
14/04/07 14:05:34 ERROR Executor: Exception in task ID 0
java.io.FileNotFoundException: File
file:/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data does not
exist
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:137)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at
org.apache.hadoop.mapred.LineRecordReader.init(LineRecordReader.java:106)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

It looks like the file is looked for in:

/tmp/spark-ad9ec403-7894-463b-9e67-08610cd1ae91/spam.data

which is the temp location on the master node where the driver is running,
while on the worker node it was downloaded to
/tmp/fetchFileTemp435286457200696761.tmp

I see hadoop related classes in the stack trace. Does it mean HDFS is used ?
If that's the case, is it because I'm using the precompiled
spark-0.9.0-incubating-bin-hadoop2 ?

I couldn't find an answer, either in the Spark user list archives, by
googling, or in the Spark guides (sorry for what is probably a very basic
question)
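For reference, a minimal sketch (not from the thread) of the pattern that works without a shared filesystem: call SparkFiles.get() inside the tasks, where it resolves to each executor's own download directory, instead of on the driver.

import scala.io.Source
import org.apache.spark.SparkFiles

sc.addFile("spam.data")
// SparkFiles.get() is evaluated on the executors here, so each task opens the
// copy of spam.data that was fetched to its own node.
val counts = sc.parallelize(1 to 4, 4).map { _ =>
  Source.fromFile(SparkFiles.get("spam.data")).getLines().size
}.collect()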




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-addFile-and-FileNotFoundException-tp3844.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
That work is under submission at an academic conference and will be made
available if/when the paper is published.

In terms of algorithms for hyperparameter tuning, we consider Grid Search,
Random Search, a couple of older derivative-free optimization methods, and
a few newer methods - TPE (aka HyperOpt from James Bergstra), SMAC (from
Frank Hutter's group), and Spearmint (Jasper Snoek's method) - the short
answer is that in our hands Random Search works surprisingly well for the
low-dimensional problems we looked at, but TPE and SMAC perform slightly
better. I've got a private branch with TPE (as well as random and grid
search) integrated with MLI, but the code is research quality right now and
not extremely general.

We're actively working on bringing these things up to snuff for a proper
open source release.




On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou yi.zou.li...@gmail.com wrote:

 Hi, Evan,

 Just noticed this thread, do you mind sharing more details regarding
 algorithms targetted at hyperparameter tuning/model selection? or a link
 to dev git repo for that work.

 thanks,
 yi


 On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

 Targeting 0.9.0 should work out of the box (just a change to the
 build.sbt) - I'll push some changes I've been sitting on to the public repo
 in the next couple of days.


 On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote:

 Thanks for the update Evan! In terms of using MLI, I see that the Github
 code is linked to Spark 0.8; will it not work with 0.9 (which is what I
 have set up) or higher versions?


  On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] wrote:

 Hi there,

 MLlib is the first component of MLbase - MLI and the higher levels of
 the stack are still being developed. Look for updates in terms of our
 progress on the hyperparameter tuning/model selection problem in the next
 month or so!

 - Evan


  On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote:

 Hi Nan,

 I was actually referring to MLI/MLBase (http://www.mlbase.org); is
 this being actively developed?

 I'm familiar with mllib and have been looking at its documentation.

 Thanks!


   On Tue, Apr 1, 2014 at 10:44 PM, Nan Zhu [via Apache Spark User List] wrote:

  mllib has been part of Spark distribution (under mllib directory),
 also check http://spark.apache.org/docs/latest/mllib-guide.html

 and for JIRA, because of the recent migration to apache JIRA, I think
 all mllib-related issues should be under the Spark umbrella,
 https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

 --
 Nan Zhu

 On Tuesday, April 1, 2014 at 10:38 PM, Krakna H wrote:

 What is the current development status of MLI/MLBase? I see that the
 github repo is lying dormant (https://github.com/amplab/MLI) and
 JIRA has had no activity in the last 30 days (
 https://spark-project.atlassian.net/browse/MLI/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
 Is the plan to add a lot of this into mllib itself without needing a
 separate API?

 Thanks!

  --
  View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.






  --
  View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-tp3610p3612.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Yana Kadiyska
I might be wrong here but I don't believe it's discouraged. Maybe part
of the reason there aren't a lot of examples is that sql2rdd returns an
RDD (a TableRDD, that is:
https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala).
I haven't done anything too complicated yet but my impression is that
almost any Spark example of manipulating RDDs should apply from
that line onwards.

Are you asking for samples what to do with the RDD once you get it or
how to get a SharkContext from a standalone program?

Also, my reading of a recent email on this list is that the Shark API will
be largely superseded by a more general Spark SQL API in 1.0
(http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html).
So if you're just starting out and you don't have short-term needs,
that might be a better place to start...

On Mon, Apr 7, 2014 at 9:14 AM, Jerry Lam chiling...@gmail.com wrote:
 Hi Shark,

 Should I assume that Shark users should not use the shark APIs since there
 are no documentations for it? If there are documentations, can you point it
 out?

 Best Regards,

 Jerry


 On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:

 Hello everyone,

 I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode
 in a cluster of 6 nodes for testing purposes.

 I would like to use Shark API in Spark programs. So far I could only find
 the following:

  $ ./bin/shark-shell
  scala> val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 20")
  scala> println(youngUsers.count)
  ...
  scala> val featureMatrix = youngUsers.map(extractFeatures(_))
  scala> kmeans(featureMatrix)

 Is there a more complete sample code to start a program using Shark API in
 Spark?

 Thanks!

 Jerry




Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram,

OK so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is as you said, allow the ssh with the
root user. Now, don't laugh, but, this worked last Friday, but today
(Monday) it no longer works. :D Why? ...

...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
'authorized_keys' file is always overwritten. This means the workaround
doesn't work anymore! I would LOVE for someone to verify this.

Just to point out, I am trying to make this work with a paravirtual
instance and not an HVM instance.

Please and thanks,
Marco.


On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access, and a lot
 of internal scripts assume the user's home directory is hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.





Driver Out of Memory

2014-04-07 Thread Eduardo Costa Alfaia

Hi Guys,

I would like to understand why the driver's RAM goes down. Does the 
processing occur only in the workers?

Thanks
# Start Tests
computer1(Worker/Source Stream)
 23:57:18 up 12:03,  1 user,  load average: 0.03, 0.31, 0.44
 total       used       free     shared    buffers     cached
Mem:  3945   1084   2860  0 44827
-/+ buffers/cache:212   3732
Swap:0  0  0
computer8 (Driver/Master)
 23:57:18 up 11:53,  5 users,  load average: 0.43, 1.19, 1.31
 total       used       free     shared    buffers     cached
Mem:  5897   4430   1466  0 384   2662
-/+ buffers/cache:   1382   4514
Swap:0  0  0
computer10(Worker/Source Stream)
 23:57:18 up 12:02,  1 user,  load average: 0.55, 1.34, 0.98
 total       used       free     shared    buffers     cached
Mem:  5897564   5332  0 18358
-/+ buffers/cache:187   5709
Swap:0  0  0
computer11(Worker/Source Stream)
 23:57:18 up 12:02,  1 user,  load average: 0.07, 0.19, 0.29
 total       used       free     shared    buffers     cached
Mem:  3945603   3342  0 54355
-/+ buffers/cache:193   3751
Swap:0  0  0

 After 2 Minutes

computer1
 00:06:41 up 12:12,  1 user,  load average: 3.11, 1.32, 0.73
 total       used       free     shared    buffers     cached
Mem:  3945   2950994  0 46 1095
-/+ buffers/cache:   1808   2136
Swap:0  0  0
computer8(Driver/Master)
 00:06:41 up 12:02,  5 users,  load average: 1.16, 0.71, 0.96
 total       used       free     shared    buffers     cached
Mem:  5897   5191705  0 385   2792
-/+ buffers/cache:   2014   3882
Swap:0  0  0
computer10
 00:06:41 up 12:11,  1 user,  load average: 2.02, 1.07, 0.89
 total       used       free     shared    buffers     cached
Mem:  5897   2567   3329  0 21647
-/+ buffers/cache:   1898   3998
Swap:0  0  0
computer11
 00:06:42 up 12:12,  1 user,  load average: 3.96, 1.83, 0.88
 total       used       free     shared    buffers     cached
Mem:  3945   3542402  0 57 1099
-/+ buffers/cache:   2385   1559
Swap:0  0  0


--
Privacy notice: http://www.unibs.it/node/8155


Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch
the instances ? The typical workflow is to use the spark-ec2 wrapper script
using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html

Shivaram


On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini 
silvio.costant...@granatads.com wrote:

 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root user's
 'authorized_keys' file is always overwritten. This means the workaround
 doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman 
 shivaram.venkatara...@gmail.com wrote:

 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume have the user's home directory hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting PermitRootLogin to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini 
 silvio.costant...@granatads.com wrote:

 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
 Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS that the script is still based on the
 specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.






CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
Hello, Spark community!  My name is Paul. I am a Spark newbie, evaluating 
version 0.9.0 without any Hadoop at all, and need some help. I run into the 
following error with the StatefulNetworkWordCount example (and similarly in my 
prototype app, when I use the updateStateByKey operation).  I get this when 
running against my small cluster, but not (so far) against local[2].

61904 [spark-akka.actor.default-dispatcher-2] ERROR 
org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming 
job 1396905956000 ms.0
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take at 
DStream.scala:586(0) has different number of partitions than original RDD 
MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
at 
org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
at org.apache.spark.rdd.RDD.take(RDD.scala:844)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)


Please let me know what other information would be helpful; I didn't find any 
question submission guidelines.

Thanks,
Paul


Re: Creating a SparkR standalone job

2014-04-07 Thread pawan kumar
Thanks Shivaram! Will give it a try and let you know.

Regards,
Pawan Venugopal


On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 You can create standalone jobs in SparkR as just R files that are run
 using the sparkR script. These commands will be sent to a Spark cluster and
 the examples on the SparkR repository (
 https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests) are in
 fact standalone jobs.

 However I don't think that will completely solve your use case of using
 Streaming + R. We don't yet have a way to call R functions from Spark's
 Java or Scala API. So right now one thing you can try is to save data from
 SparkStreaming to HDFS and then run a SparkR job which reads in the file.
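A minimal Scala sketch of that handoff (not from the thread; host, port and paths are placeholders): each streaming batch is written to HDFS as text, and a separate SparkR job can later read the same directory back with textFile().

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("spark://master-host:7077", "tweet-dump", Seconds(10))
val tweets = ssc.socketTextStream("stream-host", 9999)
// One output directory per batch, e.g. hdfs://namenode:8020/twitter/batch-<time>.txt
tweets.saveAsTextFiles("hdfs://namenode:8020/twitter/batch", "txt")
ssc.start()
ssc.awaitTermination()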

 Regarding the other idea of calling R from Scala -- it might be possible
 to do that in your code if the classpath etc. is setup correctly. I haven't
 tried it out though, but do let us know if you get it to work.

 Thanks
 Shivaram


 On Mon, Apr 7, 2014 at 2:21 PM, pawan kumar pkv...@gmail.com wrote:

 Hi,

  Is it possible to create a standalone job in Scala using SparkR? If
  possible, can you provide me with information on the setup process
  (like the dependencies in SBT and where to include the JAR files)?

 This is my use-case:

 1. I have a Spark Streaming standalone Job running in local machine which
 streams twitter data.
 2. I have an R script which performs Sentiment Analysis.

 I am looking for an optimal way where I could combine these two
 operations into a single job and run using SBT Run command.

 I came across this document which talks about embedding R into scala (
 http://dahl.byu.edu/software/jvmr/dahl-payne-uppalapati-2013.pdf) but
 was not sure if that would work well within the spark context.

 Thanks,
 Pawan Venugopal





job offering

2014-04-07 Thread Rault, Severan
Hi,

I am looking for users of Spark to join my teams here at Amazon. If you are 
reading this, you probably qualify.
I am looking for developers of ANY level, but with an interest in Spark. My 
teams are leveraging Spark to solve real business scenarios.
If you are interested, just shoot me a note and I will tell you more about the 
opportunities here in Seattle.

Best,

Severan Rault

Director Software development
Amazon


Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
Few things that would be helpful.

1. Environment settings - you can find them on the environment tab in the
Spark application UI
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write an HDFS file from a Spark program (say
spark-shell) to your HDFS installation and read it back into Spark (i.e.,
create an RDD)? You can test this by writing an RDD as a text file from the
shell, and then trying to read it back from another shell.
3. If that works, then let's try explicitly checkpointing an RDD. To do this
you can take any RDD and do the following.

myRDD.checkpoint()
myRDD.count()

If there is some issue, then this should reproduce the above error.
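A minimal sketch of step 3 (not from the thread; the HDFS URL is a placeholder). Note that rdd.checkpoint() needs sc.setCheckpointDir() pointing at a directory that every node can reach; with a driver-local path the executors write their partitions to their own local filesystems, which typically reproduces the partition-count mismatch above.

sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")
val rdd = sc.parallelize(1 to 1000, 4)
rdd.checkpoint()
rdd.count()   // materializes the RDD and writes the checkpoint files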

TD


On Mon, Apr 7, 2014 at 3:48 PM, Paul Mogren pmog...@commercehub.com wrote:

 Hello, Spark community!  My name is Paul. I am a Spark newbie, evaluating
 version 0.9.0 without any Hadoop at all, and need some help. I run into the
 following error with the StatefulNetworkWordCount example (and similarly in
 my prototype app, when I use the updateStateByKey operation).  I get this
 when running against my small cluster, but not (so far) against local[2].

 61904 [spark-akka.actor.default-dispatcher-2] ERROR
 org.apache.spark.streaming.scheduler.JobScheduler - Error running job
 streaming job 1396905956000 ms.0
 org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[310] at take
 at DStream.scala:586(0) has different number of partitions than original
 RDD MapPartitionsRDD[309] at mapPartitions at StateDStream.scala:66(2)
 at
 org.apache.spark.rdd.RDDCheckpointData.doCheckpoint(RDDCheckpointData.scala:99)
 at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:989)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:855)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:870)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:884)
 at org.apache.spark.rdd.RDD.take(RDD.scala:844)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:586)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachFunc$2$1.apply(DStream.scala:585)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:155)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:744)


 Please let me know what other information would be helpful; I didn't find
 any question submission guidelines.

 Thanks,
 Paul



RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
any reason why RDDInfo suddenly became private in SPARK-1132?

we are using it to show users status of rdds


Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
ok yeah we are using StageInfo and TaskInfo too...


On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 Other users have expressed interest for us to expose similar classes too
 (i.e. StageInfo, TaskInfo). In the newest release, they will be available
 as part of the developer API. The particular PR that will change this is:
 https://github.com/apache/spark/pull/274.

 Cheers,
 Andrew


 On Mon, Apr 7, 2014 at 5:05 PM, Koert Kuipers ko...@tresata.com wrote:

 any reason why RDDInfo suddenly became private in SPARK-1132?

 we are using it to show users status of rdds





RE: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
1.:  I will paste the full content of the environment page of the example 
application running against the cluster at the end of this message.
2. and 3.:  Following #2 I was able to see that the count was incorrectly 0 
when running against the cluster, and following #3 I was able to get the 
message:
org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4] at count at 
<console>:15(0) has different number of partitions than original RDD 
MappedRDD[3] at textFile at <console>:12(2)

I think I understand - state checkpoints and other file-exchange operations in 
a Spark cluster require a distributed/shared filesystem, even with just a 
single-host cluster and the driver/shell on a second host. Is that correct?

Thank you,
Paul



Stages
Storage
Environment
Executors
NetworkWordCumulativeCountUpdateStateByKey application UI
Environment
Runtime Information

Name        Value
Java Home   /usr/lib/jvm/jdk1.8.0/jre
Java Version1.8.0 (Oracle Corporation)
Scala Home  
Scala Version   version 2.10.3
Spark Properties

Name        Value
spark.app.name  NetworkWordCumulativeCountUpdateStateByKey
spark.cleaner.ttl   3600
spark.deploy.recoveryMode   ZOOKEEPER
spark.deploy.zookeeper.url  pubsub01:2181
spark.driver.host   10.10.41.67
spark.driver.port   37360
spark.fileserver.urihttp://10.10.41.67:40368
spark.home  /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
spark.httpBroadcast.uri http://10.10.41.67:45440
spark.jars  
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
spark.masterspark://10.10.41.19:7077
System Properties

Name        Value
awt.toolkit sun.awt.X11.XToolkit
file.encoding   ANSI_X3.4-1968
file.encoding.pkg   sun.io
file.separator  /
java.awt.graphicsenvsun.awt.X11GraphicsEnvironment
java.awt.printerjob sun.print.PSPrinterJob
java.class.version  52.0
java.endorsed.dirs  /usr/lib/jvm/jdk1.8.0/jre/lib/endorsed
java.ext.dirs   /usr/lib/jvm/jdk1.8.0/jre/lib/ext:/usr/java/packages/lib/ext
java.home   /usr/lib/jvm/jdk1.8.0/jre
java.io.tmpdir  /tmp
java.library.path   
java.net.preferIPv4Stacktrue
java.runtime.name   Java(TM) SE Runtime Environment
java.runtime.version1.8.0-b132
java.specification.name Java Platform API Specification
java.specification.vendor   Oracle Corporation
java.specification.version  1.8
java.vendor Oracle Corporation
java.vendor.url http://java.oracle.com/
java.vendor.url.bug http://bugreport.sun.com/bugreport/
java.version1.8.0
java.vm.infomixed mode
java.vm.nameJava HotSpot(TM) 64-Bit Server VM
java.vm.specification.name  Java Virtual Machine Specification
java.vm.specification.vendorOracle Corporation
java.vm.specification.version   1.8
java.vm.vendor  Oracle Corporation
java.vm.version 25.0-b70
line.separator  
log4j.configuration conf/log4j.properties
os.arch amd64
os.name Linux
os.version  3.5.0-23-generic
path.separator  :
sun.arch.data.model 64
sun.boot.class.path 
/usr/lib/jvm/jdk1.8.0/jre/lib/resources.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/rt.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/sunrsasign.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jsse.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jce.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/charsets.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jfr.jar:/usr/lib/jvm/jdk1.8.0/jre/classes
sun.boot.library.path   /usr/lib/jvm/jdk1.8.0/jre/lib/amd64
sun.cpu.endian  little
sun.cpu.isalist 
sun.io.serialization.extendedDebugInfo  true
sun.io.unicode.encoding UnicodeLittle
sun.java.command
org.apache.spark.streaming.examples.StatefulNetworkWordCount 
spark://10.10.41.19:7077 localhost 
sun.java.launcher   SUN_STANDARD
sun.jnu.encodingANSI_X3.4-1968
sun.management.compiler HotSpot 64-Bit Tiered Compilers
sun.nio.ch.bugLevel 
sun.os.patch.level  unknown
user.countryUS
user.dir/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
user.home   /home/pmogren
user.language   en
user.name   pmogren
user.timezone   America/New_York
Classpath Entries

Resource        Source
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
 System Classpath
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/conf        System Classpath
/home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
System Classpath
http://10.10.41.67:40368/jars/spark-examples_2.10-assembly-0.9.0-incubating.jar 
Added By User











From: Tathagata Das [mailto:tathagata.das1...@gmail.com] 
Sent: Monday, April 07, 2014 7:54 PM
To: user@spark.apache.org
Subject: Re: CheckpointRDD has different number of partitions than original RDD

Few things that would be helpful. 

1. Environment settings - you can find them on the environment tab in the Spark 
application UI
2. 

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-07 Thread Francis . Hu
Great!!!

When I built it on another disk formatted as ext4, it now works.

hadoop@ubuntu-1:~$ df -Th
FilesystemType  Size  Used Avail Use% Mounted on
/dev/sdb6 ext4  135G  8.6G  119G   7% /
udev  devtmpfs  7.7G  4.0K  7.7G   1% /dev
tmpfs tmpfs 3.1G  316K  3.1G   1% /run
none  tmpfs 5.0M 0  5.0M   0% /run/lock
none  tmpfs 7.8G  4.0K  7.8G   1% /run/shm
/dev/sda1 ext4  112G  3.7G  103G   4% /faststore
/home/hadoop/.Private ecryptfs  135G  8.6G  119G   7% /home/hadoop

Thanks again, Marcelo Vanzin.


Francis.Hu

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com] 
Sent: Saturday, April 05, 2014 1:13
To: user@spark.apache.org
Subject: Re: java.lang.NoClassDefFoundError: 
scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

Hi Francis,

This might be a long shot, but do you happen to have built spark on an
encrypted home dir?

(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't fail, but
you do get warnings about the long file names.)


On Wed, Apr 2, 2014 at 3:26 AM, Francis.Hu francis...@reachjunction.com wrote:
 I am stuck on a NoClassDefFoundError.  Any help would be appreciated.

 I downloaded the Spark 0.9.0 source, and then ran this command to build it:
 SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly


 java.lang.NoClassDefFoundError:
 scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply$5$$anonfun$scala$tools$nsc$transform$UnCurry$UnCurryTransformer$$anonfun$$anonfun$$transformInConstructor$1$1

-- 
Marcelo



Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:

 I am running the latest version of PySpark branch-0.9 and having some
 trouble with join.

 One RDD is about 100G (25GB compressed and serialized in memory) with
 130K records, the other RDD is about 10G (2.5G compressed and
 serialized in memory) with 330K records.  I load each RDD from HDFS,
 invoke keyBy to key each record, and then attempt to join the RDDs.

 The join consistently crashes at the beginning of the reduce phase.
 Note that when joining the 10G RDD to itself there is no problem.

 Prior to the crash, several suspicious things happen:

 -All map output files from the map phase of the join are written to
 spark.local.dir, even though there should be plenty of memory left to
 contain the map output.  I am reasonably sure *all* map outputs are
 written to disk because the size of the map output spill directory
 matches the size of the shuffle write (as displayed in the user
 interface) for each machine.


The shuffle data is written through the buffer cache of the operating
system, so you would expect the files to show up there immediately and
probably to show up as being their full size when you do ls. In reality
though these are likely residing in the OS cache and not on disk.


 -In the beginning of the reduce phase of the join, memory consumption
 on each worker spikes and each machine runs out of memory (as evidenced
 by a "Cannot allocate memory" exception in Java).  This is
 particularly surprising since each machine has 30G of RAM and each
 Spark worker has only been allowed 10G.


Could you paste the error here?


 -In the web UI both the Shuffle Spill (Memory) and Shuffle Spill
 (Disk) fields for each machine remain at 0.0 despite shuffle files
 being written into spark.local.dir.


Shuffle spill is different than the shuffle files written to
spark.local.dir. Shuffle spilling is for aggregations that occur on the
reduce side of the shuffle. A join like this might not see any spilling.



From the logs it looks like your executor has died. Would you be able to
paste the log from the executor with the exact failure? It would show up in
the /work directory inside of spark's directory on the cluster.


[BLOG] For Beginners

2014-04-07 Thread prabeesh k
Hi all,

Here I am sharing blog posts for beginners, about creating a standalone Spark
Streaming application and bundling the app as a single runnable jar. Take a
look and drop your comments on the blog pages.

http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html

http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html

prabeesh


Mongo-Hadoop Connector with Spark

2014-04-07 Thread Pavan Kumar
Hi Everyone,

I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using a Java Spark MapReduce job. Previously I have
successfully processed normal MongoDB collections (not GridFS) with Apache
Spark using the Mongo-Hadoop connector. Now I'm unable to handle GridFS
collections as input.

My questions here are:

1) How do I pass MongoDB GridFS data as input to our Spark application?
2) Do we have a separate RDD to handle GridFS binary data?

I'm trying the following snippet but I'm unable to get the actual data.

 MongoConfigUtil.setInputURI(config,
     "mongodb://localhost:27017/pdfbooks.fs.chunks");
 MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
 JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
     com.mongodb.hadoop.MongoInputFormat.class, Object.class,
     BSONObject.class);
 JavaRDD<String> words = mongoRDD.flatMap(
     new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
       @Override
       public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
         System.out.println(arg._2.toString());
         ...

Please suggest available ways to do this. Thank you in advance!!!
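For reference, a hedged Scala sketch (untested; it assumes the standard GridFS chunk schema with files_id, n and data fields, and mirrors the configuration from the snippet above) of reassembling each file's bytes from fs.chunks before handing them to a PDF parser:

import com.mongodb.hadoop.MongoInputFormat
import com.mongodb.hadoop.util.MongoConfigUtil
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.bson.BSONObject

object GridFsReassemble {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "gridfs-reassemble")
    val config = new Configuration()
    MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks")

    val chunks = sc.newAPIHadoopRDD(config, classOf[MongoInputFormat],
      classOf[Object], classOf[BSONObject])

    // Each record is one GridFS chunk; group by files_id and order by n to
    // rebuild the original binary. Depending on the BSON decoder, the data
    // field may arrive as Array[Byte] or as org.bson.types.Binary.
    val files = chunks
      .map { case (_, doc) =>
        val bytes = doc.get("data") match {
          case b: Array[Byte]           => b
          case b: org.bson.types.Binary => b.getData
        }
        (doc.get("files_id").toString, (doc.get("n").asInstanceOf[Int], bytes))
      }
      .groupByKey()
      .mapValues(parts => parts.toSeq.sortBy(_._1).map(_._2).reduce(_ ++ _))

    files.mapValues(_.length).collect().foreach(println)
    sc.stop()
  }
}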