Hi, DB
I tried including the breeze library by using Spark 1.0 and it works. But how
can I call the native library in standalone cluster mode?
In local mode:
1. I include the org.scalanlp % breeze-natives_2.10 % 0.7 dependency in my
sbt build file
2. I install OpenBLAS
and it works.
In standalone
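For reference, step 1 above written out as an sbt dependency line (a sketch;
the coordinates and version are exactly those given above):

  libraryDependencies += "org.scalanlp" % "breeze-natives_2.10" % "0.7"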
Hi Everyone,
I found it quite difficult to find good examples for Spark RDD API calls. So
my student and I decided to go through the entire API and write examples for
the vast majority of API calls (basically, examples for anything that is
remotely interesting). I think these examples may be useful.
I recall someone from the Spark team (TD?) saying that Spark 0.9.1 would change
the logger so that the circular-loop error between slf4j and log4j wouldn't show
up. Yet on Spark 0.9.1 I still get:
SLF4J: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class
path, preempting
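A common workaround (my assumption, not something confirmed in this thread) is
to keep only one binding on the classpath, e.g. by excluding slf4j-log4j12 from
spark-core in build.sbt if you want to route through log4j-over-slf4j:

  libraryDependencies += ("org.apache.spark" %% "spark-core" % "0.9.1")
    .exclude("org.slf4j", "slf4j-log4j12")

(or, conversely, drop log4j-over-slf4j from whichever of your dependencies
pulls it in).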
Hi,
Is there any lower bound on the size of an RDD for optimally utilizing Spark's
in-memory framework?
Say creating an RDD for a very small data set of some 64 MB is not as efficient
as one of some 256 MB; then the application could be tuned accordingly.
So is there a soft lower bound related to
Setting `hadoop.tmp.dir` in `spark-env.sh` solved the problem. The Spark job no
longer writes tmp files in /tmp/hadoop-root/.
SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dhadoop.tmp.dir=/mnt/ephemeral-hdfs"
export SPARK_JAVA_OPTS
I'm wondering if we need to permanently add this in the
PS A spark shell with all proper imports is also supported natively in
Mahout (the `mahout spark-shell` command). See MAHOUT-1489 for specifics. There's
also a tutorial somewhere, but I suspect it has not yet been finished/published
via a public link. Again, you need trunk to use the spark shell there.
On Wed,
PPS The shell/spark tutorial I've mentioned is actually being developed in
MAHOUT-1542. As it stands, I believe it is now complete in its core.
On Wed, May 14, 2014 at 5:48 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
PS spark shell with all proper imports are also supported natively in
My code looks like the following:

var rdd1 = ...
var rdd2 = ...
var kv = ...
for (i <- 0 until n) {
  var kvGlobal = sc.broadcast(kv) // broadcast kv
  rdd1 = rdd2.map {
    case t => doSomething(t, kvGlobal.value)
  }
  var tmp =
I took a stab at it and wrote a partitioner
(https://github.com/syedhashmi/spark/commit/4ca94cc155aea4be36505d5f37d037e209078196)
that I intend to contribute back to the main repo some time later. The
partitioner takes a parameter which governs the minimum number of keys per
partition, and once all partition
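For readers new to custom partitioners, here is a minimal sketch of the general
shape (this is not the linked implementation; the class name and parameters are
placeholders, and the simple hash policy stands in for whatever balancing the
real code does):

  import org.apache.spark.Partitioner

  class MinKeysPartitioner(numKeys: Int, minKeysPerPartition: Int) extends Partitioner {
    // Derive the partition count from the key count and the per-partition minimum.
    override def numPartitions: Int = math.max(1, numKeys / minKeysPerPartition)
    // Route each key by a non-negative hash mod.
    override def getPartition(key: Any): Int = {
      val mod = key.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
    }
  }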
Hey Patrick,
I have a SparkConf I can add them to. I was looking for a way to do this
where they are not hardwired within Scala, which is what SPARK_JAVA_OPTS
used to do.
I guess if I just set -Dspark.akka.frameSize=1 on my Java app launch,
then it will get picked up by the SparkConf too.
Finally I fixed it. The previous failure was caused by a lack of some jars.
I pasted the local-mode classpath to the workers, obtained by using `show
compile:dependencyClasspath`, and it works!
Hi Everyone,
I think everyone is pretty busy; the response time in this group has increased
slightly.
Anyway, this is a pretty silly problem, but I could not get past it.
I have a file in my local FS, but when I try to create an RDD out of it,
the task fails with a file-not-found exception thrown at
Also try this out; we have already done this, and it will help you:
http://docs.sigmoidanalytics.com/index.php/Setup_hadoop_2.0.0-cdh4.2.0_and_spark_0.9.0_on_ubuntu_12.04
On Tue, May 6, 2014 at 10:17 PM, Andrew Lee alee...@hotmail.com wrote:
Please check JAVA_HOME. Usually it should point to
Hi, Mayur
I've met the same problem. The instances are up, I can see them in the EC2
console, and I can connect to them:
wxhsdp@ubuntu:~/spark/spark/tags/v1.0.0-rc3/ec2$ ssh -i wxhsdp-us-east.pem
root@54.86.181.108
The authenticity of host '54.86.181.108 (54.86.181.108)' can't be
established.
ECDSA key
Hello,
I'm trying to write a Python function that does something like:

def foo(line):
    try:
        return stuff(line)
    except Exception:
        raise MoreInformativeException(line)

and then use it in a map like so:

rdd.map(foo)

and have my MoreInformativeException make it back if/when
Hi Wxhsdp,
I also have some difficulties with sc.addJar(). Since we include the
breeze library by using Spark 1.0, we don't have the problem you ran into.
However, when we add external jars via sc.addJar(), I found that the
executors actually fetch the jars but the classloader still doesn't
This is my first implementation. There are a few rough edges, but when I run
this I get the following exception. The class extends Partitioner, which in
turn extends Serializable. Any idea what I am doing wrong?

scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction))
This seems unrelated to not being able to load the native-hadoop library. Is it
failing to connect to the ResourceManager? Have you verified that there is an
RM process listening on port 8032 at the specified IP?
On Tue, May 6, 2014 at 6:25 PM, Sophia sln-1...@163.com wrote:
Hi, everyone,
Hi,
I am trying to find a way to fill in missing values in an RDD. The RDD is a
sorted sequence.
For example: (1, 2, 3, 5, 8, 11, ...)
I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).
One way to do this is to slide and zip:

rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11,
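Here is a minimal sketch of that slide-and-zip idea (names are mine; it assumes
Spark 1.0's RDD.zipWithIndex and the pair-RDD implicit import):

  import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 style)

  val rdd = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
  val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
  // Shift the index by one so each element joins with its successor.
  val successors = indexed.map { case (i, v) => (i - 1, v) }
  val filled = indexed.join(successors)
    .flatMap { case (_, (a, b)) => a until b }    // emit [a, b) per adjacent pair
    .union(sc.parallelize(Seq(rdd.top(1).head)))  // re-attach the last element
  filled.collect().sorted  // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)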
Same problem +1,
though it does not change the program's result.
--
Nan Zhu
On Tuesday, May 6, 2014 at 11:58 PM, Tathagata Das wrote:
Okay, this needs to be fixed. Thanks for reporting this!
On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote:
Hi,
I'm also using SPARK_EXECUTOR_URI right now, though I would prefer
distributing Spark as a binary package.
For running examples with `./bin/run-example ...` it works fine; however,
tasks from spark-shell are getting lost:
Error: Could not find or load main class
Hi Ian,
Don't use SPARK_MEM in spark-env.sh; it will set that for all of your jobs.
The better way is to use only the second option,
sconf.setExecutorEnv("spark.executor.memory", "4g"), i.e. set it in the driver
program. This way every job will get memory according to its requirement.
For example:
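Here is a minimal sketch of the driver-side approach (the app name, master, and
the 4g value are placeholders; SparkConf.set is the plain-property variant of
the setting above):

  import org.apache.spark.{SparkConf, SparkContext}

  val sconf = new SparkConf()
    .setAppName("my-job")              // placeholder
    .setMaster("spark://master:7077")  // placeholder
    .set("spark.executor.memory", "4g")  // per-job executor memory
  val sc = new SparkContext(sconf)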
It seems the concept I had been missing is to invoke the DStream foreach
method. This method takes a function expecting an RDD and applies the
function to each RDD within the DStream.
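A small sketch of that pattern (assuming the 0.9-era streaming API; host, port,
and batch interval are placeholders):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext("local[2]", "ForeachDemo", Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)
  // foreach hands each batch's RDD to this function (newer releases call it foreachRDD)
  lines.foreach { rdd =>
    println("lines in this batch: " + rdd.count())
  }
  ssc.start()
  ssc.awaitTermination()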
2014-05-14 21:33 GMT-07:00 Stephen Boesch java...@gmail.com:
Looking further it appears the functionality I
Mahout now supports doing its distributed linalg natively on Spark, so the
problem of loading sequence-file input into Spark is already solved there
(trunk, http://mahout.apache.org/users/sparkbindings/home.html, the
drmFromHDFS() call), and then you can access the underlying RDD directly via
the matrix's rdd property.
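In code that looks roughly like the sketch below (assembled from the call names
above; the path is a placeholder and the exact package/import may differ by
Mahout version):

  import org.apache.mahout.sparkbindings._

  val drm = drmFromHDFS("hdfs:///path/to/seqfile")  // placeholder path
  val matrixRdd = drm.rdd  // the underlying Spark RDD, via the rdd property noted above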
Most likely your Shark server is not started.
Are you connecting to the cluster or running in local mode?
What is the lowest error on the stack?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, May 12, 2014 at 2:07 PM,
Hi,
I use the following code for calculating an average. The problem is that the
reduce operation returns a DStream here, and not a tuple as it normally does
without streaming. So how can we get the sum and the count from the DStream?
Can we cast it to a tuple?

val numbers =
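For context, one way to get concrete values out is to act on each batch's RDD
rather than casting; a minimal sketch (input source, names, and batch interval
are placeholders of mine):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext("local[2]", "StreamingAvg", Seconds(5))
  val nums = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

  // Reduce each batch to a single (sum, count) pair...
  val sumCount = nums.map(x => (x, 1L))
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

  // ...then read the tuple out per batch on the driver.
  sumCount.foreachRDD { rdd =>
    rdd.collect().foreach { case (sum, count) => println("avg = " + (sum / count)) }
  }
  ssc.start()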
Is anyone aware of a way to configure the Mesos GroupProcess port on the
Mesos slave/task which the Mesos master calls back on?
The log line that shows this port looks like the following (Mesos 0.17.0):
I0507 02:37:20.893334 11638 group.cpp:310] Group process ((2)@1.2.3.4:54321)
connected to ZooKeeper.
Spark 1.0.0 rc5 is available and open for voting.
Give it a try and vote on it on the dev mailing list.
-
Madhu
https://www.linkedin.com/in/msiddalingaiah
Should add that I had to tweak the numbers a bit to stay above the swap
threshold but below the "Too many open files" error (`ulimit -n` is
32768).
On Wed, May 14, 2014 at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote:
That worked amazingly well, thank you Matei! Numbers that worked for
me were 400
This is something that I have bumped into time and again: the object that
contains your main() should also be serializable; then you won't have this
issue.
For example:

object Test extends Serializable {
  def main(args: Array[String]) {
    // set up spark context
    // read your data
    // create your RDDs (grouped by key)
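Fleshed out into a runnable sketch (the data and the per-key sum are
placeholders of mine):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3)

  object Test extends Serializable {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("Test").setMaster("local[2]"))
      // A grouped-by-key RDD, as in the outline above (data is a placeholder).
      val grouped = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).groupByKey()
      grouped.mapValues(_.sum).collect().foreach(println)
      sc.stop()
    }
  }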
That worked amazingly well, thank you Matei! Numbers that worked for
me were 400 for the textFile()s, 1500 for the join()s.
On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hey Jim, unfortunately external spilling is not implemented in Python right
now. While it
Looking further, it appears the functionality I am seeking is in the
following private[spark] class ForEachDStream
(version 0.8.1; yes, we are presently using an older release..)

private[streaming]
class ForEachDStream[T: ClassManifest] (
    parent: DStream[T],
    foreachFunc: (RDD[T],
Cool, that’s good to hear. We’d also like to add spilling in Python itself, or
at least make it exit with a good message if it can’t do it.
Matei
On May 14, 2014, at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote:
That worked amazingly well, thank you Matei! Numbers that worked for
me were
Just wondering - how are you launching your application? If you want
to set values like this, the right way is to add them to the SparkConf
when you create a SparkContext:

val conf = new SparkConf().set("spark.akka.frameSize", "1").setAppName(...).setMaster(...)
val sc = new SparkContext(conf)
-
Is the RDD not cached? Because recomputation may be required, every broadcast
object is included in the dependencies of the RDDs; this may also cause a
memory issue (when n and kv are too large, as in your case).
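If the per-iteration broadcasts are the issue, one mitigation sketch using the
variable names from the original code (assuming Spark 1.0's
Broadcast.unpersist(); not something suggested in this thread):

  for (i <- 0 until n) {
    val kvGlobal = sc.broadcast(kv)  // per-iteration broadcast, as before
    rdd1 = rdd2.map { case t => doSomething(t, kvGlobal.value) }.cache()
    rdd1.count()          // materialize rdd1 before releasing the broadcast
    kvGlobal.unpersist()  // free this iteration's broadcast blocks on executors
  }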
Solved: Putting HADOOP_CONF_DIR in the spark-env of the workers solved the
problem.
The difference between HadoopRDD and NewHadoopRDD is that the old one
creates the JobConf on the worker side, whereas the new one creates an instance
of JobConf on the driver side and then broadcasts it.
I tried creating
Hi,
For each line that we read as a text line from HDFS, we have a schema. If
there is an API that takes the schema as a List[Symbol] and maps each token
to its Symbol, it would be helpful...
Do RDDs provide a schema view of the dataset on HDFS?
Thanks.
Deb
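There's no such built-in on plain RDDs as far as I know, but a hand-rolled
sketch is short (schema, delimiter, and path are placeholders):

  val schema = List('id, 'name, 'age)  // hypothetical schema
  val rows = sc.textFile("hdfs:///data/input.csv")
    .map(line => schema.zip(line.split(",")).toMap)  // Map[Symbol, String] per line
  rows.first()  // e.g. Map('id -> "1", 'name -> "alice", 'age -> "30")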
rdd1 is cached, but it has no effect:

var rdd1 = ...
var rdd2 = ...
var kv = ...
for (i <- 0 until n) {
  var kvGlobal = sc.broadcast(kv) // broadcast kv
  rdd1 = rdd2.map {
    case t => doSomething(t, kvGlobal.value)
  }.cache()
  var tmp =
Hi,
Patrick said, "The intermediate shuffle output gets written to disk, but it
often hits the OS buffer cache since it's not explicitly fsync'ed, so in many
cases it stays entirely in memory. The behavior of the shuffle is agnostic to
whether the base RDD is in cache or on disk."
I
The new Spark SQL component is designed for this!
I haven't received any spark-user mail since yesterday. Can you guys receive
any new mail?
--
Cheney
But when I put the broadcast variable outside the for loop, it works well (if
we're not concerned about the memory issue you pointed out):

var rdd1 = ...
var rdd2 = ...
var kv = ...
var kvGlobal = sc.broadcast(kv) // broadcast kv
for (i <- 0 until n) {
  rdd1 =
I'm still fairly new to this, but I found problems using classes in maps if
they used instance variables in part of the map function. It seems like for
maps and such to work correctly, it needs to be purely functional
programming.
Hi,
Spark's local mode is great for creating simple unit tests for our Spark
logic. The disadvantage, however, is that certain types of problems are never
exposed in local mode because things never need to be put on the wire.
E.g. if I accidentally use a closure which has something non-serializable
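One way to surface such problems in tests is Spark's pseudo-cluster master
string, which forces closures through real serialization; a sketch
(local-cluster is used by Spark's own test suite, so treat its syntax and
availability as an assumption to verify on your version):

  import org.apache.spark.SparkContext

  // 2 workers x 1 core x 512 MB each; tasks go over the wire, so a
  // non-serializable closure fails here just like on a real cluster.
  val sc = new SparkContext("local-cluster[2,1,512]", "SerializationTest")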
Curious what the bug is and what it breaks? I have Spark 0.9.0 running on
Mesos 0.17.0 and it seems to work correctly.
Hi all,
There seems to be a problem. I have not been getting mails from the Spark user
group for two days.
Regards,
Laeeq
Hi all,
Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
sc.textFile() and other HDFS-related APIs?
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
Best regards,
-chanwit
--
Chanwit Kaewkasi
linkedin.com/in/chanwit