You need Cassandra 1.2.6 for the Spark examples —
Sent from Mailbox
On Thu, Jun 5, 2014 at 12:02 AM, Tim Kellogg t...@2lemetry.com wrote:
Hi,
I’m following the directions to run the Cassandra example
“org.apache.spark.examples.CassandraTest” and I get this error:
Exception in thread "main"
Hi Krishna,
Also, the default optimizer with SGD converges really slowly. If you are
willing to write Scala code, there is a full working example for
training Logistic Regression with L-BFGS (a quasi-Newton method) in
Scala. It converges much faster than SGD.
See
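For a rough idea, a minimal sketch of such a run against the MLlib optimization API (the input path and all parameter values below are placeholders, not from the original example):
---
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

// Assumes a SparkContext named sc is already in scope (e.g. in a compiled app).
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val numFeatures = data.take(1)(0).features.size

// runLBFGS takes (label, features) pairs.
val training = data.map(p => (p.label, p.features)).cache()

val (weights, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,    // numCorrections: history size of the quasi-Newton approximation
  1e-4,  // convergence tolerance
  100,   // max iterations
  0.1,   // L2 regularization parameter
  Vectors.dense(new Array[Double](numFeatures))) // zero initial weights
---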
Hi Krishna,
It should work, and we use it in production with great success.
However, the constructor of LogisticRegressionModel is private[mllib],
so you have to put your code in a package under
org.apache.spark.mllib instead of using the Scala console.
Sincerely,
DB Tsai
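A minimal sketch of that package-name workaround (the sub-package and object names here are made up for illustration):
---
// Living under org.apache.spark.mllib makes the private[mllib] constructor visible.
package org.apache.spark.mllib.myapp

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector

object ModelBuilder {
  // weights would typically come from LBFGS.runLBFGS; intercept fixed at 0.0 here.
  def build(weights: Vector): LogisticRegressionModel =
    new LogisticRegressionModel(weights, 0.0)
}
---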
Sorry for replying late. It was night here.
Lian/Matei,
Here is the code snippet -
sparkConf.set("spark.executor.memory", "10g")
sparkConf.set("spark.cores.max", "5")
val sc = new SparkContext(sparkConf)
val accId2LocRDD =
Hi Cheng,
Sorry Again.
In this method, I see that the values for
a <- positions.iterator
b <- positions.iterator
always remain the same. I tried to do b <- positions.iterator.next, and it
throws an error: value filter is not a member of (Double, Double)
Is there something I
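For reference, the shape being discussed is a for-comprehension with two generators over the same collection; a tiny self-contained sketch (the data here is invented):
---
val positions = List((1.0, 2.0), (3.0, 4.0))

// Each generator draws from a fresh iterator, producing every (a, b) pair.
// Writing b <- positions.iterator.next fails because .next returns a single
// (Double, Double) tuple, which lacks the map/flatMap/filter methods that
// the for-comprehension desugars into.
val pairs = for {
  a <- positions.iterator
  b <- positions.iterator
} yield (a, b)
---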
OK, I see.
I imported the wrong jar files, which only work with the default Hadoop version.
2014-06-05
bluejoe2008
From: prabeesh k
Date: 2014-06-05 16:14
To: user
Subject: Re: Re: mismatched hdfs protocol
If you are not setting the Hadoop version, Spark is built against the
default Hadoop version.
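If I remember the sbt build of that era correctly, the Hadoop version is picked at assembly time with an environment variable, along the lines of (the version number is only an example):
---
SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly
---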
Lakshmi, this is orthogonal to your question, but in case it's useful.
It sounds like you're trying to determine the home location of a user, or
something similar.
If that's the problem statement, the data pattern may suggest a far more
computationally efficient approach. For example, first map
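One plausible shape of that map-first approach, as a hedged sketch (the Event fields and the "most frequent location wins" heuristic are my guesses, not from the original message):
---
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Event(userId: String, location: String)

// Count sightings per (user, location), then keep each user's most frequent one.
def homeLocations(events: RDD[Event]): RDD[(String, String)] =
  events
    .map(e => ((e.userId, e.location), 1L))
    .reduceByKey(_ + _)
    .map { case ((user, loc), n) => (user, (loc, n)) }
    .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
    .mapValues(_._1)
---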
I shan't be far. I'm committed now. Spark and I are going to have a very
interesting future together, but hopefully future messages will be about
the algorithms and modules, and less "how do I run make?".
I suspect doing this at the exact moment of the 0.9 to 1.0.0 transition
hasn't helped me. (I
Try the "sbt clean" command before building the app,
or delete the .ivy2 and .sbt folders (not a good method). Then try to rebuild
the project.
On Thu, Jun 5, 2014 at 11:45 AM, Sean Owen so...@cloudera.com wrote:
I think this is SPARK-1949 again: https://github.com/apache/spark/pull/906
I think this
Thanks a lot for your reply. I can see the Kryo serializer in the UI.
I have another query:
I wanted to know the meaning of the following log message when running a
Spark Streaming job:
[spark-akka.actor.default-dispatcher-18] INFO
org.apache.spark.streaming.scheduler.JobScheduler - Total
Hi,
We're using MLlib (1.0.0 release version) on a k-means clustering problem.
We want to reduce the matrix column size before sending the points to the
k-means solver.
It works on my Mac in local mode: spark-test-run-assembly-1.0.jar
contains my application code, the com.github.fommil netlib code
Hi,
I am trying to use Spark Streaming with Kafka, which works like a
charm -- except for shutdown. When I run my program with sbt
run-main, sbt will never exit, because there are two non-daemon
threads left that don't die.
I created a minimal example at
Hi,
I have written my own custom Spark Streaming code which connects to a Kafka
server and fetches data. I have tested the code in local mode and it is
working fine. But when I execute the same code in YARN mode, I am
getting a KafkaReceiver class not found exception. I am providing the Spark
Hi Cheng,
Thanks a lot. That solved my problem.
Thanks again for the quick response and solution.
Hi Ajatix,
Yes, HADOOP_HOME is set on the nodes, and I did update the bash profile.
As I said, adding MESOS_HADOOP_HOME did not work.
But what is causing the original error: java.lang.Error:
java.io.IOException: failure to login?
--
Thanks
Hi,
I am trying to do something like the following in Spark:
JavaPairRDD<byte[], MyObject> eventRDD = hBaseRDD.map(new
PairFunction<Tuple2<ImmutableBytesWritable, Result>, byte[], MyObject>() {
@Override
public Tuple2<byte[], MyObject>
call(Tuple2<ImmutableBytesWritable, Result>
Hi,
I have a JTree. I want to serialize it using
sc.saveAsObjectFile(path). I could save it in some location. The real
problem is that when I deserialize it back using sc.objectFile(), I am not
getting the JTree. Can anyone please help me on this?
Thanks
Dear Aneesh,
Your particular use case of using Swing GUI components with Spark is a bit
unclear to me.
Assuming that you want Spark to operate on a tree object, you could use an
implementation of the TreeModel (
http://docs.oracle.com/javase/8/docs/api/javax/swing/tree/DefaultTreeModel.html
On Wed, Jun 4, 2014 at 10:39 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
That’s a good idea too, maybe we can change CallSiteInfo to do that.
I've filed an issue: https://issues.apache.org/jira/browse/SPARK-2035
Matei
On Jun 4, 2014, at 8:44 AM, Daniel Darabos
Any inputs on this will be helpful.
Thanks,
-Vibhor
On Thu, Jun 5, 2014 at 3:41 PM, Vibhor Banga vibhorba...@gmail.com wrote:
Hi,
I am trying to do something like the following in Spark:
JavaPairRDD<byte[], MyObject> eventRDD = hBaseRDD.map(new
PairFunction<Tuple2<ImmutableBytesWritable, Result>,
The same issue persists in Spark 1.0.0 as well (I was using 0.9.1 earlier). Any
suggestions are welcome.
--
Thanks
I am slightly confused about the --executor-memory setting. My YARN
cluster has a maximum container memory of 8192MB.
When I specify --executor-memory 8G in my spark-shell, no container can
be started at all. It only works when I lower the executor memory to 7G.
But then, on YARN, I see 2
Hi,
I am new to Spark (and almost new to Python!). How can I download and
install a Python library on my cluster so I can just import it later?
Any help would be much appreciated.
Thanks!
I have a working set larger than available memory, so I am hoping to turn
on RDD compression so that I can store more in memory. Strangely, it made no
difference. The number of cached partitions, fraction cached, and size in
memory remain the same. Any ideas?
I confirmed that RDD compression
Have you set the persistence level of the RDD to MEMORY_ONLY_SER (
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)?
If you're calling cache(), the default persistence level is MEMORY_ONLY, so
that setting will have no impact.
On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon)
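A minimal sketch of the suggested combination, assuming spark.rdd.compress is set before the context is created (paths and app name are placeholders):
---
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// spark.rdd.compress only applies to serialized storage levels, which is
// why it has no visible effect on the default MEMORY_ONLY cache().
val conf = new SparkConf()
  .setAppName("compressed-cache")
  .set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

val data = sc.textFile("hdfs:///path/to/input")
data.persist(StorageLevel.MEMORY_ONLY_SER)
---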
Scala by the Bay registration and training is now open!
We are assembling a great two-day program for Scala By the Bay
www.scalabythebay.org
-- the yearly SF Scala developer conference. This year the conference
itself is on August 8-9 in Fort Mason, near the Golden Gate bridge,
with the Scala
Hi Prabeesh / Sean,
I tried both the steps you guys mentioned; it looks like it's not able to
resolve it.
[warn] [NOT FOUND ]
org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645!javax.transaction.orbit
(131ms)
[warn] public: tried
[warn]
Thanks Matei.
Using your pointers I can import data from HDFS. What I want to do now is
something like this in Spark:
---
import myown.mapper
rdd.map(mapper.map)
---
The reason why I want this: myown.mapper is a Java class I already
developed. I used
For standalone and YARN mode, you need to install the native libraries on all
nodes. The best solution is installing them to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3. If your matrix is sparse, the native libraries cannot
help because they are for dense linear algebra. You can create RDD
Thanks, it works now.
-Simon
On Thu, Jun 5, 2014 at 10:47 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:
Have you set the persistence level of the RDD to MEMORY_ONLY_SER (
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)?
If you're calling cache, the default
Hi Andrei,
Thank you for your help! Just to make sure I understand: when I run the
command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged
in to the master node and have my Python files somewhere, is that correct?
Use RDD.mapPartitions to go over all the items in a partition with one Mapper
object. It will look something like this:
rdd.mapPartitions { iterator =>
  val mapper = new myown.Mapper()
  mapper.configure(conf)
  val output = // {{create an OutputCollector that stores stuff in an
  ArrayBuffer}}
In my answer I assumed you run your program with the pyspark command (e.g.
pyspark mymainscript.py; pyspark should be on your path). In this case the
workflow is as follows:
1. You create a SparkConf object that simply contains your app's options.
2. You create a SparkContext, which initializes your
How would I go about creating a new AMI that I can use with the spark-ec2
commands? I can't seem to find any documentation. I'm looking for a
list of steps that I'd need to perform to make an Amazon Linux image ready
to be used by the spark-ec2 tools.
I've been reading through the spark
Hi,
I’m still having trouble running the CassandraTest example from the Spark-1.0.0
binary package. I’ve made a Stackoverflow question for it so you can get some
street cred for helping me :)
http://stackoverflow.com/q/24069039/503826
Thanks!
Tim Kellogg
Sr. Software Engineer, Protocols
Hi All,
Please help me set the executor JVM memory size. I am using the Spark shell,
and it appears that the executors are started with a predefined JVM heap of
512m as soon as the Spark shell starts. How can I change this setting? I tried
setting SPARK_EXECUTOR_MEMORY before launching the Spark shell:
export
Hi Oleg,
I set the size of my executors on a standalone cluster when using the shell
like this:
./bin/spark-shell --master $MASTER --total-executor-cores
$CORES_ACROSS_CLUSTER --driver-java-options
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR
It doesn't seem particularly clean, but it works.
Hi,
I've got a weird question but maybe someone has already dealt with it.
My Spark Streaming application needs to
- download a file from an S3 bucket,
- run a script with the file as input,
- create a DStream from this script's output.
I've already got the second part done with the rdd.pipe() API
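For the third part, a minimal sketch of piping each batch through an external script (the script path and the surrounding stream are placeholders):
---
import org.apache.spark.streaming.dstream.DStream

// Feed every RDD of the stream through an external script: each element is
// written to the script's stdin as one line, and each stdout line becomes
// an element of the resulting DStream.
def pipeThrough(lines: DStream[String]): DStream[String] =
  lines.transform(rdd => rdd.pipe("/path/to/script.sh"))
---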
Thank you, Andrew,
I am using Spark 0.9.1 and tried your approach like this:
bin/spark-shell --driver-java-options
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR
I get
bad option: '--driver-java-options'
There must be something different in my setup. Any ideas?
Thank you again,
Oleg
On 5
Thank you for your quick reply.
As far as I know, the update does not require negative observations, because
the update rule
X_u = (Y^T C^u Y + λI)^(-1) Y^T C^u p(u)
can be simplified by taking advantage of its algebraic structure, so
negative observations are not needed. This is what I think at the
On Thu, Jun 5, 2014 at 10:38 PM, redocpot julien19890...@gmail.com wrote:
can be simplified by taking advantage of its algebraic structure, so
negative observations are not needed. This is what I think at the first time
I read the paper.
Correct, a big part of the reason that is efficient is
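For completeness, the algebraic shortcut being referred to is the decomposition from the Hu, Koren, and Volinsky implicit-feedback paper, in the notation of the update rule above:
Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y
Since C^u = I + diag(α r_u), the matrix C^u - I is zero for every unobserved item, so Y^T Y is computed once and shared across all users, while each user's correction term touches only the items that user actually interacted with; the implicit negative observations never need to be materialized.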
Hi Xu,
As crazy as it might sound, this all makes sense.
There are a few different quantities at play here:
* the heap size of the executor (controlled by --executor-memory)
* the amount of memory Spark requests from YARN (the heap size plus
384 MB to account for fixed memory costs outside of
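That overhead is exactly what makes the 8G request above fail: with an 8192 MB container limit, --executor-memory 8G means asking YARN for 8192 + 384 = 8576 MB, which exceeds the cap, so no container can start, while 7G asks for 7168 + 384 = 7552 MB, which fits.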
Hey Ajay, thanks for reporting this. There was indeed a bug, specifically in
the way join tasks spill to disk (which happened when you had more concurrent
tasks competing for memory). I’ve posted a patch for it here:
https://github.com/apache/spark/pull/986. Feel free to try that if you’d like;
I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume
that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
When this happens, things get extremely slow.
Does this mean that the executor got terminated and restarted?
Is there a way to prevent this from happening
Oh, my apologies, that was for 1.0.
For Spark 0.9 I did it like this:
MASTER=spark://mymaster:7077 SPARK_MEM=8g ./bin/spark-shell -c
$CORES_ACROSS_CLUSTER
The downside of this though is that SPARK_MEM also sets the driver's JVM to
be 8g, rather than just the executors. I think this is the reason
Hi Ajay,
Can you please try running the same code with spark.shuffle.spill=false and
see if the numbers turn out correctly? That parameter controls whether or
not the buggy code that Matei fixed in ExternalAppendOnlyMap is used.
FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I
On a related note, I'd also like to minimize any kind of executor movement.
I.e., once an executor is spawned and data is cached in the executor, I want
that executor to live all the way until the job is finished or the machine
fails in a fatal manner.
What would be the best way to ensure that this is the
Hi Aaron,
When you say that sorting is being worked on, can you elaborate a little
more please?
In particular, I want to sort the items within each partition (not
globally) without necessarily bringing them all into memory at once.
Thanks,
Roger
On Sat, May 31, 2014 at 11:10 PM, Aaron
I think it would be very handy to be able to specify that you want sorting
during a partitioning stage.
On Thu, Jun 5, 2014 at 4:42 PM, Roger Hoover roger.hoo...@gmail.com wrote:
Hi Aaron,
When you say that sorting is being worked on, can you elaborate a little
more please?
In particular, I
Hi Roger,
You should be able to sort within partitions using the rdd.mapPartitions()
method, and that shouldn't require holding all data in memory at once. It
does require holding the entire partition in memory though. Do you need
the partition to never be held in memory all at once?
As far as
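A minimal sketch of that per-partition sort, with the caveat that each partition is materialized in memory (the element type is chosen arbitrarily for illustration):
---
import org.apache.spark.rdd.RDD

// Sort each partition locally; no shuffle, and no global order across partitions.
def sortWithinPartitions(rdd: RDD[Int]): RDD[Int] =
  rdd.mapPartitions(iter => iter.toArray.sorted.iterator)
---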
Sean,
your patch fixes the issue, thank you so much! (This is the second
time within one week that I've run into network libraries not shutting down
threads properly; I'm really glad your code fixes the issue.)
I saw your pull request is closed, but not merged yet. Can I do
anything to get your fix into
just use -Dspark.executor.memory=
Me again,
Things have been going well, actually. I've got my build chain sorted;
1.0.0 and streaming are working reliably. I managed to turn off the INFO
messages by messing with every log4j properties file on the system. :-)
One thing I would like to try now is some natural language processing on
If some tasks have no locality preference, they will also show up as
PROCESS_LOCAL. I think we probably need to name it NO_PREFER to make it
clearer. Not sure if this is your case.
Best Regards,
Raymond Liu
From: coded...@gmail.com [mailto:coded...@gmail.com] On Behalf Of Sung Hwan
Chung
Nice explanation... Thanks!
On Thu, Jun 5, 2014 at 5:50 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Xu,
As crazy as it might sound, this all makes sense.
There are a few different quantities at play here:
* the heap size of the executor (controlled by --executor-memory)
* the amount
Nope, sorry, never mind!
I looked at the source, and it was pretty obvious that it didn't implement
that yet, so I've ripped the classes out and am mutating them into new
receivers right now...
... starting to get the hang of this.
On Fri, Jun 6, 2014 at 1:07 PM, Jeremy Lee
Hello,
I have been using Externalizer from Chill as a serialization wrapper. It
appears to me that Spark has some conflict with the classloader used by
Chill. I have the following program (a simplified version):
import java.io._
import com.twitter.chill.Externalizer
class X(val i: Int) {
Hi,
Here is the problem description: I wrote a custom NetworkReceiver to receive
image data from a camera. I have confirmed that all the data is received
correctly.
1) When data is received, only the NetworkReceiver node runs at full speed,
while the other nodes stay idle; my Spark cluster has 6 nodes.
2) And every