You can use rdd.takeOrdered(1)(reverseOrdering)
reverseOrdering is your Ordering[T] instance where you define the ordering
logic. You have to pass it to the method.
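A minimal sketch of the idea (the Record type and its score field are made up purely for illustration):

import org.apache.spark.rdd.RDD

case class Record(id: String, score: Double)

// an Ordering[Record] that compares records by score
val byScore: Ordering[Record] = Ordering.by[Record, Double](_.score)

// takeOrdered returns the smallest elements under the given Ordering,
// so the reversed Ordering yields the element with the largest score
def maxByScore(rdd: RDD[Record]): Array[Record] =
  rdd.takeOrdered(1)(byScore.reverse)

// equivalently: rdd.top(1)(byScore) returns the largest element under byScore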
On Thu, Apr 24, 2014 at 11:21 AM, Frank Austin Nothaft
fnoth...@berkeley.edu wrote:
If you do this, you could simplify to:
Also same thing can be done using rdd.top(1)(reverseOrdering)
On Thu, Apr 24, 2014 at 11:28 AM, Sourav Chandra
sourav.chan...@livestream.com wrote:
You can use rdd.takeOrdered(1)(reverseOrdering)
reverseOrdering is your Ordering[T] instance where you define the ordering
logic. This you
Thanks Guys !
On Thu, Apr 24, 2014 at 11:29 AM, Sourav Chandra
sourav.chan...@livestream.com wrote:
Also same thing can be done using rdd.top(1)(reverseOrdering)
On Thu, Apr 24, 2014 at 11:28 AM, Sourav Chandra
sourav.chan...@livestream.com wrote:
You can use
thank you, i added setJars, but nothing changes
val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")
  .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
val sc = new SparkContext(conf)
--
try the complete path
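For what it's worth, a sketch of what a complete (absolute) jar path could look like in setJars — the path below is only an assumption, adjust it to wherever sbt package puts your jar:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")
  // absolute path instead of the relative target/... path
  .setJars(Seq("/home/wxhsdp/simple-project/target/scala-2.10/simple-project_2.10-1.0.jar"))
val sc = new SparkContext(conf)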
qinwei
From: wxhsdp  Date: 2014-04-24 14:21  To: user  Subject: Re: how to set spark.executor.memory and heap size
thank you, i added setJars, but nothing changes
val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
Thanks Mayur.
So without Hadoop and any other distributed file systems, by running:
val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count
we can only get parallelization within the computer where the file is
loaded, but not parallelization across the computers in the cluster
(Spark
Prashant Sharma
On Thu, Apr 24, 2014 at 12:15 PM, Carter gyz...@hotmail.com wrote:
Thanks Mayur.
So without Hadoop and any other distributed file systems, by running:
val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count
we can only get parallelization within the computer where
i tried, but no effect
Qin Wei wrote
try the complete path
qinwei
From: wxhsdp  Date: 2014-04-24 14:21  To: user  Subject: Re: how to set spark.executor.memory and heap size
thank you, i added setJars, but nothing changes
val conf = new SparkConf()
Good to know, thanks for pointing this out to me!
On 23/04/2014 19:55, Sandy Ryza wrote:
Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad.
SPARK_YARN_APP_JAR is going away entirely -
https://issues.apache.org/jira/browse/SPARK-1053
On Wed, Apr 23, 2014 at 8:07 AM, Christophe
Thank you very much for your help Prashant.
Sorry I still have another question about your answer: "however if the
file (/home/scalatest.txt) is present on the same path on all systems it
will be processed on all nodes."
When placing the file at the same path on all nodes, do we just simply
copy
It is the same file, and the hadoop library that we use for splitting takes care
of assigning the right split to each node.
Prashant Sharma
On Thu, Apr 24, 2014 at 1:36 PM, Carter gyz...@hotmail.com wrote:
Thank you very much for your help Prashant.
Sorry I still have another question about your
i think maybe it's a problem with reading a local file
val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData = sc.textFile(logFile).cache()
if i replace the above code with
val logData = sc.parallelize(Array(1,2,3,4)).cache()
the job can complete successfully
can't i read a
You need to use the proper url format:
file://home/wxhsdp/spark/example/standalone/README.md
On Thu, Apr 24, 2014 at 1:29 PM, wxhsdp wxh...@gmail.com wrote:
i think maybe it's a problem with reading a local file
val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData =
Sorry wrong format:
file:///home/wxhsdp/spark/example/standalone/README.md
An extra / is needed at the start.
On Thu, Apr 24, 2014 at 1:46 PM, Adnan Yaqoob nsyaq...@gmail.com wrote:
You need to use the proper url format:
file://home/wxhsdp/spark/example/standalone/README.md
On Thu, Apr 24,
thanks for your reply, adnan, i tried
val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
i think there need to be three slashes after file:
it's just the same as val logFile =
home/wxhsdp/spark/example/standalone/README.md
the error remains:(
Hi,
You should be able to read it; file:// or file:/// is not even required for
reading locally, just the path is enough.
what error message are you getting on spark-shell while reading?
for local:
Also read the same from an hdfs file:
put your README file there and read it, it works both ways.
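A small spark-shell sketch of both cases (the paths and the namenode address are assumptions, adjust them to your setup):

// local filesystem: a plain path works from spark-shell
val local = sc.textFile("/home/wxhsdp/spark/example/standalone/README.md")
local.count()

// hdfs: put the file there first, e.g. hadoop fs -put README.md /user/wxhsdp/
// then read it back with an hdfs:// URL
val fromHdfs = sc.textFile("hdfs://localhost:9000/user/wxhsdp/README.md")
fromHdfs.count()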
hi arpit,
on the spark shell, i can read a local file properly,
but when i use sbt run, an error occurs.
the sbt error message is at the beginning of the thread
Arpit Tak-2 wrote
Hi,
You should be able to read it; file:// or file:/// is not even required for
reading locally, just the path is enough.
Thank you very much Prashant.
Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.
It is the same file, and the hadoop library that we use for splitting takes
care of assigning the right
Okk fine,
try like this, i tried and it works..
specify the spark path also in the constructor...
and also
export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g"
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SimpleApp {
def main(args:
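Since the snippet above is cut off, here is a minimal self-contained sketch of that shape; the master URL, spark path and jar path are assumptions to adapt, not the poster's actual values:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    // master, app name, spark path and application jar all passed to the constructor
    val sc = new SparkContext(
      "spark://127.0.0.1:7077",
      "Simple App",
      "/home/wxhsdp/spark",                                    // spark path (assumption)
      Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))    // your packaged jar

    val logData = sc.textFile("/home/wxhsdp/spark/example/standalone/README.md").cache()
    println("lines: " + logData.count())
    sc.stop()
  }
}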
Also try out these examples, all of them work
http://docs.sigmoidanalytics.com/index.php/MLlib
if you spot any problems in those, let us know.
Regards,
arpit
On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
See
Hi All, finally i wrote the following code, which i feel is optimal, if
not the most optimal one.
Using file pointers, seeking to the byte after the last \n, but backwards !!
This is memory efficient, and i suspect even the unix tail implementation does
something similar !!
import
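Since the code itself is truncated above, here is a rough sketch of the approach being described — plain java.io.RandomAccessFile seeking backwards from EOF, as an illustration rather than the poster's actual code:

import java.io.RandomAccessFile

// scan backwards from the end of the file until the previous '\n',
// then read whatever follows it as the last line
def lastLine(path: String): String = {
  val raf = new RandomAccessFile(path, "r")
  try {
    var pos = raf.length() - 1
    // skip a trailing newline, if any
    if (pos >= 0) { raf.seek(pos); if (raf.read() == '\n') pos -= 1 }
    while (pos >= 0) {
      raf.seek(pos)
      if (raf.read() == '\n') return raf.readLine()
      pos -= 1
    }
    raf.seek(0)       // no newline found: the file is a single line
    raf.readLine()
  } finally raf.close()
}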
Hi,
Relatively new on spark and have tried running SparkPi example on a
standalone 12-core, three-machine cluster. What I'm failing to understand is
that running this example with a single slice gives better performance
compared to using 12 slices. Same was the case when I was using
it seems that it's not about settings. i tried the take action, and found it's
ok, but the error occurs when i tried count and collect
val a = sc.textFile("any file")
a.take(n).foreach(println) //ok
a.count() //failed
a.collect() //failed
val b = sc.parallelize(Array(1,2,3,4))
You may try this:
val lastOption = sc.textFile(input).mapPartitions { iterator =>
  if (iterator.isEmpty) {
    iterator
  } else {
    Iterator
      .continually((iterator.next(), iterator.hasNext()))
      .collect { case (value, false) => value }
      .take(1)
  }
}.collect().lastOption
Thanks for the info. It seems like the JTS library is exactly what I
need (I'm not doing any raster processing at this point).
So, once they successfully finish the Scala wrappers for JTS, I would
theoretically be able to use Scala to write a Spark job that includes
the JTS library, and then run
Moreover it seems all the workers are registered and have sufficient memory
(2.7GB whereas I have asked for 512 MB). The UI also shows the jobs are
running on the slaves. But on the terminal it is still the same error:
Initial job has not accepted any resources; check your cluster UI to ensure
that
If I have this code:
val stream1 = doublesInputStream.window(Seconds(10), Seconds(2))
val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))
Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10
second window?
Example, in the first 10 secs stream1 will
Did you build it with SPARK_HIVE=true?
On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru
diplomaticg...@gmail.comwrote:
Hi Matei,
I checked out the git repository and built it. However, I'm still getting
below error. It couldn't find those SQL packages. Please advise.
package
You shouldn't need to set SPARK_HIVE=true unless you want to use the
JavaHiveContext. You should be able to access
org.apache.spark.sql.api.java.JavaSQLContext with the default build.
How are you building your application?
Michael
On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL.
Assuming you've downloaded Spark, just run 'mvn install' to publish Spark
locally, and depend on Spark version 1.0.0-SNAPSHOT.
On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru
diplomaticg...@gmail.comwrote:
It's a simple
Oh, and you'll also need to add a dependency on spark-sql_2.10.
On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust
mich...@databricks.comwrote:
Yeah, you'll need to run `sbt publish-local` to push the jars to your
local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT.
On
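Put together, the sbt side of those suggestions might look roughly like this (a sketch; the %% artifacts resolve to the _2.10 names mentioned above, and mavenLocal is only needed if you published with mvn install rather than sbt publish-local):

// build.sbt (sketch)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-sql"  % "1.0.0-SNAPSHOT"
)

resolvers += Resolver.mavenLocal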
Many thanks for your prompt reply. I'll try your suggestions and will get
back to you.
On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote:
Oh, and you'll also need to add a dependency on spark-sql_2.10.
On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust
RStudio should be fine.
Thanks Cheng !!
On Thu, Apr 24, 2014 at 5:43 PM, Cheng Lian lian.cs@gmail.com wrote:
You may try this:
val lastOption = sc.textFile(input).mapPartitions { iterator =>
if (iterator.isEmpty) {
iterator
} else {
Iterator
.continually((iterator.next(),
./spark-shell: line 153: 17654 Killed
$FWDIR/bin/spark-class org.apache.spark.repl.Main $@
Any ideas?
Did you launch this using our EC2 scripts
(http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set
up the daemons? My guess is that their hostnames are not being resolved
properly on all nodes, so executor processes can’t connect back to your driver
app. This error
The problem is that SparkPi uses Math.random(), which is a synchronized method,
so it can’t scale to multiple cores. In fact it will be slower on multiple
cores due to lock contention. Try another example and you’ll see better
scaling. I think we’ll have to update SparkPi to create a new Random
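A sketch of the kind of change being described — each task gets its own Random instead of sharing the synchronized Math.random (this is just the idea, not the actual SparkPi patch):

import scala.util.Random

val slices = 12
val n = 100000 * slices
val count = sc.parallelize(1 to n, slices).mapPartitions { iter =>
  val rand = new Random()   // one generator per partition, so no shared lock
  iter.map { _ =>
    val x = rand.nextDouble() * 2 - 1
    val y = rand.nextDouble() * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)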
Could you share the command you used and more of the error message?
Also, is it an MLlib specific problem? -Xiangrui
On Thu, Apr 24, 2014 at 11:49 AM, John King
usedforprinting...@gmail.com wrote:
./spark-shell: line 153: 17654 Killed
$FWDIR/bin/spark-class org.apache.spark.repl.Main $@
Any
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On Thu, Apr 24, 2014 at 11:38 AM, John King
usedforprinting...@gmail.com wrote:
I receive this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
Last command was:
val model = new NaiveBayes().run(points)
On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng men...@gmail.com wrote:
Could you share the command you used and more of the error message?
Also, is it an MLlib specific problem? -Xiangrui
On Thu, Apr 24, 2014 at 11:49 AM, John King
Yes, I got it running for large RDD (~7 million lines) and mapping. Just
received this error when trying to classify.
On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng men...@gmail.com wrote:
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On
This happens to me when using the EC2 scripts for v1.0.0rc2 recent release.
The Master connects and then disconnects immediately, eventually saying
Master disconnected from cluster.
On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Did you launch this using our EC2
It worked!! Many thanks for your brilliant support.
On 24 April 2014 18:20, diplomatic Guru diplomaticg...@gmail.com wrote:
Many thanks for your prompt reply. I'll try your suggestions and will get
back to you.
On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote:
Oh,
Thanks Xiangrui, Matei and Arpit. It does work fine after adding
Vector.dense. I have a follow up question, I will post on a new thread.
On Thu, Apr 24, 2014 at 2:49 AM, Arpit Tak arpi...@sigmoidanalytics.comwrote:
Also try out these examples, all of them works
Folks,
I am wondering how mllib interacts with jblas and lapack. Does it make
copies of data from my RDD format to jblas's format? Does jblas copy it
again before passing to lapack native code?
I also saw some comparisons with VW and it seems mllib is slower on a
single node but scales better and
I tried locally with the example described in the latest guide:
http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine.
Do you mind sharing the code you used? -Xiangrui
On Thu, Apr 24, 2014 at 1:57 PM, John King usedforprinting...@gmail.com wrote:
Yes, I got it running for large
Do you mind sharing more code and error messages? The information you
provided is too little to identify the problem. -Xiangrui
On Thu, Apr 24, 2014 at 1:55 PM, John King usedforprinting...@gmail.com wrote:
Last command was:
val model = new NaiveBayes().run(points)
On Thu, Apr 24, 2014 at
I was able to run simple examples as well.
Which version of Spark? Did you use the most recent commit or from
branch-1.0?
Some background: I tried to build both on Amazon EC2, but the master kept
disconnecting from the client and executors failed after connecting. So I
tried to just use one
In the other thread I had an issue with Python. In this issue, I tried
switching to Scala. The code is:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.classification.NaiveBayes
import
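For reference, a minimal sketch of how those pieces typically fit together in the shell (a tiny made-up data set with non-negative features, not the poster's data):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// toy training data, purely for illustration
val points = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))))

val model = NaiveBayes.train(points, lambda = 1.0)
println(model.predict(Vectors.dense(0.0, 1.0, 1.0)))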
Also when will the official 1.0 be released?
On Thu, Apr 24, 2014 at 7:04 PM, John King usedforprinting...@gmail.comwrote:
I was able to run simple examples as well.
Which version of Spark? Did you use the most recent commit or from
branch-1.0?
Some background: I tried to build both on
I don't see anything wrong with your code. Could you do points.count()
to see how many training examples you have? Also, make sure you don't
have negative feature values. The error message you sent did not say
NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui
On Thu, Apr 24, 2014
It just displayed this error and stopped on its own. Do the lines of code
mentioned in the error have anything to do with it?
On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng men...@gmail.com wrote:
I don't see anything wrong with your code. Could you do points.count()
to see how many training
does anyone know the reason? i've googled a bit, and found some guys had the same
problem, but with no replies...
i noticed that the error occurs
at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at
Try running sbt/sbt clean and re-compiling. Any luck?
On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cnwrote:
an exception occurred when compiling spark 0.9.1 using sbt; env: hadoop 2.3
1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
2.found Exception:
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how
to do it. Look at the stderr file of the executor on that machine, and you’ll
see lines like this:
14/04/24 19:17:24 INFO HadoopRDD: Input split:
file:/Users/matei/workspace/apache-spark/README.md:0+2000
This says
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
this line is too slow. There are about 2 million elements in word_mapping.
Is there a good style for writing a large collection to hdfs?
import org.apache.spark._
import SparkContext._
import
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see
http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably
faster.
Matei
On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote:
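A sketch of how that setting is typically applied through SparkConf (property name as in the tuning guide linked above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("word mapping")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)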
Hi All,
I have a problem with the Item-Based Collaborative Filtering Recommendation
Algorithms in spark.
The basic flow is as below:
         (Item1, (User1, Score1))
RDD1 ==> (Item2, (User2, Score2))
Kryo fails with the exception below:
com.esotericsoftware.kryo.KryoException
(com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1)
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
Thanks for the reply. It indeed increased the usage. There was another issue
we found: we were broadcasting the hadoop configuration by writing a wrapper
class over it. But we found the proper way in the Spark code:
sc.broadcast(new SerializableWritable(conf))
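A minimal sketch of that pattern, assuming an existing SparkContext named sc:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SerializableWritable

// Hadoop's Configuration is not serializable on its own, so wrap it before broadcasting
val hadoopConf = new Configuration()
val confBroadcast = sc.broadcast(new SerializableWritable(hadoopConf))

// inside a task, unwrap it again:
// val conf = confBroadcast.value.value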
I only see one risk: if your feature indices are not sorted, it might
have undefined behavior. Other than that, I don't see any thing
suspicious. -Xiangrui
On Thu, Apr 24, 2014 at 4:56 PM, John King usedforprinting...@gmail.com wrote:
It just displayed this error and stopped on its own. Do the
Hi
I am also curious about this question.
Is the textFile function supposed to read an hdfs file? In this case
the file was read from the local filesystem. Is there any
way to tell the local filesystem apart from hdfs in the
textFile function?
Besides, the OOM