A year ago, a customer asked us to implement Naive Bayes that could at
least train on news20, and we implemented it for them in Hadoop,
using the distributed cache to store the model.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
Not sure if this is always ideal for Naive Bayes, but you could also hash the
features into a lower-dimensional space (e.g. reduce it to 50,000 features).
For each feature, simply take MurmurHash3(featureID) % 50,000, for example.
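For reference, a minimal Scala sketch of this hashing trick (numBuckets and hashFeature are illustrative names, not part of any Spark API):

import scala.util.hashing.MurmurHash3

val numBuckets = 50000

def hashFeature(featureId: String): Int = {
  val h = MurmurHash3.stringHash(featureId)
  // fold into a non-negative bucket index in [0, numBuckets)
  ((h % numBuckets) + numBuckets) % numBuckets
}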
Matei
On Apr 27, 2014, at 11:24 PM, DB Tsai dbt...@stanford.edu
Hi All,
I want to store a CSV text file in Parquet format in HDFS and then do some
processing in Spark.
Somehow my search for a way to do this was futile. More help was available
for Parquet with Impala.
Any guidance here? Thanks!
Hi
I am running a simple word count program on a Spark standalone cluster.
The cluster is made up of 6 nodes; each runs 4 workers, and each worker owns 10 GB of
memory and 16 cores, for a total of 96 cores and 240 GB of memory. (Well, it also used to be
configured as 1 worker with 40 GB of memory on each node.)
On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung
coded...@cs.stanford.eduwrote:
e.g. something like
rdd.mapPartitions((rows: Iterator[String]) => {
  var idx = 0
  rows.map((row: String) => {
    val valueMap = SparkWorker.getMemoryContent(valMap)
    val prevVal = valueMap(idx)
    idx += 1
Actually, I do not know how to do something like this, or whether it is
possible at all, hence my suggestive phrasing.
Can you already declare persistent in-memory objects per worker? I tried
something like constructing a singleton object within map functions, but
that didn't work as it seemed to
Hi all,
I got the Spark 1.0 snapshot code from git, and I compiled it using the command:
mvn -Pbigtop-dist -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests
package -e
On the cluster, I added [export SPARK_YARN_MODE=true] to spark-env.sh and ran the
HdfsTest example,
and I got an error. Has anyone got
Thank you for your help, Sourav.
I found the broadcast_0 binary file in the /tmp directory. Its size is 33.4 KB, not
equal to the estimated size of 135.6 KB.
I opened it and found that its content has no relation to the file I read in. I
guess broadcast_0 is a config
file about Spark; is that right?
I am seeing the following exception from a very basic test project when it
runs on spark local.
java.lang.NoSuchMethodError:
org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2;
The project is built with Java 1.6, Scala 2.10.3, and Spark 0.9.1.
Apart from user-defined broadcast variables, there are others being
created by Spark itself. This could be one of those.
As I mentioned, you can write a small program where you create a broadcast
variable. Check the broadcast variable's id (say it's x), then go to /tmp
and open broadcast_x.
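A small sketch of that check (assumes an existing SparkContext `sc`; the map contents are placeholders):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
println(s"broadcast id = ${lookup.id}")   // if this prints 3, inspect /tmp/broadcast_3 on a worker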
Hi,
I am trying to recompile SIMR with Spark 0.9.1, but it fails on an incompatible
method:
[error]
/home/lukas/src/simr/src/main/scala/org/apache/spark/simr/RelayServer.scala:213:
not enough arguments for method createActorSystem: (name: String, host:
String, port: Int, indestructible: Boolean, conf:
Hi Spark-users,
Within my Spark Streaming program, I am able to ingest data sent by my Flume
Avro client. I configured a 'spooling directory source' to write data to a
Flume Avro sink (the Spark Streaming driver program in this case). The default
deserializer, i.e. LINE, is used to parse the
That is quite mysterious, and I do not think we have enough information to
answer. JavaPairRDD<String, Tuple2>.lookup() works fine on a remote Spark
cluster:
$ MASTER=spark://localhost:7077 bin/spark-shell
scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0
until 10, 3).map(x
Good question! I am also new to the JVM and would appreciate some tips.
On Sun, Apr 27, 2014 at 5:19 AM, wxhsdp wxh...@gmail.com wrote:
Hi all,
I have some questions about debugging in Spark:
1) When the application finishes, the application UI is shut down and I cannot see
the details about the app,
Thanks for your answer.
I tried running on a single machine - master and worker on one host. I
get exactly the same results.
Very little CPU activity on the machine in question. The web UI shows a
single task and its state is RUNNING. It will remain so indefinitely.
I have a single partition,
I recall asking about this, and I think Matei suggested it was, but is the
scheduler thread-safe?
I am running MLlib algorithms as futures in the same driver, using the same
dataset as input, and this error
14/04/28 08:29:48 ERROR TaskSchedulerImpl: Exception in statusUpdate
A mutable map in an object should do what you're looking for then, I believe.
You just reference the object in your closure, so it won't be
swept up when your closure is serialized, and you can then reference the
object's variables on the remote host. e.g.:
object MyObject {
val mmap =
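A fuller, hedged sketch of that fragment: the mutable map lives in a singleton object, so each executor JVM holds its own copy and the closure only captures a reference to the object. The usage line and key/value types are illustrative.

import scala.collection.mutable

object MyObject {
  val mmap = mutable.Map.empty[Int, Double]
}

// Hypothetical usage (assumes rdd: RDD[Int]); the map is read and updated
// on whichever worker runs the task, not on the driver:
// rdd.map(k => MyObject.mmap.getOrElseUpdate(k, 0.0))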
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is
large and doesn't change. I think this is a very good solution to what is
being asked for.
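A hedged sketch of this zipping approach: keep per-example losses in their own RDD[Double] and zip them with the unchanging base data each iteration. zip requires both RDDs to have the same partitioning and element counts. computeLoss and the element type are illustrative, not from this thread.

import org.apache.spark.rdd.RDD

// Placeholder per-example update rule; stands in for whatever model is used.
def computeLoss(example: Array[Double], oldLoss: Double): Double =
  0.9 * oldLoss + 0.1 * example.sum

def updateLosses(baseData: RDD[Array[Double]], losses: RDD[Double]): RDD[Double] =
  baseData.zip(losses).map { case (ex, l) => computeLoss(ex, l) }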
On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote:
A mutable map in an object should do what your
Could this be related to the size of the lookup result ?
I tried to recreate a similar scenario in the Spark shell, which causes
an exception:
scala> val rdd =
org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 4,
3).map(x => ( ( 0,52fb9b1a3004f07d1a87c8f3 ),
One thing I have used this for was to create codebooks for SIFT features in
images. It is a common, though fairly naïve, method for converting high-dimensional
features into simple word-like features. Thus, if you have 200
SIFT features for an image, you can reduce that to 200 ‘words’ that
I'm not sure what I said came through. RDD zip is not hacky at all, as it
only depends on the user not changing the partitioning. Basically, you would
keep your losses as an RDD[Double] and zip those with the RDD of examples,
and update the losses. You're doing a copy (and GC) on the RDD of
Try turning on the Kryo serializer as described at
http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions
in the driver program’s log before this happens?
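A minimal sketch of enabling Kryo as the tuning guide describes; the app name and registrator class are placeholders for your own:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kmeans-test")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // optional custom registrator
val sc = new SparkContext(conf)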
Matei
On Apr 28, 2014, at 9:19 AM, Buttler, David buttl...@llnl.gov wrote:
Hi,
I am trying to run the K-means
Right: they are zipped at each iteration.
On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen chesterxgc...@yahoo.comwrote:
Tom,
Are you suggesting two RDDs, one with the loss and another for the rest of the
info, using zip to tie them together, but doing the update on the loss RDD (copy)?
Chester
Sent from
Hi everyone. I'm trying to run some of the Spark example code, and most of
it appears to be undocumented (unless I'm missing something). Can someone
help me out?
I'm particularly interested in running SparkALS, which wants parameters:
M U F iter slices
What are these variables? They appear to
Ian, I tried playing with your suggestion, but I get a 'task not
serializable' error (and some obvious things didn't fix it). Can you get
that working?
On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote:
As to your last line: I've used RDD zipping to avoid GC since
That might be a good alternative to what we are looking for. But I wonder
if this would be as efficient as we want. For instance, will RDDs of the
same size usually get partitioned to the same machines, thus not
triggering any cross-machine alignment, etc.? We'll explore it, but I would
still
Hi Diana,
SparkALS is an example implementation of ALS. It doesn't call the ALS
algorithm implemented in MLlib. M, U, and F are used to generate
synthetic data.
I'm updating the examples. In the meantime, you can take a look at the
updated MLlib guide:
Yeah you'd have to look at the source code in this case; it's not
explained in the scaladoc or usage message as far as I can see either.
The args refer specifically to the example of recommending Movies to
Users. This example makes up a bunch of ratings and then makes
recommendations using ALS.
M
If you create your auxiliary RDD as a map from the examples, the
partitioning will be inherited.
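A tiny sketch of that point: deriving the auxiliary RDD via map keeps the parent's partition layout, so a later zip stays aligned. Assumes an existing SparkContext `sc`; the names and numbers are illustrative.

val examples = sc.parallelize(1 to 100, numSlices = 4)
val losses   = examples.map(_ => 0.0)   // inherits examples' 4 partitions
val zipped   = examples.zip(losses)     // safe: identical partition layout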
On Mon, Apr 28, 2014 at 12:38 PM, Sung Hwan Chung
coded...@cs.stanford.eduwrote:
That might be a good alternative to what we are looking for. But I wonder
if this would be as efficient as we want
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1
One thing which is undocumented: the integers representing users and
items have to be positive. Otherwise it throws exceptions.
Li
On 28 avr. 2014, at 10:30, Diana Carroll dcarr...@cloudera.com wrote:
Hi everyone.
Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS, which
is not in the MLlib examples and does not take any file parameters.
I don't see the class you refer to in the examples... However, if I did
want to run that example, where would I find the file in question?
It would be
Warning: noob question.
The sc.textFile(URI) method seems to support reading files in parallel, but
you have to supply some wildcard URI, which greatly limits how the storage can be
structured. Is there a simple way to pass in a list of URIs, or is that an exercise
left for the student?
You might also investigate other clustering algorithms, such as canopy
clustering and nearest neighbors. Some of them are less accurate, but more
computationally efficient. Often they are used to compute approximate
clusters followed by k-means (or a variant thereof) for greater accuracy.
dean
Joe,
Do you have your SPARK_HOME variable set correctly in the spark-env.sh script?
I was getting that error when I was first setting up my cluster; it turned out I
had to make some changes in the spark-env script to get things working
correctly.
Ben
-Original Message-
From: Joe L
Don't use SparkALS... that's the first version of the code and does not
scale...
Li is right... you have to do the dictionary generation on users and products
and then generate the indexed file... I wrote some utilities, but it looks like it
is application dependent... the indexed Netflix format is more
Should I file a JIRA to remove the example? I think it is confusing to
include example code without explanation of how to run it, and it sounds
like this one isn't worth running or reviewing anyway.
On Mon, Apr 28, 2014 at 2:34 PM, Debasish Das debasish.da...@gmail.comwrote:
Don't use
Hey Jim,
This IOException thing is a general issue that we need to fix, and your
observation is spot-on. There is actually a JIRA for it that I created a
few days ago:
https://issues.apache.org/jira/browse/SPARK-1579
Aaron is assigned to that one but not actively working on it, so we'd
welcome a
After upgrading to Spark 0.9.1, sbt assembly is failing. I'm trying to fix
it with merge strategy, etc., but is anyone else seeing this?
For example,
So, good news and bad news. I have a customized Build.scala
http://apache-spark-user-list.1001560.n3.nabble.com/libraryDependencies-configuration-is-different-for-sbt-assembly-vs-sbt-run-tt565.html#a1542
that allows me to use the 'run' and 'assembly' commands in sbt without
toggling the
You can get the internal AvroFlumeEvent inside the SparkFlumeEvent using
SparkFlumeEvent.event. That should probably give you all the original text
data.
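A hedged sketch of that suggestion: unwrap each SparkFlumeEvent to the underlying AvroFlumeEvent and decode its body bytes as text. Assumes a DStream `flumeStream` created with FlumeUtils.createStream(...).

val lines = flumeStream.map { sparkEvent =>
  val avroEvent = sparkEvent.event          // the internal AvroFlumeEvent
  val body = avroEvent.getBody              // java.nio.ByteBuffer with the raw payload
  val bytes = new Array[Byte](body.remaining())
  body.get(bytes)
  new String(bytes, "UTF-8")                // back to the original line of text
}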
On Mon, Apr 28, 2014 at 5:46 AM, Kulkarni, Vikram vikram.kulka...@hp.comwrote:
Hi Spark-users,
Within my Spark Streaming program, I am
Um. When I updated the Spark dependency, I unintentionally deleted the
'provided' attribute. Oops. Nothing to see here . . .
Matei, thank you. That seemed to work, but I'm not able to import a class
from my jar.
Using the verbose option, I can see that my jar should be included:
Parsed arguments:
...
jars
/Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar
And I see the class I want to load in
A couple of issues:
1) The jar doesn't show up on the classpath even though SparkSubmit had it
in the --jars option. I tested this by running :cp in spark-shell.
2) After adding it to the classpath using (:cp
/Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar), it
still fails.
For the second question, you can submit multiple jobs through the same
SparkContext via different threads and this is a supported way of
interacting with Spark.
From the documentation:
Second, *within* each Spark application, multiple “jobs” (Spark actions)
may be running concurrently if they
You need to import org.apache.spark.rdd.RDD to use RDD:
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
Here are some examples you can learn from:
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
SK wrote
I am a new user of
Thanks for your reply, Daniel.
What do you mean by "the logs contain everything to reconstruct the same
data"?
I have also spent time looking into the logs, but only got a little out of them.
As far as I can see, they log the flow of running the application, but there are no more
details about
each task; for example, see the
It would be useful to have some way to open multiple files at once into a
single RDD (e.g. sc.textFile(iterable_over_uris)). Logically, it would be
equivalent to opening a single file which is made by concatenating the
various files together. This would only be useful, of course, if the source
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do
comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited
from FileInputFormat in Hadoop.
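A quick sketch of both forms mentioned above (bucket and file names are placeholders); both are handled by Hadoop's FileInputFormat path parsing:

val wildcarded = sc.textFile("s3n://my-bucket/logs/part-*")
val commaList  = sc.textFile("s3n://my-bucket/a.txt,s3n://my-bucket/b.txt")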
Matei
On Apr 28, 2014, at 6:05 PM, Andrew Ash and...@andrewash.com wrote:
This is already possible with the
Hi,
I am a new user of Spark. I have a class that defines a function as follows.
It returns a tuple: (Int, Int, Int).
class Sim extends VectorSim {
  override def input(master: String): (Int, Int, Int) = {
    sc = new SparkContext(master, "Test")
    val ratings =
In classic MapReduce/Hadoop, you may optionally define setup() and cleanup()
methods.
They (setup() and cleanup()) are called for each task, so if you have 20
mappers running, setup/cleanup will be called for each one.
What is the equivalent of these in Spark?
Thanks,
best regards,
What if you run ./bin/spark-shell
--driver-class-path=/path/to/your/jar.jar?
I think either this or the --jars flag should work, but it's possible there
is a bug with the --jars flag when starting the REPL.
On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover roger.hoo...@gmail.comwrote:
A
I don't think there is a setup() or cleanup() in Spark, but you can usually
achieve the same thing using mapPartitions, with the setup code at the top
of the mapPartitions function and the cleanup at the end.
The reason this usually works is that in Hadoop map/reduce, each map
task runs over an input split.
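A hedged sketch of that mapPartitions setup/cleanup pattern; the Connection class is a hypothetical stand-in for a per-partition resource such as a DB or file handle.

import org.apache.spark.rdd.RDD

// Hypothetical per-partition resource.
class Connection {
  def lookup(s: String): String = s.toUpperCase
  def close(): Unit = ()
}

def withSetupAndCleanup(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { rows =>
    val conn = new Connection()                  // "setup": once per partition/task
    val results = rows.map(conn.lookup).toList   // materialize before cleanup runs
    conn.close()                                 // "cleanup": after the partition is processed
    results.iterator
  }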
Not that I know of. We were discussing it on another thread and it came up.
I think if you look up the Hadoop FileInputFormat API (which Spark uses)
you'll see it mentioned there in the docs.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
But that's not
The problem is that this object can't be Serializable; it holds an RDD field and
a SparkContext. But Spark shows an error saying that it needs serialization.
The order of my debug output is really strange.
~
Training Start!
Round 0
Hehe?
Hehe?
started?
failed?
Round 1
Hehe?
~
Here is my code:
69
Thank you very much Ameet!
Can you please point me to an example?
Best,
Mahmoud
Sent from my iPhone
On Apr 28, 2014, at 6:32 PM, Ameet Kini
ameetk...@gmail.commailto:ameetk...@gmail.com wrote:
I don't think there is a setup() or cleanup() in Spark but you can usually
achieve the same using
In general, as Andrew points out, it's possible to submit jobs from
multiple threads, and many Spark applications do this.
One thing to check out is the job server from Ooyala; this is an
application on top of Spark that has an automated submission API:
https://github.com/ooyala/spark-jobserver
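A hedged sketch of the simpler in-process approach: multiple threads sharing one SparkContext, here via Scala Futures. Assumes an existing `sc`; the dataset and actions are illustrative.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val data = sc.parallelize(1 to 1000000, 8)

val sumJob   = Future { data.map(_.toLong).reduce(_ + _) }   // job 1
val countJob = Future { data.filter(_ % 2 == 0).count() }    // job 2, runs concurrently

val total = Await.result(sumJob, 10.minutes)
val evens = Await.result(countJob, 10.minutes)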
Patrick,
Thank you for replying. That didn't seem to work either. I see the option
parsed using verbose mode:
Parsed arguments:
...
driverExtraClassPath
/Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar
But the jar still doesn't show up if I run :cp in the REPL, and
I've moved the SparkContext and RDD to be parameters of train. And now it tells me
that the SparkContext needs to be serialized!
I think the problem is that the RDD is trying to be lazy, and some
broadcast objects need to be generated dynamically, so the closure has the
SparkContext inside, so the task complete
Hi Patrick,
I am just doing a simple word count; the data is generated by the Hadoop
random text writer.
This doesn't seem to me to be related to compression. If I turn off compression
on shuffle, the metrics are something like the below for the smaller 240 MB dataset.
Executor ID Address
Your code is unformatted. Can you paste the whole file in a gist, and I can take
a look for you?
On Apr 28, 2014 10:42 PM, Earthson earthson...@gmail.com wrote:
I've moved SparkContext and RDD as parameter of train. And now it tells me
that SparkContext need to serialize!
I think the the problem is