Doing a quick Google search, it appears that a number of people have
implemented algorithms for solving systems of (sparse) linear equations on
Hadoop MapReduce.
However, I can find no such thing for Spark.
Does anyone have information on whether there are attempts to create such solvers for Spark?
Hi,
I'm using a cluster with 5 nodes, each with 8 cores and 10 GB of RAM.
Basically, I'm creating a dictionary from text, i.e. giving each word that
occurs more than n times across all texts a unique identifier.
The essential part of the code looks like this:
var texts = ctx.sql(SELECT text FROM
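Stripped of the SQL part, the dictionary-building logic described above is roughly the following, as a plain-Scala sketch without Spark (the function name and the threshold parameter n are made up for illustration):

```scala
// Count word occurrences across all texts, keep only words occurring more
// than n times, and assign each surviving word a unique integer identifier.
def buildDictionary(texts: Seq[String], n: Int): Map[String, Long] = {
  texts
    .flatMap(_.split("\\s+"))            // split each text into words
    .groupBy(identity)                    // word -> all its occurrences
    .collect { case (word, occs) if occs.size > n => word }
    .toSeq
    .sorted                               // deterministic id assignment
    .zipWithIndex
    .map { case (word, id) => word -> id.toLong }
    .toMap
}
```

On an RDD the same shape would be a map to (word, 1), a reduceByKey, a filter, and a zipWithIndex.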
With a lower number of partitions, I keep losing executors during
collect at KMeans.scala:283
The error message is ExecutorLostFailure (executor lost).
The program recovers by automatically repartitioning the whole dataset
(126G), which takes a very long time and seems only to delay the inevitable.
Right now, I have issues even at a far earlier point.
I'm fetching data from a registered table via
var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT
2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
// persisted because it's used again
When trying to use KMeans.train with some large data and 5 worker nodes, it
would fail due to BlockManagers shutting down because of timeouts. I was able
to prevent that by adding
spark.storage.blockManagerSlaveTimeoutMs 300
to the spark-defaults.conf.
However, with 1 Million feature vectors,
If I'm not misunderstanding you, setting event logging in SPARK_JAVA_OPTS
should achieve what you want. I'm logging to HDFS, but according to the
config page http://spark.apache.org/docs/latest/configuration.html a local
folder should be possible as well.
Example with all other settings
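The example was cut off in the archive; the relevant properties (as documented on the linked config page; the HDFS path here is a made-up placeholder) can be set either as -D flags in SPARK_JAVA_OPTS or in spark-defaults.conf:

```
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://namenode:8021/spark-events
```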
I'm trying to apply KMeans training to some text data, which consists of
lines that each contain something between 3 and 20 words. For that purpose,
all unique words are saved in a dictionary. This dictionary can become very
large as no hashing etc. is done, but it should spill to disk in case it
This should work:
jobs.saveAsTextFile("file:////home/hysom/testing")
Note the 4 slashes, it's really 3 slashes + absolute path.
This should be mentioned in the docs, though; I only remember it from having
seen it somewhere else.
The output folder, here testing, will be created and must therefore not
already exist.
I recently moved my Spark installation from one Linux user to another one,
i.e. changed the folder and ownership of the files. That was everything, no
other settings were changed or different machines used.
However, it now suddenly takes three minutes until all executors show up in
the Spark shell.
Not all memory can be used for Java heap space, so maybe it does run out.
Could you try repartitioning the data? To my knowledge you shouldn't be
thrown out as long as a single partition fits into memory, even if the whole
dataset does not.
To do that, exchange
val train = parsedData.cache()
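The replacement was cut off in the archive; presumably the suggestion amounts to repartitioning before caching, so that each partition fits into executor memory (the partition count here is a made-up example, not a recommendation):

```scala
val train = parsedData.repartition(200).cache()
```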
Update: I can get it to work by disabling iptables temporarily. However, I
cannot figure out on which ports I have to accept traffic. 4040 and any of
the Master or Worker ports mentioned in the previous post don't work.
Can it be one of the randomly assigned ones in the 30k to 60k range? Those
in your cluster.
-Andrew 2014-08-06 10:23 GMT-07:00 durin <[hidden email]>:
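If the goal is a fixed set of ports to open in iptables, the otherwise randomly assigned ones can be pinned in spark-defaults.conf; a sketch (property names as of Spark 1.1, some only became configurable in that release; the port numbers are arbitrary examples):

```
spark.driver.port          51000
spark.fileserver.port      51100
spark.broadcast.port       51200
spark.replClassServer.port 51300
spark.blockManager.port    51400
spark.executor.port        51500
```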
I am using the latest Spark master and additionally, I am loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:
import org.apache.spark.streaming._
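The rest of the listing was truncated in the archive; a minimal program along those lines (TwitterUtils API as of Spark 1.x, with the shell's sc assumed, and credentials coming from twitter4j system properties) might look like:

```scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._

// 10-second batches on top of the existing shell SparkContext
val ssc = new StreamingContext(sc, Seconds(10))

// None: fall back to the default twitter4j OAuth authorization
val stream = TwitterUtils.createStream(ssc, None)

stream.map(_.getText).print()
ssc.start()
```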
Using 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j
) changes the error to
Exception in thread "Thread-55" java.lang.NoClassDefFoundError:
twitter4j/StatusListener
at
org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)
In the WebUI Environment tab, the section Classpath Entries lists the
following ones as part of System Classpath:
/foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop
/foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar
/foo/spark-master-2014-07-28/conf
Hi Tathagata,
I didn't mean to say this was an error. According to the other thread I
linked, right now there shouldn't be any conflicts, so I wanted to use
streaming in the shell for easy testing.
I thought I had to create my own project in which I'd add streaming as a
dependency, but if I can
Development is really rapid here, that's a great thing.
Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?
Hi Xiangru,
thanks for the explanation.
1. You said we have to broadcast m * k centers (with m = number of rows). I
thought there were only k centers at a time, which would then have a size of
n * k and would need to be broadcast. Is that a typo, or did I misunderstand
something?
And the
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes in a reasonable time. I guess using torrent helps a lot in this
case.
Best regards,
Simon
As a source, I have a text file with n rows that each contain m
comma-separated integers.
Each row is then converted into a feature vector with m features.
I've noticed that, given the same total file size and number of features, a
larger number of columns is much more expensive for training
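The per-row conversion described above can be sketched in plain Scala (Spark's Vectors.dense would then wrap the resulting array; the function name is made up):

```scala
// Parse one CSV row of m integers into a dense feature array.
def parseRow(line: String): Array[Double] =
  line.split(",").map(_.trim.toDouble)
```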
I'm using Spark 1.0.0 (a three-week-old build of the latest master).
Along the lines of this tutorial
http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html
, I want to read some tweets from twitter.
When trying to execute in the Spark-Shell, I get
The tutorial
Thanks. Can I see somewhere in the API docs that a class is not available in
the shell, or do I have to find out by trial and error?
Thanks, setting the number of partitions to the number of executors helped a
lot and training with 20k entries got a lot faster.
However, when I tried training with 1M entries, after about 45 minutes of
calculations, I get this:
It's stuck at this point. The CPU load for the master is at 100%
Hi,
I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
clustering with Strings.
My code works great when I use a five-figure amount of training elements.
However, with, for example, 2 million elements, it gets extremely slow: a
single stage may take up to 30 minutes.
From
Hi,
in many SQL DBMSs like MySQL, you can set an offset for the LIMIT clause,
such that LIMIT 5, 10 will return 10 rows, starting from row 5.
As far as I can see, this is not possible in Spark SQL.
The best solution I have to imitate that (using Scala) is converting the RDD
into an Array via
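The tail of the post was cut off; the same offset-plus-count behavior can be sketched in plain Scala (the function name is made up; on an RDD the equivalent would be zipWithIndex followed by a filter on the index):

```scala
// Imitate SQL's "LIMIT offset, count": skip `offset` rows, return `count` rows.
def limitWithOffset[A](rows: Seq[A], offset: Int, count: Int): Seq[A] =
  rows.zipWithIndex
    .collect { case (row, i) if i >= offset && i < offset + count => row }
```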
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23).
I'm trying to execute the following code:
import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val table =
sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
(Thread.java:662) Driver stacktrace:
Is the only possible reason that some of these 4.3 Million JSON-Objects are
not valid JSON, or could there be another explanation?
And if it is the reason, is there some way to tell the function to just skip
faulty lines?
Thanks,
Durin
Hi Yin and Aaron,
thanks for your help, this was indeed the problem. I've counted 1233 blank
lines using grep, and the code snippet below works with those.
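The snippet referred to here didn't make it into the archive; the usual shape of the workaround is to pre-filter the blank lines and use jsonRDD instead of jsonFile (the path is taken from the earlier post):

```scala
val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
val table = sqlContext.jsonRDD(raw.filter(_.trim.nonEmpty))
```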
From what you said, I guess that skipping faulty lines will be possible in
later versions?
Kind regards,
Simon