Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread durin
Doing a quick Google search, it appears to me that there are a number of people who have implemented algorithms for solving systems of (sparse) linear equations on Hadoop MapReduce. However, I can find no such thing for Spark. Does anyone have information on whether there are attempts at creating such

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-08-28 Thread durin
Hi, I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM. Basically, I'm creating a dictionary from text, i.e. giving each word that occurs more than n times across all texts a unique identifier. The essential part of the code looks like this: var texts = ctx.sql(SELECT text FROM
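For context, a minimal sketch of the dictionary-building step this (truncated) snippet describes, assuming a SQLContext named ctx, a registered table tweetTable with a text column, and a threshold n; those names and the threshold are illustrative, not values from the original post.

    // Hypothetical sketch: build a word -> unique-id dictionary from a text column.
    // Table name, column name and threshold are assumptions.
    val n = 10
    val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)
    val dictionary = texts
      .flatMap(_.split("\\s+"))                     // tokenize on whitespace
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                           // count occurrences over all texts
      .filter { case (_, count) => count > n }      // keep words occurring more than n times
      .keys
      .zipWithIndex()                               // assign each remaining word a unique id
      .collectAsMap()                               // collecting a very large map can itself exhaust driver memory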

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
With a lower number of partitions, I keep losing executors during collect at KMeans.scala:283. The error message is ExecutorLostFailure (executor lost). The program recovers by automatically repartitioning the whole dataset (126G), which takes a very long time and seems to only delay the inevitable

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
Right now, I have issues at an even earlier point. I'm fetching data from a registered table via var texts = ctx.sql(SELECT text FROM tweetTrainTable LIMIT 2000).map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) //persisted because it's used again

Only master is really busy at KMeans training

2014-08-19 Thread durin
When trying to use KMeans.train with some large data and 5 worker nodes, it would fail due to BlockManagers shutting down because of timeouts. I was able to prevent that by adding spark.storage.blockManagerSlaveTimeoutMs 300 to the spark-defaults.conf. However, with 1 million feature vectors,
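For reference, a hedged sketch of setting that timeout programmatically rather than in spark-defaults.conf; the value below is purely illustrative, since the value in the post above is cut off.

    // Illustrative only: raise the BlockManager slave timeout via SparkConf.
    // The same property can equally be set as a line in conf/spark-defaults.conf.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("KMeansTraining")
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")   // example value: 5 minutes
    val sc = new SparkContext(conf)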

Re: Spark webUI - application details page

2014-08-14 Thread durin
If I'm not misunderstanding you, setting up event logging in SPARK_JAVA_OPTS should achieve what you want. I'm logging to HDFS, but according to the config page http://spark.apache.org/docs/latest/configuration.html a folder should be possible as well. Example with all other settings
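A hedged sketch of the event-logging settings referred to above, using the spark.eventLog.* properties from the linked configuration page as a programmatic equivalent of the SPARK_JAVA_OPTS approach; the HDFS path and app name are placeholders.

    // Enable event logging so the webUI's application details page has data to show.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://namenode:9100/user/spark/eventlogs")   // placeholder path
    val sc = new SparkContext(conf)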

Using very large files for KMeans training -- cluster centers size?

2014-08-11 Thread durin
I'm trying to apply KMeans training to some text data, which consists of lines that each contain something between 3 and 20 words. For that purpose, all unique words are saved in a dictionary. This dictionary can become very large as no hashing etc. is done, but it should spill to disk in case it

Re: saveAsTextFile

2014-08-10 Thread durin
This should work: jobs.saveAsTextFile(file:////home/hysom/testing) Note the 4 slashes; it's really 3 slashes + the absolute path. This should be mentioned in the docs though; I only remember it from having seen it somewhere else. The output folder, here testing, will be created and must therefore
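A hedged sketch of the same call with stand-in data; the path is the one from the post, and the output directory must not exist beforehand.

    // "file:" + "///" scheme separator + an absolute path beginning with "/" gives the four slashes.
    val jobs = sc.parallelize(Seq("result1", "result2", "result3"))   // stand-in data
    jobs.saveAsTextFile("file:////home/hysom/testing")                // creates the "testing" directory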

Executors for Spark shell take much longer to be ready

2014-08-08 Thread durin
I recently moved my Spark installation from one Linux user to another one, i.e. I changed the folder and the ownership of the files. That was everything; no other settings were changed and no different machines were used. However, now it suddenly takes three minutes to have all executors in the Spark shell

Re: KMeans Input Format

2014-08-07 Thread durin
Not all memory can be used for Java heap space, so maybe it does run out. Could you try repartitioning the data? To my knowledge, you shouldn't be thrown out as long as a single partition fits into memory, even if the whole dataset does not. To do that, replace val train = parsedData.cache()
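A hedged sketch of the suggested change with a tiny stand-in dataset; the partition count, k, and iteration count are illustrative, not values from the thread.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Tiny stand-in for the thread's parsedData RDD[Vector].
    val parsedData = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0), Vectors.dense(1.5, 1.8), Vectors.dense(8.0, 9.0)))

    // Repartition before caching so no single partition has to hold too much data.
    val train = parsedData.repartition(200).cache()   // instead of: val train = parsedData.cache()
    val model = KMeans.train(train, 2, 20)            // k = 2 clusters, 20 iterations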

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Update: I can get it to work by disabling iptables temporarily. However, I cannot figure out which port I have to accept traffic on. 4040 and any of the Master or Worker ports mentioned in the previous post don't work. Can it be one of the randomly assigned ones in the 30k to 60k range? Those

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
in your cluster. -Andrew 2014-08-06 10:23 GMT-07:00 durin <[hidden email]>: Update: I can get it to work by disabling

Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
I am using the latest Spark master and additionally, I am loading these jars: - spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar - twitter4j-core-4.0.2.jar - twitter4j-stream-4.0.2.jar My simple test program that I execute in the shell looks as follows: import org.apache.spark.streaming._
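A hedged reconstruction of what such a shell test program typically looks like with the Spark 1.1-era Twitter receiver; the OAuth credentials and the batch interval are placeholders, not values from the original post.

    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._

    // Placeholder credentials; twitter4j reads them from system properties.
    System.setProperty("twitter4j.oauth.consumerKey", "...")
    System.setProperty("twitter4j.oauth.consumerSecret", "...")
    System.setProperty("twitter4j.oauth.accessToken", "...")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "...")

    val ssc = new StreamingContext(sc, Seconds(10))     // 10-second batches, reusing the shell's sc
    val tweets = TwitterUtils.createStream(ssc, None)   // None = use the system-property credentials
    tweets.map(_.getText).print()                       // print a few tweet texts per batch
    ssc.start()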

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
Using twitter4j 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j ) changes the error to Exception in thread Thread-55 java.lang.NoClassDefFoundError: twitter4j/StatusListener at org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
In the WebUI Environment tab, the section Classpath Entries lists the following ones as part of System Classpath: /foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop /foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar /foo/spark-master-2014-07-28/conf

Re: sbt package failed: wrong libraryDependencies for spark-streaming?

2014-07-31 Thread durin
Hi Tathagata, I didn't mean to say this was an error. According to the other thread I linked, right now there shouldn't be any conflicts, so I wanted to use streaming in the shell for easy testing. I thought I had to create my own project in which I'd add streaming as a dependency, but if I can

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here; that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, thanks for the explanation. 1. You said we have to broadcast m * k centers (with m = number of rows). I thought there were only k centers at any given time, which would then have size n * k and would need to be broadcast. Is that a typo or did I misunderstand something? And the

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon

KMeans: expensiveness of large vectors

2014-07-24 Thread durin
As a source, I have a text file with n rows that each contain m comma-separated integers. Each row is then converted into a feature vector with m features. I've noticed that, given the same total file size and number of features, a larger number of columns is much more expensive for training
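For concreteness, a small sketch of the conversion described above, turning each line of m comma-separated integers into an m-dimensional dense vector; the input path is a placeholder.

    import org.apache.spark.mllib.linalg.Vectors

    // Each input line such as "3,0,7" becomes one dense feature vector of length m.
    val features = sc.textFile("hdfs:///path/to/input.txt")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))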

import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
I'm using Spark 1.0.0 (a three-week-old build of the latest). Along the lines of this tutorial http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html , I want to read some tweets from Twitter. When trying to execute this in the Spark shell, I get The tutorial

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
Thanks. Can I see somewhere in the API docs that a class is not available in the shell, or do I have to find out by trial and error?

Re: KMeans for large training data

2014-07-12 Thread durin
Thanks, setting the number of partitions to the number of executors helped a lot, and training with 20k entries got a lot faster. However, when I tried training with 1M entries, after about 45 minutes of calculations, I get this: It's stuck at this point. The CPU load for the master is at 100%

KMeans for large training data

2014-07-11 Thread durin
Hi, I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic clustering with Strings. My code works great when I use a five-figure number of training elements. However, with, for example, 2 million elements, it gets extremely slow. A single stage may take up to 30 minutes. From

LIMIT with offset in SQL queries

2014-07-02 Thread durin
Hi, in many SQL DBMSs like MySQL, you can set an offset for the LIMIT clause, such that LIMIT 5, 10 will return 10 rows, starting from row 5. As far as I can see, this is not possible in Spark SQL. The best solution I have for imitating that (using Scala) is converting the RDD into an Array via
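A hedged sketch of one way to imitate an offset without first collecting everything to the driver, by numbering rows and filtering on the index; the table name is a placeholder, and the row order is only meaningful if the query itself defines one.

    // Imitate "LIMIT 5, 10": skip the first 5 rows, keep the next 10.
    val offset = 5L
    val count  = 10L

    val page = sqlContext.sql("SELECT text FROM tweetTable")
      .zipWithIndex()                                              // pair each row with its index
      .filter { case (_, i) => i >= offset && i < offset + count }
      .map { case (row, _) => row }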

jsonFile function in SQLContext does not work

2014-06-25 Thread durin
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23). I'm trying to execute the following code: import org.apache.spark.SparkContext._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) val table = sqlContext.jsonFile(hdfs://host:9100/user/myuser/data.json)
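For context, a hedged sketch of how that call is typically continued in the Spark 1.0-era SQL API (method names changed in later releases); the table name is a placeholder.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val table = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
    table.registerAsTable("jsonTable")               // registerTempTable in later versions
    sqlContext.sql("SELECT * FROM jsonTable LIMIT 10").collect().foreach(println)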

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
(Thread.java:662) Driver stacktrace: Is the only possible reason that some of these 4.3 million JSON objects are not valid JSON, or could there be another explanation? And if that is the reason, is there some way to tell the function to just skip faulty lines? Thanks, Durin

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Yin and Aaron, thanks for your help, this was indeed the problem. I've counted 1233 blank lines using grep, and the code snippet below works with those. From what you said, I guess that skipping faulty lines will be possible in later versions? Kind regards, Simon
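A hedged sketch of the kind of workaround discussed here (not the original poster's snippet, which isn't preserved in this archive): filter out blank lines and feed the rest to jsonRDD instead of jsonFile; the path is a placeholder and jsonRDD is assumed to be available in this snapshot.

    // Drop blank lines before JSON parsing, then infer the schema from the rest.
    val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
    val nonBlank = raw.filter(_.trim.nonEmpty)
    val table = sqlContext.jsonRDD(nonBlank)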