Doing a quick Google search, it appears that a number of people have
implemented algorithms for solving systems of (sparse) linear equations on
Hadoop MapReduce.
However, I can find no such thing for Spark.
Does anyone have information on whether there are attempts to create such solvers for Spark?
Hi,
I'm using a cluster with 5 nodes, each with 8 cores and 10 GB of RAM.
Basically, I'm creating a dictionary from text, i.e. giving each word that
occurs more than n times across all texts a unique identifier.
The essential part of the code looks like this:
var texts = ctx.sql(SELECT text FROM
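Stripped of the SQL part, the dictionary-building logic described above is roughly the following, as a plain-Scala sketch without Spark (the function name and the threshold parameter n are made up for illustration):

```scala
// Count word occurrences across all texts, keep only words occurring more
// than n times, and assign each surviving word a unique integer identifier.
def buildDictionary(texts: Seq[String], n: Int): Map[String, Long] = {
  texts
    .flatMap(_.split("\\s+"))            // split each text into words
    .groupBy(identity)                    // word -> all its occurrences
    .collect { case (word, occs) if occs.size > n => word }
    .toSeq
    .sorted                               // deterministic id assignment
    .zipWithIndex
    .map { case (word, id) => word -> id.toLong }
    .toMap
}
```

On an RDD the same shape would be a map to (word, 1), a reduceByKey, a filter, and a zipWithIndex.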
With a lower number of partitions, I keep losing executors during
collect at KMeans.scala:283
The error message is ExecutorLostFailure (executor lost).
The program recovers by automatically repartitioning the whole dataset
(126G), which takes a very long time and seems only to delay the inevitable.
Right now, I have issues even at a far earlier point.
I'm fetching data from a registered table via
var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT
2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
// persisted because it's used again
When trying to use KMeans.train with some large data and 5 worker nodes, it
would fail due to BlockManagers shutting down because of timeouts. I was able
to prevent that by adding
spark.storage.blockManagerSlaveTimeoutMs 300
to the spark-defaults.conf.
However, with 1 Million feature vectors,
If I'm not misunderstanding you, setting event logging in SPARK_JAVA_OPTS
should achieve what you want. I'm logging to HDFS, but according to the
config page http://spark.apache.org/docs/latest/configuration.html a local
folder should be possible as well.
Example with all other settings
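The example was cut off in the archive; the relevant properties (as documented on the linked config page; the HDFS path here is a made-up placeholder) can be set either as -D flags in SPARK_JAVA_OPTS or in spark-defaults.conf:

```
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://namenode:8021/spark-events
```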
I'm trying to apply KMeans training to some text data, which consists of
lines that each contain something between 3 and 20 words. For that purpose,
all unique words are saved in a dictionary. This dictionary can become very
large as no hashing etc. is done, but it should spill to disk in case it
This should work:
jobs.saveAsTextFile("file:////home/hysom/testing")
Note the 4 slashes, it's really 3 slashes + absolute path.
This should be mentioned in the docs, though; I only remember it from having
seen it somewhere else.
The output folder, here testing, will be created and must therefore not
already exist.
I recently moved my Spark installation from one Linux user to another one,
i.e. changed the folder and ownership of the files. That was everything, no
other settings were changed or different machines used.
However, it now suddenly takes three minutes until all executors show up in
the Spark shell.
Not all memory can be used for Java heap space, so maybe it does run out.
Could you try repartitioning the data? To my knowledge you shouldn't be
thrown out as long as a single partition fits into memory, even if the whole
dataset does not.
To do that, exchange
val train = parsedData.cache()
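The replacement was cut off in the archive; presumably the suggestion amounts to repartitioning before caching, so that each partition fits into executor memory (the partition count here is a made-up example, not a recommendation):

```scala
val train = parsedData.repartition(200).cache()
```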
Update: I can get it to work by disabling iptables temporarily. However, I
cannot figure out on which ports I have to accept traffic. 4040 and any of
the Master or Worker ports mentioned in the previous post don't work.
Can it be one of the randomly assigned ones in the 30k to 60k range? Those
in your cluster.
-Andrew 2014-08-06 10:23 GMT-07:00 durin <[hidden email]>:
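If the goal is a fixed set of ports to open in iptables, the otherwise randomly assigned ones can be pinned in spark-defaults.conf; a sketch (property names as of Spark 1.1, some only became configurable in that release; the port numbers are arbitrary examples):

```
spark.driver.port          51000
spark.fileserver.port      51100
spark.broadcast.port       51200
spark.replClassServer.port 51300
spark.blockManager.port    51400
spark.executor.port        51500
```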
I am using the latest Spark master and additionally, I am loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:
import org.apache.spark.streaming._
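The rest of the listing was truncated in the archive; a minimal program along those lines (TwitterUtils API as of Spark 1.x, with the shell's sc assumed, and credentials coming from twitter4j system properties) might look like:

```scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._

// 10-second batches on top of the existing shell SparkContext
val ssc = new StreamingContext(sc, Seconds(10))

// None: fall back to the default twitter4j OAuth authorization
val stream = TwitterUtils.createStream(ssc, None)

stream.map(_.getText).print()
ssc.start()
```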
Using 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j
) changes the error to
Exception in thread "Thread-55" java.lang.NoClassDefFoundError:
twitter4j/StatusListener
at
org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)
In the WebUI Environment tab, the section Classpath Entries lists the
following ones as part of System Classpath:
/foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop
/foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar
/foo/spark-master-2014-07-28/conf
Hi Tathagata,
I didn't mean to say this was an error. According to the other thread I
linked, right now there shouldn't be any conflicts, so I wanted to use
streaming in the shell for easy testing.
I thought I had to create my own project in which I'd add streaming as a
dependency, but if I can
Development is really rapid here, that's a great thing.
Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?
Hi Xiangru,
thanks for the explanation.
1. You said we have to broadcast m * k centers (with m = number of rows). I
thought there were only k centers at a time, which would then have a size of
n * k and would need to be broadcast. Is that a typo, or did I misunderstand
something?
And the
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes in a reasonable time. I guess using torrent helps a lot in this
case.
Best regards,
Simon
As a source, I have a text file with n rows that each contain m
comma-separated integers.
Each row is then converted into a feature vector with m features.
I've noticed that, given the same total file size and number of features, a
larger number of columns is much more expensive for training
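The per-row conversion described above can be sketched in plain Scala (Spark's Vectors.dense would then wrap the resulting array; the function name is made up):

```scala
// Parse one CSV row of m integers into a dense feature array.
def parseRow(line: String): Array[Double] =
  line.split(",").map(_.trim.toDouble)
```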
I'm using Spark 1.0.0 (a three-week-old build of the latest master).
Along the lines of this tutorial
http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html
, I want to read some tweets from twitter.
When trying to execute in the Spark-Shell, I get
The tutorial
Thanks. Can I see somewhere in the API docs that a class is not available in
the shell, or do I have to find out by trial and error?
Thanks, setting the number of partitions to the number of executors helped a
lot and training with 20k entries got a lot faster.
However, when I tried training with 1M entries, after about 45 minutes of
calculations, I get this:
It's stuck at this point. The CPU load for the master is at 100%
Hi,
I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
clustering with Strings.
My code works great when I use a five-figure amount of training elements.
However, with, for example, 2 million elements, it gets extremely slow: a
single stage may take up to 30 minutes.
From
Hi,
in many SQL DBMSs like MySQL, you can set an offset for the LIMIT clause,
such that LIMIT 5, 10 will return 10 rows, starting from row 5.
As far as I can see, this is not possible in Spark SQL.
The best solution I have to imitate that (using Scala) is converting the RDD
into an Array via
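The tail of the post was cut off; the same offset-plus-count behavior can be sketched in plain Scala (the function name is made up; on an RDD the equivalent would be zipWithIndex followed by a filter on the index):

```scala
// Imitate SQL's "LIMIT offset, count": skip `offset` rows, return `count` rows.
def limitWithOffset[A](rows: Seq[A], offset: Int, count: Int): Seq[A] =
  rows.zipWithIndex
    .collect { case (row, i) if i >= offset && i < offset + count => row }
```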
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23).
I'm trying to execute the following code:
import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val table =
sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
(Thread.java:662) Driver stacktrace:
Is the only possible reason that some of these 4.3 Million JSON-Objects are
not valid JSON, or could there be another explanation?
And if it is the reason, is there some way to tell the function to just skip
faulty lines?
Thanks,
Durin
Hi Yin and Aaron,
thanks for your help, this was indeed the problem. I've counted 1233 blank
lines using grep, and the code snippet below works with those.
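The snippet referred to here didn't make it into the archive; the usual shape of the workaround is to pre-filter the blank lines and use jsonRDD instead of jsonFile (the path is taken from the earlier post):

```scala
val raw = sc.textFile("hdfs://host:9100/user/myuser/data.json")
val table = sqlContext.jsonRDD(raw.filter(_.trim.nonEmpty))
```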
From what you said, I guess that skipping faulty lines will be possible in
later versions?
Kind regards,
Simon