Re: Finding previous and next element in a sorted RDD

2014-08-21 Thread Evan Chan
There's no way to avoid a shuffle, since the first and last elements of each partition need to be computed together with elements from the other partitions, but I wonder if there is a way to do a minimal shuffle. On Thu, Aug 21, 2014 at 6:13 PM, cjwang wrote: > One way is to do zipWithIndex on the RDD. Then use the index as

Re: OOM Java heap space error on saveAsTextFile

2014-08-21 Thread Akhil Das
What operation are you performing before doing the saveAsTextFile? If you are doing groupBy/sortBy/mapPartitions/reduceByKey operations then you can specify the number of partitions. We were facing this kind of problem, and specifying the correct number of partitions solved the issue. Thanks Best Regards
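For reference, a minimal sketch of passing an explicit partition count to reduceByKey before saving (paths, parsing, and the count of 400 are illustrative):

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("save-with-partitions"))

// Build some (key, value) pairs; path and parsing are placeholders.
val pairs = sc.textFile("hdfs:///input/*.txt").map(line => (line.split("\t")(0), 1))

// reduceByKey accepts an explicit partition count; more, smaller partitions
// keep each task's output small enough to avoid heap pressure on save.
val counts = pairs.reduceByKey(_ + _, 400)
counts.saveAsTextFile("hdfs:///output/counts")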

Spark on Mesos cause mesos-master OOM

2014-08-21 Thread Chengwei Yang
Hi List, We're recently trying to run Spark on Mesos; however, we encountered a fatal error: the mesos-master process will continuously consume memory and finally get killed by the OOM killer. This situation only happens if a Spark job (fine-grained mode) is running. We finally root-caused the issue

LDA example?

2014-08-21 Thread Denny Lee
Quick question - is there a handy sample / example of how to use the LDA algorithm within Spark MLlib? Thanks! Denny

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-21 Thread ZHENG, Xu-dong
Update. I just found a magic parameter *balanceSlack* in *CoalescedRDD*, which sounds like it could control the locality. The default value is 0.1 (smaller value means lower locality). I changed it to 1.0 (full locality) and used the #3 approach, then saw a lot of improvement (20%~40%). Although the Web UI still sh

Re: [Spark SQL] How to select first row in each GROUP BY group?

2014-08-21 Thread Fengyun RAO
Thanks, Silvio. If we write schemaRDD.map(row => (key, row)) .groupBy(key) .map((key, rows) => row) // take the first row from Iterable[Row] we get an RDD[Row]; however, we need a SchemaRDD for the following query. In our case, the Row has about 80 columns, which exceeds the case

Re: AppMaster OOME on YARN

2014-08-21 Thread Nieyuan
1. At the beginning of the reduce task, the master will deliver the map output info to every executor. You can check stderr to find the size of the map output info. It should be: "spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is xxx bytes" 2. Every executor will process 10+TB/2000 = 5G of data. Reduc

Spark SQL Parser error

2014-08-21 Thread S Malligarjunan
Hello All, When I execute the following query:  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) CREATE TABLE user1 (time string, id string, u_id string, c_ip string, user_agent string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE LO
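For reference, a minimal sketch of submitting Hive DDL like the above through the HiveContext (Spark 1.0.x, where HiveContext exposes hql; the table definition is the one from the message):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// DDL statements go through hql(), which uses Hive's own parser.
hiveContext.hql("""
  CREATE TABLE user1 (time STRING, id STRING, u_id STRING, c_ip STRING, user_agent STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
""")

hiveContext.hql("SELECT count(*) FROM user1").collect().foreach(println)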

Re: Spark-job error on writing result into hadoop w/ switch_user=false

2014-08-21 Thread Jongyoul Lee
Hi, For more details, I'm using Mesos as below and running a very simple command in spark-shell: scala> sc.textFile("/data/pickat/tsv/app/2014/07/31/*").map(_.split).groupBy(p => p(1)).saveAsTextFile("/user/1001079/pickat_test") "hdfs" is the account name that runs Mesos, "1001079" is that of the user running the script.

The running time of spark

2014-08-21 Thread Denis RP
Hi, I'm using Spark on a cluster of 8 VMs, each with two cores and 3.5GB RAM, but I need to run a shortest path algorithm on 500+GB of data (a text file where each line contains a node id and the nodes it points to). I've tested it on the cluster, but the speed seems to be extremely slow, and I haven't got any

Re: Finding previous and next element in a sorted RDD

2014-08-21 Thread cjwang
One way is to do zipWithIndex on the RDD. Then use the index as a key. Add or subtract 1 for previous or next element. Then use cogroup or join to bind them together. val idx = input.zipWithIndex val previous = idx.map(x => (x._2+1, x._1)) val current = idx.map(x => (x._2, x._1)) val next = idx
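A completed sketch of that approach (here the element type is assumed to be String, and the final step is filled in with leftOuterJoin so missing neighbours come back as None):

import org.apache.spark.SparkContext._   // pair RDD functions (join, leftOuterJoin)

// input: an RDD[String] already sorted in the desired order
val idx = input.zipWithIndex()                         // (element, index)
val current  = idx.map { case (v, i) => (i, v) }
val previous = idx.map { case (v, i) => (i + 1, v) }   // element at i becomes the "previous" of i+1
val next     = idx.map { case (v, i) => (i - 1, v) }   // element at i becomes the "next" of i-1

// For every element: (element, Option[previous element], Option[next element])
val withNeighbours = current
  .leftOuterJoin(previous)
  .leftOuterJoin(next)
  .map { case (_, ((cur, prev), nxt)) => (cur, prev, nxt) }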

Re: Spark Streaming - What does Spark Streaming checkpoint?

2014-08-21 Thread Chris Fregly
The StreamingContext can be recreated from a checkpoint file, indeed. check out the following Spark Streaming source files for details: StreamingContext, Checkpoint, DStream, DStreamCheckpoint, and DStreamGraph. On Wed, Jul 9, 2014 at 6:11 PM, Yan Fang wrote: > Hi guys, > > I am a little co
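For illustration, the usual driver-side recovery pattern built around a checkpoint directory (application name, path, and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define the DStream graph here, before returning
  ssc
}

// Rebuilds the StreamingContext from the checkpoint if one exists,
// otherwise creates a fresh one with createContext().
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()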

Finding previous and next element in a sorted RDD

2014-08-21 Thread cjwang
I have an RDD containing elements sorted in certain order. I would like to map over the elements knowing the values of their respective previous and next elements. With regular List, I used to do this: ("input" is a List below) // The first of the previous measures and the last of the next meas

Re: Matrix multiplication in spark

2014-08-21 Thread Victor Tso-Guillen
Scala defines transpose. On Thu, Aug 21, 2014 at 4:22 PM, x wrote: > Yes. > Now Spark API doesn't provide transpose function. You have to define it > like below. > > def transpose(m: Array[Array[Double]]): Array[Array[Double]] = { > (for { > c <- m(0).indices > } yield m.map(_(c))
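A quick illustration of the built-in collection method on a small example matrix:

val m: Array[Array[Double]] = Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0)
)

// Scala collections already provide transpose for nested arrays/sequences
val mT: Array[Array[Double]] = m.transpose
// mT is Array(Array(1.0, 4.0), Array(2.0, 5.0), Array(3.0, 6.0))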

Spark-JobServer moving to a new location

2014-08-21 Thread Evan Chan
Dear community, Wow, I remember when we first open sourced the job server, at the first Spark Summit in December. Since then, more and more of you have started using it and contributing to it. It is awesome to see! If you are not familiar with the spark job server, it is a REST API for managin

Re: Spark Streaming Twitter Example Error

2014-08-21 Thread Rishi Yadav
please add following three libraries to your class path. spark-streaming-twitter_2.10-1.0.0.jar twitter4j-core-3.0.3.jar twitter4j-stream-3.0.3.jar On Thu, Aug 21, 2014 at 1:09 PM, danilopds wrote: > Hi! > > I'm beginning with the development in Spark Streaming.. And I'm learning > with the e

Re: Hive From Spark

2014-08-21 Thread Marcelo Vanzin
Hi Du, I don't believe the Guava change has made it to the 1.1 branch. The Guava doc says "hashInt" was added in 12.0, so what's probably happening is that you have an old version of Guava in your classpath before the Spark jars. (Hadoop ships with Guava 11, so that may be the source of your prob

Re: Hive From Spark

2014-08-21 Thread Du Li
Hi, This guava dependency conflict problem should have been fixed as of yesterday according to https://issues.apache.org/jira/browse/SPARK-2420 However, I just got java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; by the following code

Re: Matrix multiplication in spark

2014-08-21 Thread x
Yes. The Spark API doesn't currently provide a transpose function. You have to define it like below. def transpose(m: Array[Array[Double]]): Array[Array[Double]] = { (for { c <- m(0).indices } yield m.map(_(c)) ).toArray } xj @ Tokyo On Thu, Aug 21, 2014 at 10:12 PM, phoenix bai wrote: > th

Configuration for big worker nodes

2014-08-21 Thread soroka21
Hi, I have relatively big worker nodes. What would be the best worker configuration for them? Should I use all memory for the JVM and utilize all cores when running my jobs? Each node has 2x10 CPU cores and 160GB of RAM. The cluster has 4 nodes connected with a 10G network.

saveAsTextFile makes no progress without caching RDD

2014-08-21 Thread jerryye
Hi, I'm running on branch-1.1 and trying to do a simple transformation on a relatively small dataset of 64GB; saveAsTextFile essentially hangs and tasks are stuck in running mode with the following code: // Stalls with tasks running for over an hour with no tasks finishing. Smallest partition i

AppMaster OOME on YARN

2014-08-21 Thread Vipul Pandey
Hi, I'm running Spark on YARN carrying out a simple reduceByKey followed by another reduceByKey after some transformations. After completing the first stage my Master runs out of memory. I have 20G assigned to the master, 145 executors (12G each +4G overhead) , around 90k input files, 10+TB d

Development environment issues

2014-08-21 Thread pierred
Hello all, I am trying to get productive with Spark and Scala but I haven't figured out a good development environment yet. Coming from the eclipse/java/ant/ivy/hadoop world, I understand that I have a steep learning curve ahead of me, but still I would expect to be able to relatively quickly set

Re: Debugging cluster stability, configuration issues

2014-08-21 Thread Jayant Shekhar
Hi Shay, You can try setting spark.storage.blockManagerSlaveTimeoutMs to a higher value. Cheers, Jayant On Thu, Aug 21, 2014 at 1:33 PM, Shay Seng wrote: > Unfortunately it doesn't look like my executors are OOM. On the slave > machines I checked both the logs in /spark/log (which I assume i
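For example, the property can be set on the SparkConf before the context is created (the value shown is just an illustration):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  // Raise this if long GC pauses or heavy S3/HDFS I/O cause healthy executors
  // to be marked as lost; 300000 ms (5 minutes) is just an example value.
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")

val sc = new SparkContext(conf)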

Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
This is what I thought the simplest method would be, but I can't seem to figure out how to configure it. You set SPARK_WORKER_INSTANCES to set the number of worker processes per node, and SPARK_WORKER_MEMORY to set how much total memory workers have to give executors (e.g.

Re: heterogeneous cluster hardware

2014-08-21 Thread Andrew Ash
From my code reading investigation: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L194 - where Spark only uses offers that have at least spark.executor.memory (called sc.executorMemory here) https://github.com

Re: heterogeneous cluster hardware

2014-08-21 Thread Andrew Ash
I'm actually not sure the Spark+Mesos integration supports dynamically allocating memory (it does support dynamically allocating cores though). Has anyone here actually used Spark+Mesos on heterogeneous hardware and done dynamic memory allocation? My understanding is that each Spark executor start

Re: heterogeneous cluster hardware

2014-08-21 Thread Jörn Franke
Hi, No worries ;-) I think this scenario might still be supported by spark running on Mesos or Yarn2. Even your GPU-scenario could be supported. Check out the following resources: * https://spark.apache.org/docs/latest/running-on-mesos.html * http://mesos.berkeley.edu/mesos_tech_report.pdf Best

Re: Spark on Yarn: Connecting to Existing Instance

2014-08-21 Thread Chris Fregly
Perhaps the author is referring to Spark Streaming applications? They're examples of long-running applications. The application/domain-level protocol still needs to be implemented yourself, as Sandy pointed out. On Wed, Jul 9, 2014 at 11:03 AM, John Omernik wrote: > So how do I do the "long-l

OOM Java heap space error on saveAsTextFile

2014-08-21 Thread Daniil Osipov
Hello, My job keeps failing on saveAsTextFile stage (frustrating after a 3 hour run) with an OOM exception. The log is below. I'm running the job on an input of ~8Tb gzipped JSON files, executing on 15 m3.xlarge instances. Executor is given 13Gb memory, and I'm setting two custom preferences in th

Re: heterogeneous cluster hardware

2014-08-21 Thread Daniel Siegmann
If you use Spark standalone, you could start multiple workers on some machines. Size your worker configuration to be appropriate for the weak machines, and start multiple on your beefier machines. It may take a bit of work to get that all hooked up - probably you'll want to write some scripts to s

Re: Debugging cluster stability, configuration issues

2014-08-21 Thread Shay Seng
Unfortunately it doesn't look like my executors are OOM. On the slave machines I checked both the logs in /spark/log (which I assume is from the slave driver?) and in /spark/work/... which I assume are from each worker/executor. On Thu, Aug 21, 2014 at 11:19 AM, Yana Kadiyska wrote: > Wheneve

Re: Spark SQL: Caching nested structures extremely slow

2014-08-21 Thread Yin Huai
I have not profiled this part. But, I think one possible cause is allocating an array for every inner struct for every row (every struct value is represented by a Spark SQL row). I will play with it later and see what I find. On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan wrote: > Hey guys, > > I'm

Spark Streaming Twitter Example Error

2014-08-21 Thread danilopds
Hi! I'm beginning with the development in Spark Streaming.. And I'm learning with the examples available in the spark directory. There are several applications and I want to make modifications. I can execute the TwitterPopularTags normally with command: ./bin/run-example TwitterPopularTags So,

Re: [Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
And it worked earlier with non-parquet directory. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote: > The underFS is HDFS btw. > > On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote: >> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) >> >> scala> val gdeltT = >> sqlContext.parquetF

Re: [Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
The underFS is HDFS btw. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote: > Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) > > scala> val gdeltT = > sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/") > 14/08/21 19:07:14 INFO : > initialize(tachyon://1

[Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) scala> val gdeltT = sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/") 14/08/21 19:07:14 INFO : initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005, Configuration: core-default.xml, core-site.xml

Writeup on Spark SQL with GDELT

2014-08-21 Thread Evan Chan
I just put up a repo with a write-up on how to import the GDELT public dataset into Spark SQL and play around. Has a lot of notes on different import methods and observations about Spark SQL. Feel free to have a look and comment. http://www.github.com/velvia/spark-sql-gdelt ---

Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-08-21 Thread Bharat Venkat
Hi, To test the resiliency of Kafka Spark streaming, I killed the worker reading from Kafka Topic and noticed that the driver is unable to replace the worker and the job becomes a rogue job that keeps running doing nothing from that point on. Is this a known issue? Are there any workarounds? He

Re: [Spark Streaming] kafka consumer announce

2014-08-21 Thread Evgeniy Shishkin
>> On 21 Aug 2014, at 20:25, Tim Smith wrote: >> >> Thanks. Discovering kafka metadata from zookeeper instead of brokers >> is nicer. Saving metadata and offsets to HBase, is that optional or >> mandatory? >> Can it be made optional (default to zookeeper)? >> For now we implemented and somewhat

Re: Spark QL and protobuf schema

2014-08-21 Thread Dmitriy Lyubimov
OK, I'll try. I happen to do that a lot with other tools. So I am guessing you are saying that if I wanted to do it now, I'd start against https://github.com/apache/spark/tree/branch-1.1 and PR against it? On Thu, Aug 21, 2014 at 12:28 AM, Michael Armbrust wrote: > I do not know of any existing way to d

Re: MLlib: issue with increasing maximum depth of the decision tree

2014-08-21 Thread SURAJ SHETH
Hi Sameer, http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-td7401.html Thanks and Regards, Suraj Sheth On Thu, Aug 21, 2014 at 10:52 PM, Sameer Tilak wrote: > Resending this: > > > Hi Al

Debugging cluster stability, configuration issues

2014-08-21 Thread Shay Seng
Hi, I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge machines. The cluster is running Spark standalone and is launched with the ec2 scripts. In my Spark job, I am using ephemeral HDFS to checkpoint some of my RDDs. I'm also reading and writing to S3. My jobs also involve a large

MLlib: issue with increasing maximum depth of the decision tree

2014-08-21 Thread Sameer Tilak
Resending this: Hi All, My dataset is fairly small -- a CSV file with around half million rows and 600 features. Everything works when I set maximum depth of the decision tree to 5 or 6. However, I get this error for larger values of that parameter -- For example when I set it to 10.

RE: I am struggling to run Spark Examples on my local machine

2014-08-21 Thread prajod.vettiyattil
Hi Steve, Your Spark master is not running if you have not started it. On Windows, it's missing some scripts and/or the correct installation instructions. I was able to start the master with C:\> spark-class.cmd org.apache.spark.deploy.master.Master Then in the browser at localhost:port you get

I am struggling to run Spark Examples on my local machine

2014-08-21 Thread Steve Lewis
I downloaded the binaries for spark-1.0.2-hadoop1 and unpacked them on my Windows 8 box. I can execute spark-shell.cmd and get a command window which does the proper things. I open a browser to http://localhost:4040 and a window comes up describing the spark master. Then using IntelliJ I create a project wi

Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
Jörn, thanks for the post... Unfortunately, I am stuck with the hardware I have and might not be able to get budget allocated for a new stack of servers when I've already got so many "ok" servers on hand... And even more unfortunately, a large subset of these machines are... shall we say... extrem

Re: Job aborted due to stage failure: Master removed our application: FAILED

2014-08-21 Thread Yana
I think there might be yet another log to check -- the actual executor log. I'm not quite sure where they are stored, but I believe it's under spark.local.dir (by default /tmp); then look for folders like app-... and under there should be the executor number and logs. Might be easier to click throug

Tracking memory usage

2014-08-21 Thread Grzegorz Białek
Hi, I would like to ask how to check how much memory of executor was used during run of application. I know where to check cache memory usage in logs and in web UI (in Storage tab), but where can I check size of rest of the heap (used e.g. for aggregation and cogroups during shuffle)? Because it a

Re: heterogeneous cluster hardware

2014-08-21 Thread Jörn Franke
Hi, Well, you could use Mesos or Yarn2 to define resources per Job - you can give only as much resources (cores, memory etc.) per machine as your "worst" machine has. The rest is done by Mesos or Yarn. By doing this you avoid a per machine resource assignment without any disadvantages. You can ru

DStream start a separate DStream

2014-08-21 Thread Josh J
Hi, I would like to have a sliding window DStream perform a streaming computation and store these results. Once these results are stored, I then would like to process the results. Though I must wait until the final computation is done for all tuples in the sliding window, before I begin the new DStre

multiple windows from the same DStream ?

2014-08-21 Thread Josh J
Hi, Can I build two sliding windows in parallel from the same DStream? Will these two window streams run in parallel and process the same data? I wish to do two different functions (aggregation on one window and storage for the other window) across the same original DStream data, though the same
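A minimal sketch of that setup (the source, durations, aggregation, and output path are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val events = ssc.socketTextStream("localhost", 9999)   // placeholder source

// Two windows derived from the same DStream; both belong to the same DStream
// graph and see the same underlying batches.
val aggregated = events.window(Seconds(60), Seconds(10)).count()   // aggregation on one window
val toStore    = events.window(Seconds(60), Seconds(10))           // storage for the other window

aggregated.print()
toStore.saveAsTextFiles("hdfs:///streams/window")

ssc.start()
ssc.awaitTermination()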

Re: heterogeneous cluster hardware

2014-08-21 Thread anthonyjschu...@gmail.com
I've got a stack of Dell commodity servers -- RAM ranging from 8 to 32GB, single or dual quad-core processors per machine. I think I will have them loaded with CentOS. Eventually, I may want to add GPUs on the nodes to handle linear algebra operations... My idea has been: 1) to find a way to configure Spar

Re: Merging two Spark SQL tables?

2014-08-21 Thread Evan Chan
So I tried the above (why doesn't union or ++ have the same behavior, btw?) and it works, but is slow because the original RDDs are not cached and files must be read from disk. I also discovered you can recover the InMemoryCached versions of the RDDs using sqlContext.table("table1"). Thus you can

Re: Spark Streaming with Flume event

2014-08-21 Thread Spidy
BTW, I'm using CDH 5.1 with Spark 1.0.

Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
Thanks for the useful links. Cheers, Julien 2014-08-21 11:47 GMT+02:00 Yanbo Liang : > In Spark/MLlib, task serialization such as cluster centers of k-means was > replaced by broadcast variables due to performance. > You can refer this PR https://github.com/apache/spark/pull/1427 > And curren

[Spark Streaming] kafka consumer announce

2014-08-21 Thread Evgeniy Shishkin
Hello, we are glad to announce yet another Kafka input stream, available at https://github.com/wgnet/spark-kafka-streaming It has been used in production for about 3 months. We will be happy to hear your feedback. Custom Spark Kafka consumer based on the Kafka SimpleConsumer API. Features • dis

Re: [Spark SQL] How to select first row in each GROUP BY group?

2014-08-21 Thread Silvio Fiorito
Yeah, unfortunately SparkSQL is missing a lot of the nice analytical functions in Hive. But using a combo of SQL and Spark operations you should be able to run the basic SQL, then do a groupBy on the SchemaRDD, then for each group just take the first record. From: Fengyun RAO mailto:raofeng...@
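Roughly, that combination could look like the sketch below (the grouping key is assumed to be the first column; the schemaRDD name is illustrative):

import org.apache.spark.SparkContext._   // pair RDD functions (groupByKey)

// schemaRDD: the SchemaRDD to group; the grouping key is assumed to be column 0.
val firstPerGroup = schemaRDD
  .map(row => (row(0), row))               // key each row by the grouping column
  .groupByKey()
  .map { case (_, rows) => rows.head }     // keep only the first row per group

// firstPerGroup is a plain RDD[Row]; turning it back into a SchemaRDD for further
// SQL requires re-applying a schema (e.g. applySchema in Spark 1.1).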

Re: [Spark SQL] How to select first row in each GROUP BY group?

2014-08-21 Thread Fengyun RAO
Could anybody help? I googled and read a lot, but didn’t find anything helpful. or to make the question simple: * How to set row number for each group? * SELECT a, ROW_NUMBER() OVER (PARTITION BY a) AS num FROM table. 2014-08-20 15:52 GMT+08:00 Fengyun RAO : I have a table with 4 column

Launching history server problem

2014-08-21 Thread Grzegorz Białek
Hi, I'm trying to launch the history server in local mode, but my server doesn't see any completed applications. (I'm not sure if it's possible in standalone mode; here: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html#viewing-after-the-fact only Mesos and Yarn are mentioned). I set sp

Job aborted due to stage failure: Master removed our application: FAILED

2014-08-21 Thread Kristoffer Sjögren
Hi I have trouble executing a really simple Java job on spark 1.0.0-cdh5.1.0 that runs inside a docker container: SparkConf sparkConf = new SparkConf().setAppName("TestApplication").setMaster("spark://localhost:7077"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); JavaRDD lines = ctx.te

Re: Matrix multiplication in spark

2014-08-21 Thread x
You could create a distributed matrix with RowMatrix. val rmat = new RowMatrix(rows) And then make a local DenseMatrix. val localMat = Matrices.dense(m, n, mat) Then multiply them. rmat.multiply(localMat) xj @ Tokyo On Thu, Aug 21, 2014 at 6:37 PM, Sean Owen wrote: > Are you trying to mul
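Putting those pieces together, a small end-to-end sketch (dimensions and values are made up):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Distributed matrix: each row is a local Vector
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
))
val rmat = new RowMatrix(rows)           // 2 x 2 distributed matrix

// Local dense matrix, column-major: 2 x 2 identity
val localMat = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))

// Distributed x local multiplication; the result is another RowMatrix
val product = rmat.multiply(localMat)
product.rows.collect().foreach(println)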

Re: Personalized Page rank in graphx

2014-08-21 Thread Martin Liesenberg
I could take a stab at it, though I'd have some reading up on Personalized PageRank to do, before I'd be able to start coding. If that's OK, I'd get started. Best regards, Martin On 20 August 2014 23:03, Ankur Dave wrote: > At 2014-08-20 10:57:57 -0700, Mohit Singh wrote: >> I was wondering if

More than one worker freezes (some) applications

2014-08-21 Thread Alexander Matz
Hi, I'm using spark on a 8 (+headnode) node cluster in standalone mode. The headnode runs the master instance and the compute nodes are supposed to run the workers. No worker runs on the headnode. However, as soon as I connect more than one worker I get a weird behaviour: some applications run, s

Spark Streaming checkpoint recovery causes IO re-execution

2014-08-21 Thread RodrigoB
Dear Spark users, We have a spark streaming application which receives events from kafka and has an updatestatebykey call that executes IO like writing to Cassandra or sending events to other systems. Upon metadata checkpoint recovery (before the data checkpoint occurs) all lost RDDs get recompu

Re: Advantage of using cache()

2014-08-21 Thread Grzegorz Białek
Hi, thank you for your response. I removed issues you mentioned. Now I read RDDs from files, whole rdd is cached, I don't use random and rdd1 and rdd2 are identical. RDDs that are joined contains 100k entries and result contains 10m entries. rdd1 and rdd2 after join also contains 10m entries. Here

Re: Broadcast vs simple variable

2014-08-21 Thread Yanbo Liang
In Spark/MLlib, task serialization such as the cluster centers of k-means was replaced by broadcast variables due to performance. You can refer to this PR: https://github.com/apache/spark/pull/1427 And the current k-means implementation in MLlib benefits from sparse vector computation. http://spark-summit
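As a small illustration of that pattern, broadcasting k-means-style cluster centers instead of closing over them directly (the centers, points, and distance function here are placeholders):

// Centers and points are illustrative placeholders.
val centers: Array[Array[Double]] = Array(Array(0.0, 0.0), Array(5.0, 5.0))

// Ship the centers to each executor once, instead of serializing them into
// every task closure.
val bcCenters = sc.broadcast(centers)

def sqdist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

val points = sc.parallelize(Seq(Array(1.0, 1.0), Array(4.0, 6.0)))
val assignments = points.map { p =>
  val cs = bcCenters.value
  cs.indices.minBy(i => sqdist(cs(i), p))   // index of the nearest center
}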

out of memory errors -- per core memory limits?

2014-08-21 Thread Rok Roskar
I am having some issues with processes running out of memory and I'm wondering if I'm setting things up incorrectly. I am running a job on two nodes with 24 cores and 256Gb of memory each. I start the pyspark shell with SPARK_EXECUTOR_MEMORY=210gb. When I run the job with anything more than 8

Re: Matrix multiplication in spark

2014-08-21 Thread Sean Owen
Are you trying to multiply dense or sparse matrices? if sparse, are they very large -- meaning, are you looking for distributed operations? On Thu, Aug 21, 2014 at 10:07 AM, phoenix bai wrote: > there is RowMatrix implemented in spark. > and I check for a while but failed to find any matrix opera

ClassCastException, when calling saveToCassandra()

2014-08-21 Thread keiffster
While getting up and running with Spark & Cassandra, I am coming up against the following problem. I've narrowed it down to the following 2 lines (originally it was a large query): val pcollection2 = sc.parallelize(Seq(("", "Filter1", "A", "X"))) pcollection2.saveToCassandra("signals", "segm

Matrix multiplication in spark

2014-08-21 Thread phoenix bai
There is a RowMatrix implemented in Spark, and I checked for a while but failed to find any matrix operations (like multiplication etc.) defined in the class yet. So my question is, if I want to do matrix multiplication (vector x matrix multiplication, to be precise), do I need to convert the vec

Re: DStream cannot write to text file

2014-08-21 Thread Hoai-Thu Vuong
50075 is the default port for web access; the right port is 9000, or whatever is configured in core-site.xml with the variable fs.default.name. Please check the documentation. On Thu, Aug 21, 2014 at 3:01 PM, Mayur Rustagi wrote: > is your hdfs running, can spark access it? > > Mayur Rustagi > Ph: +1 (760)

Re: DStream cannot write to text file

2014-08-21 Thread Mayur Rustagi
Is your HDFS running? Can Spark access it? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Aug 21, 2014 at 1:15 PM, cuongpham92 wrote: > I'm sorry, I just forgot "/data" after "hdfs://localhost:50075". When I > add

Re: DStream cannot write to text file

2014-08-21 Thread cuongpham92
I'm sorry, I just forgot "/data" after "hdfs://localhost:50075". When I added it, a new exception showed up: "Call to localhost/127.0.0.1:50075 failed on local exception". How could I fix it? Thanks, Cuong.

Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
My Arrays are in fact Array[Array[Long]] and like 17x15 (17 centers with 150 000 modalities; I'm working on qualitative variables), so they are pretty large. I'm working on it to get them smaller; it's mostly a sparse matrix. Good things to know nevertheless. Thanks, Julien Naour 2014-08-20

Re: DStream cannot write to text file

2014-08-21 Thread cuongpham92
Dear Mayur Rustagi, The exception is "Incomplete HDFS URI, no host: hdfs://localhost:50075-1408602904000.output". In my situation, I receive a string from Kafka, and then want to save it to a text file. Could you please give me some suggestions about this? Best regards, Cuong.

Re: Accessing to elements in JavaDStream

2014-08-21 Thread cuongpham92
Thank you so much, it runs successfully. Cuong.

Re: DStream cannot write to text file

2014-08-21 Thread Mayur Rustagi
MyDStreamVariable.saveAsTextFile("hdfs://localhost:50075/data", "output") this should work... is it throwing an exception? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Aug 21, 2014 at 12:44 PM, cuongpham92 wrote: > >
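For reference, the Scala DStream API also offers saveAsTextFiles(prefix, suffix), which writes one output directory per batch; a small sketch along those lines (source, host, and paths are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source

// Writes one directory per batch, named "<prefix>-<batch time in ms>.<suffix>"
lines.saveAsTextFiles("hdfs://localhost:9000/data/output", "txt")

ssc.start()
ssc.awaitTermination()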

Re: spark - reading hfds files every 5 minutes

2014-08-21 Thread Mayur Rustagi
Hi, case class Person(name: String, age: Int) val lines = ssc.textFileStream("blah blah") val sqc = new SQLContext(sc); lines.foreachRDD(rdd => { rdd.map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).registerAsTable("data") val teenagers = sqc.sql("SELECT * FROM
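A completed sketch of that snippet (field layout, query, and paths are illustrative; registering an RDD of case classes as a table relies on the SQLContext's implicit conversions):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Person(name: String, age: Int)

val sqc = new SQLContext(sc)
import sqc.createSchemaRDD   // implicit conversion RDD[Person] -> SchemaRDD

val ssc = new StreamingContext(sc, Seconds(300))   // e.g. a 5-minute batch interval
val lines = ssc.textFileStream("hdfs:///incoming/people")

lines.foreachRDD { rdd =>
  rdd.map(_.split(","))
     .map(p => Person(p(0), p(1).trim.toInt))
     .registerAsTable("data")
  val teenagers = sqc.sql("SELECT * FROM data WHERE age >= 13 AND age <= 19")
  teenagers.collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()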

Re: Spark QL and protobuf schema

2014-08-21 Thread Michael Armbrust
I do not know of any existing way to do this. It should be possible using the new public API for applying a schema (will be available in 1.1) to an RDD. Basically you'll need to convert the protobuf records into rows, and also create a StructType that represents the schema. With these two things
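A rough sketch of that recipe against the Spark 1.1 applySchema API (the protoRDD variable and the getName/getAge accessors stand in for the real protobuf message type):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// 1. Describe the schema explicitly.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// 2. Convert each protobuf message into a Row.
//    protoRDD and the getName/getAge accessors are placeholders for the real message type.
val rowRDD = protoRDD.map(msg => Row(msg.getName, msg.getAge))

// 3. Apply the schema and register the result for SQL queries.
val messages = sqlContext.applySchema(rowRDD, schema)
messages.registerTempTable("messages")
sqlContext.sql("SELECT name FROM messages WHERE age > 30").collect()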

Re: Merging two Spark SQL tables?

2014-08-21 Thread Michael Armbrust
I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must have the same schema. On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan wrote: > Is it possible to merge two cached Spark SQL tables into a single > table so it can queried with one SQL statement? > > ie, can you do schemaRd
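A small sketch of that, registering the merged result so a single SQL statement can query it (names are illustrative):

// srdd1 and srdd2 are SchemaRDDs with the same schema.
val merged = srdd1.unionAll(srdd2)

// Register the union so one SQL statement can see both datasets,
// and cache it so subsequent queries read from memory.
merged.registerAsTable("merged_table")
sqlContext.cacheTable("merged_table")

val counts = sqlContext.sql("SELECT count(*) FROM merged_table")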

Re: Accessing to elements in JavaDStream

2014-08-21 Thread Mayur Rustagi
transform your way :) MyDStream.transform(RDD => RDD.map(wordChanger)) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Aug 20, 2014 at 1:25 PM, cuongpham92 wrote: > Hi, > I am a newbie to Spark Streaming, and I am

Re: DStream cannot write to text file

2014-08-21 Thread cuongpham92
Dear Mayur Rustagi, Could you give me more detail about this? I did DStream.saveAsTextFile("hdfs://localhost:50075", "output"), but it did not work. As I see, DStream.saveAsTextFile requires two parameters, prefix and suffix, so how should I provide these types of info? Thanks in advance, Cuong.

Re: RDD Row Index

2014-08-21 Thread TJ Klein
Thanks. As my files are defined to be non-splittable, I eventually ended up using mapPartitionsWithIndex(), taking the split ID as index: def g(splitIndex, iterator): yield (splitIndex, iterator.next()) myRDD.mapPartitionsWithIndex(g)

Re: Mapping with extra arguments

2014-08-21 Thread TJ Klein
Thanks for the nice example.

Re: Mapping with extra arguments

2014-08-21 Thread TJ Klein
Thanks. That's pretty much what I need.