Re: Memory/Network Intensive Workload

2014-06-30 Thread Akhil Das
Hi, not sure if this will help you. 1. Create one application that will put files into your S3 bucket from a public data source (you can use public wiki-data) 2. Create another application (a Spark Streaming one) which will listen on that bucket ^^ and perform some operation (Caching, GroupBy etc) as
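
A minimal sketch of step 2, assuming standard Spark Streaming file monitoring (the bucket name, batch interval, and master are made up):

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "s3-stream-demo")
    val ssc = new StreamingContext(sc, Seconds(30))
    // textFileStream picks up files newly written to the monitored directory;
    // s3n:// paths require AWS credentials in the Hadoop configuration
    val files = ssc.textFileStream("s3n://my-wiki-bucket/incoming/")
    val counts = files.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.cache() // example of the caching operation mentioned above
    counts.print()
    ssc.start()
    ssc.awaitTermination()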

Re: Spark Streaming with HBase

2014-06-30 Thread Akhil Das
Something like this??? import java.util.List; import org.apache.commons.configuration.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.client.Get; import org.apache.hadoop.hbase.client.HTable; import org.apache.hadoop.hbase.client.Result; import
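
The (truncated) imports above are Java; a Scala sketch of the same idea, doing HBase point lookups from inside a Spark job with one connection per partition (the table, column family, and qualifier are made up):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Get, HTable}
    import org.apache.hadoop.hbase.util.Bytes

    // assuming rowKeys: RDD[String] holds the HBase row keys to look up
    rowKeys.foreachPartition { keys =>
      val table = new HTable(HBaseConfiguration.create(), "my_table")
      keys.foreach { key =>
        val result = table.get(new Get(Bytes.toBytes(key)))
        val value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))
        // process value here
      }
      table.close()
    }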

How to control a Spark application's (executor) memory usage per node?

2014-06-30 Thread hansen
Hi, When I send the following statements in spark-shell: val file = sc.textFile("hdfs://nameservice1/user/study/spark/data/soc-LiveJournal1.txt") val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) println(count.count()) it throws an exception:

Re: Spark RDD member of class loses its value when the class is used as graph attribute

2014-06-30 Thread Daniel Darabos
Can you share some example code of what you are doing? BTW Gmail puts down your mail as spam, saying it cannot verify it came from yahoo.com. Might want to check your mail client settings. (It could be a Gmail or Yahoo bug too of course.) On Fri, Jun 27, 2014 at 4:29 PM, harsh2005_7

Re: Spark RDD member of class loses its value when the class is used as graph attribute

2014-06-30 Thread harsh2005_7
The code base is huge, but I am sharing a snapshot of it which I think might give you some idea. Here is my class Player, which is supposed to be my vertex attribute: *class Player(var RvalRdd: RDD[((Int, Int), Double)], Slope_m: Double) extends Serializable { //Some code here }* As you can see
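
One thing worth noting (a general Spark constraint, not specific to this thread): an RDD is a driver-side handle and cannot be serialized into another RDD's elements, which is exactly what a GraphX vertex attribute is. A sketch of the usual workaround, materializing the data and storing plain values in the attribute (field names are made up):

    import org.apache.spark.graphx.{Edge, Graph}

    // store the materialized map, not the RDD handle, in the vertex attribute
    case class Player(rvals: Map[(Int, Int), Double], slopeM: Double)

    val verts = sc.parallelize(Seq(
      (1L, Player(Map((0, 0) -> 1.0), 0.5)),
      (2L, Player(Map((0, 1) -> 2.0), 0.7))))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "plays")))
    val graph = Graph(verts, edges)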

Configuration properties for Spark

2014-06-30 Thread M Singh
Hi: Is there a comprehensive properties list (with permissible/default values) for Spark? Thanks Mans

Callbacks on freeing up of RDDs

2014-06-30 Thread Jaideep Dhok
Hi all, I am trying to create a custom RDD class for result set of queries supported in InMobi Grill (http://inmobi.github.io/grill/) Each result set has a schema (similar to Hive's TableSchema) and a path in HDFS containing the result set data. An easy way of doing this would be to create a
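
A skeletal custom RDD, for the sake of illustration (the class and field names are hypothetical; the real schema and path handling would come from Grill):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class ResultSetPartition(val index: Int) extends Partition

    // hypothetical RDD over a result set stored at an HDFS path
    class ResultSetRDD(sc: SparkContext, path: String, numParts: Int)
        extends RDD[String](sc, Nil) {

      override def getPartitions: Array[Partition] =
        (0 until numParts).map(new ResultSetPartition(_)).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        // read this partition's slice of the HDFS data here
        Iterator.empty
      }

      // unpersist is one hook where external resources could be released (an
      // assumption, not an established Spark callback for RDD garbage collection)
      override def unpersist(blocking: Boolean = true): this.type = {
        // delete or release the temporary result set here
        super.unpersist(blocking)
      }
    }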

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-06-30 Thread MEETHU MATHEW
Hi all, I reinstalled Spark and rebooted the system, but I am still not able to start the workers. It's throwing the following exception: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 I suspect the problem is with 192.168.125.174:0.
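
Failed-to-bind errors on a standalone cluster are often (though not always) a hostname-resolution issue; one thing to try is pinning the addresses in conf/spark-env.sh (the addresses below are examples, not from the thread):

    # conf/spark-env.sh
    export SPARK_LOCAL_IP=192.168.125.174
    export SPARK_MASTER_IP=192.168.125.174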

Re: How to control a Spark application's (executor) memory usage per node?

2014-06-30 Thread MEETHU MATHEW
Hi, Try setting driver-java-options with spark-submit, or set spark.executor.extraJavaOptions in spark-defaults.conf. Thanks  Regards, Meethu M On Monday, 30 June 2014 1:28 PM, hansen han...@neusoft.com wrote: Hi, When I send the following statements in spark-shell: val file =
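
For executor memory specifically (the original question), spark-submit also takes it directly; the values below are examples only:

    spark-submit --master spark://master:7077 \
      --executor-memory 4g \
      --driver-java-options "-Dspark.akka.frameSize=64" \
      --class com.example.MyApp myapp.jar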

TaskNotSerializable when invoking KMeans.run

2014-06-30 Thread Daniel Micol
Hello, I’m trying to use KMeans with MLLib but am getting a TaskNotSerializable error. I’m using Spark 0.9.1 and invoking the KMeans.run method with k = 2 and numPartitions = 200. Has anyone seen this error before and know what could be the reason for this? Thanks, Daniel

Serializer or Out-of-Memory issues?

2014-06-30 Thread Sguj
I'm trying to perform operations on a large RDD, that ends up being about 1.3 GB in memory when loaded in. It's being cached in memory during the first operation, but when another task begins that uses the RDD, I'm getting this error that says the RDD was lost: 14/06/30 09:48:17 INFO
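
When a cached RDD is larger than the memory available for storage, blocks get evicted and must be recomputed, which can surface as lost-block errors. A hedged workaround sketch is to allow spilling to disk rather than using the default MEMORY_ONLY level (the path is made up):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///path/to/input")
    data.persist(StorageLevel.MEMORY_AND_DISK) // evicted blocks spill to disk
    data.count()                               // first action materializes the cache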

Spark 1.0 docs out of sync?

2014-06-30 Thread Diana Carroll
I'm hoping someone can clear up some confusion for me. When I view the Spark 1.0 docs online (http://spark.apache.org/docs/1.0.0/) they are different than the docs which are packaged with the Spark 1.0.0 download (spark-1.0.0.tgz). In particular, in the online docs, there's a single merged Spark

Re: TaskNotSerializable when invoking KMeans.run

2014-06-30 Thread Xiangrui Meng
Could you post the code snippet and the error stack trace? -Xiangrui On Mon, Jun 30, 2014 at 7:03 AM, Daniel Micol dmi...@gmail.com wrote: Hello, I’m trying to use KMeans with MLLib but am getting a TaskNotSerializable error. I’m using Spark 0.9.1 and invoking the KMeans.run method with k = 2

spark streaming counter metrics

2014-06-30 Thread Chen Song
I am new to spark streaming and wondering if spark streaming tracks counters (e.g., how many rows in each consumer, how many rows routed to an individual reduce task, etc.) in any form so I can get an idea of how data is skewed? I checked spark job page but don't seem to find any. -- Chen Song
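
Spark Streaming does not appear to surface per-receiver or per-task row counters on the job page; accumulators are one do-it-yourself sketch, assuming an existing SparkContext sc and a made-up socket source:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val rowCount = sc.accumulator(0L)              // counts rows across batches
    val lines = ssc.socketTextStream("host", 9999) // hypothetical source
    lines.foreachRDD { rdd => rowCount += rdd.count() } // runs on the driver each batch
    ssc.start()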

Re: Could not compute split, block not found

2014-06-30 Thread Bill Jay
Tobias, Your suggestion is very helpful. I will definitely investigate it. Just curious. Suppose the batch size is t seconds. In practice, does Spark always require the program to finish processing the data of t seconds within t seconds' processing time? Can Spark begin to consume the new batch

Help understanding spark.task.maxFailures

2014-06-30 Thread Yana Kadiyska
Hi community, this one should be an easy one: I have left spark.task.maxFailures at its default (which should be 4). I see a job that shows the following statistics for Tasks: Succeeded/Total 7109/819 (1 failed) So there were 819 tasks to start with. I have 2 executors in that cluster. From

Spark 1.0: Reading JSON LZH Compressed File

2014-06-30 Thread Uddin, Nasir M.
Hi, Spark 1.0 has been installed as Standalone - but it can't read any compressed (CMX/Snappy) or Sequence files residing on HDFS. The key notable message is: Unable to load native-hadoop library. Other related messages are - Caused by: java.lang.IllegalStateException: Cannot load
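
The "Unable to load native-hadoop library" warning usually means the JVM can't find libhadoop/libsnappy; one hedged fix is to point Spark at the Hadoop native directory (the paths below are examples, adjust to your installation):

    # conf/spark-env.sh
    export SPARK_LIBRARY_PATH=/opt/hadoop/lib/native
    export LD_LIBRARY_PATH=/opt/hadoop/lib/native:$LD_LIBRARY_PATH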

RE: Serialization of objects

2014-06-30 Thread Sameer Tilak
Hi everyone, I was able to solve this issue. For now I changed the library code and added the following to the class com.wcohen.ss.BasicStringWrapper: public class BasicStringWrapper implements Serializable However, I am still curious to know how to get around the issue when you don't have
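
When the library source can't be changed, one common pattern (a sketch, not from this thread) is a serializable holder that keeps the library object transient and rebuilds it lazily on each executor:

    // generic holder: 'make' must capture only serializable values
    class LazySerializable[T](make: () => T) extends Serializable {
      @transient lazy val value: T = make()
    }

    // usage, assuming BasicStringWrapper's single-String constructor:
    // val w = new LazySerializable(() => new com.wcohen.ss.BasicStringWrapper("abc"))
    // rdd.map(s => w.value.toString + s)

Registering the class with Kryo (spark.serializer set to the KryoSerializer) is another route, since Kryo does not require java.io.Serializable.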

Spark 1.0 and Logistic Regression Python Example

2014-06-30 Thread Sam Jacobs
Hi, I modified the example code for logistic regression to compute the error in classification. Please see below. However, the code is failing when it makes a call to: labelsAndPreds.filter(lambda (v, p): v != p).count() with the error message (something related to numpy or dot product):

odd caching behavior or accounting

2014-06-30 Thread Brad Miller
Hi All, I am resending this message because I suspect the original may have been blocked from the mailing list due to attachments. Note that the mail does appear on the apache archives

Re: Could not compute split, block not found

2014-06-30 Thread Tobias Pfeiffer
Bill, let's say the processing time is t' and the window size t. Spark does not *require* t' < t. In fact, for *temporary* peaks in your streaming data, I think the way Spark handles it is very nice, in particular since 1) it does not mix up the order in which items arrived in the stream, so items

Re: TaskNotSerializable when invoking KMeans.run

2014-06-30 Thread Jaideep Dhok
Hi Daniel, I also faced the same issue when using the Naive Bayes classifier in MLlib. I was able to solve it by making all fields in the calling object either transient or serializable. Spark will print which class's object it was not able to serialize in the error message. That can give you a
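
A sketch of that pattern with made-up names, written against the Spark 1.0 MLlib API (in 0.9.x, KMeans.train took RDD[Array[Double]] rather than RDD[Vector]):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    class Runner(@transient val sc: SparkContext) extends Serializable {
      val k = 2 // plain serializable fields ship fine with the closure
      def run() = {
        val data = sc.textFile("hdfs:///points") // hypothetical input
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
        KMeans.train(data, k, 20) // sc stays on the driver thanks to @transient
      }
    }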

Re: History Server rendered page not suitable for load balancing

2014-06-30 Thread elyast
Done :)

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-30 Thread elyast
Hi Andrew, I'm actually using spark-submit, and I tried using spark.executor.extraJavaOpts to configure the Tachyon client to connect to the Tachyon HA master; however, the configuration settings were not picked up. On the other hand, when I set the same Tachyon configuration parameters through
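
For reference, the Spark 1.0 property name is spark.executor.extraJavaOptions (note the full spelling), and one way to pass Tachyon client settings is via spark-defaults.conf; the Tachyon keys below are assumptions about the client's system properties, not confirmed in this thread:

    # conf/spark-defaults.conf
    spark.executor.extraJavaOptions -Dtachyon.usezookeeper=true -Dtachyon.zookeeper.address=zk1:2181,zk2:2181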

Re: spark job stuck when running on mesos fine grained mode

2014-06-30 Thread elyast
Hi Prabeesh, I've recently moved to Mesos 0.18.2 and Spark 1.0; so far no problems in fine-grained mode, even for GraphX or MLlib workflows. If you have specific code snippets I can try them out. Best regards Lukasz

Re: Spark 1.0 and Logistic Regression Python Example

2014-06-30 Thread Xiangrui Meng
You were using an old version of numpy, 1.4? I think this is fixed in the latest master. Try to replace vec.dot(target) by numpy.dot(vec, target), or use the latest master. -Xiangrui On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote: Hi, I modified the example code for