Re: Kryo deserialisation error

2014-07-24 Thread Guillaume Pitel
Hi, We've got the same problem here (randomly happens): Unable to find class: [garbled binary class name, as printed in the original error]

spark streaming actor receiver doesn't play well with kryoserializer

2014-07-24 Thread Alan Ngai
It looks like when you configure SparkConf to use the KryoSerializer in combination with an ActorReceiver, bad things happen. I modified the ActorWordCount example program from val sparkConf = new SparkConf().setAppName("ActorWordCount") to val sparkConf = new SparkConf()
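
A minimal sketch of the Kryo configuration in question, assuming a hypothetical registrator class; this only shows the settings, not a fix for the ActorReceiver interaction:

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName("ActorWordCount")
      // switch the default Java serializer for Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // optionally register the classes the receiver sends ("com.example.MyRegistrator" is a placeholder)
      .set("spark.kryo.registrator", "com.example.MyRegistrator")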

Re: Help in merging a RDD agaisnt itself using the V of a (K,V).

2014-07-24 Thread Sean Owen
Yeah, reduce() will leave you with one big collection of sets on the driver. Maybe the set of all identifiers isn't so big -- even a hundred million Longs isn't so much. I'm glad to hear cartesian works, but can that scale? You're making an RDD of N^2 elements initially, which is just vast. On Thu,

Spark Function setup and cleanup

2014-07-24 Thread Yosi Botzer
Hi, I am using the Java API of Spark. I wanted to know if there is a way to run some code in a manner like the setup() and cleanup() methods of Hadoop Map/Reduce. The reason I need it is that I want to read something from the DB according to each record I scan in my Function, and I
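
There is no direct setup()/cleanup() hook, but mapPartitions can play that role: open a resource once per partition, use it for every record, and close it before returning. A minimal sketch in Scala (the Java API's mapPartitions works the same way); the JDBC URL, query, and the `records` RDD of IDs are hypothetical:

    import java.sql.DriverManager

    val enriched = records.mapPartitions { iter =>
      // "setup": one connection per partition
      val conn = DriverManager.getConnection("jdbc:mysql://dbhost/mydb")
      val stmt = conn.prepareStatement("SELECT info FROM users WHERE id = ?")
      val out = iter.map { id =>
        stmt.setString(1, id)
        val rs = stmt.executeQuery()
        val info = if (rs.next()) rs.getString(1) else ""
        rs.close()
        (id, info)
      }.toList            // materialize before closing the connection
      // "cleanup"
      stmt.close()
      conn.close()
      out.iterator
    }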

Re: streaming sequence files?

2014-07-24 Thread Sean Owen
Can you just call fileStream or textFileStream in the second app, to consume files that appear in HDFS / Tachyon from the first job? On Thu, Jul 24, 2014 at 2:43 AM, Barnaby bfa...@outlook.com wrote: If I save an RDD as a sequence file such as: val wordCounts = words.map(x => (x,
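
A minimal sketch of such a consuming app, assuming the first job writes (Text, IntWritable) sequence files into a hypothetical HDFS directory; fileStream picks up new files as they appear:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WordCountConsumer")
    val ssc = new StreamingContext(conf, Seconds(30))
    val counts = ssc.fileStream[Text, IntWritable, SequenceFileInputFormat[Text, IntWritable]](
      "hdfs://masteripaddress:9000/files/WordCounts")
    counts.map { case (k, v) => (k.toString, v.get) }.print()
    ssc.start()
    ssc.awaitTermination()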

Starting with spark

2014-07-24 Thread Sameer Sayyed
Hello All, I am a new user of Spark. I am using *cloudera-quickstart-vm-5.0.0-0-vmware* to execute the sample examples of Spark. I am very sorry for the silly and basic question, but I am not able to deploy and execute the sample examples. Please suggest *how to start with Spark*. Please help me

save to HDFS

2014-07-24 Thread lmk
Hi, I have a Scala application which I have launched on a Spark cluster. I have the following statement trying to save to a folder on the master: saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("hdfs://masteripaddress:9000/root/test-app/test1/") The application is executed successfully and

Re: save to HDFS

2014-07-24 Thread Akhil Das
Are you sure the RDD that you were saving isn't empty!? Are you seeing a _SUCCESS file in this location? hdfs://masteripaddress:9000/root/test-app/test1/ (Do hadoop fs -ls hdfs://masteripaddress:9000/root/test-app/test1/) Thanks Best Regards On Thu, Jul 24, 2014 at 4:24 PM, lmk

Re: Starting with spark

2014-07-24 Thread Akhil Das
Here's the complete overview: http://spark.apache.org/docs/latest/ and here's the quick start guide: http://spark.apache.org/docs/latest/quick-start.html I would suggest downloading the Spark pre-compiled binaries

Re: save to HDFS

2014-07-24 Thread lmk
Hi Akhil, I am sure that the RDD that I saved is not empty. I have tested it using take. But is there no way that I can see this saved physically like we do in the normal context? Can't I view this folder as I am already logged into the cluster? And, should I run hadoop fs -ls

Re: save to HDFS

2014-07-24 Thread Akhil Das
This piece of code saveAsHadoopFile[TextOutputFormat[NullWritable,Text]]("hdfs://masteripaddress:9000/root/test-app/test1/") saves the RDD into HDFS, and yes you can physically see the files using the hadoop command (hadoop fs -ls /root/test-app/test1 - yes, you need to log in to the cluster). In

Re: save to HDFS

2014-07-24 Thread lmk
Thanks Akhil. I was able to view the files. Actually I was trying to list the same using regular ls and since it did not show anything I was concerned. Thanks for showing me the right direction. Regards, lmk -- View this message in context:

Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
If you want to connect to DB in program, you can use JdbcRDD ( https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ) 2014-07-24 18:32 GMT+08:00 Yosi Botzer yosi.bot...@gmail.com: Hi, I am using the Java api of Spark. I wanted to know if there
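
A minimal sketch of JdbcRDD usage, with a hypothetical MySQL URL and table; the query must contain two '?' placeholders that JdbcRDD fills with the partition bounds:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val users = new JdbcRDD(
      sc,                                                       // existing SparkContext
      () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",   // bound placeholders are required
      1L, 1000000L, 10,                                         // lower bound, upper bound, partitions
      rs => (rs.getLong(1), rs.getString(2)))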

Re: new error for me

2014-07-24 Thread phoenix bai
I am currently facing the same problem. error snapshot as below: 14-07-24 19:15:30 WARN [pool-3-thread-1] SendingConnection: Error finishing connection to r64b22034.tt.net/10.148.129.84:47525 java.net.ConnectException: Connection timed out at

Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Thank you Nishkam, I have read your code. So, for the sake of my understanding, it seems that for each Spark context there is one executor per node? Can anyone confirm this? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jul 24, 2014 at 6:12 AM, Nishkam

Re: Spark Function setup and cleanup

2014-07-24 Thread Yosi Botzer
In my case I want to reach HBase. For every record with a userId I want to get some extra information about the user and add it to the result record for further processing. On Thu, Jul 24, 2014 at 9:11 AM, Yanbo Liang yanboha...@gmail.com wrote: If you want to connect to DB in program, you can use
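
For the HBase case the same per-partition pattern as in the sketch above applies; a minimal sketch using the classic HBase client API, where the table, column family, qualifier, and the `userIds` RDD of String keys are all hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Get, HTable}
    import org.apache.hadoop.hbase.util.Bytes

    val enriched = userIds.mapPartitions { iter =>
      val table = new HTable(HBaseConfiguration.create(), "users")   // one table handle per partition
      val out = iter.map { userId =>
        val result = table.get(new Get(Bytes.toBytes(userId)))
        val extra = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("segment")))
        (userId, extra)
      }.toList
      table.close()
      out.iterator
    }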

GraphX for pyspark?

2014-07-24 Thread Eric Friedman
I understand that GraphX is not yet available for pyspark. I was wondering if the Spark team has set a target release and timeframe for doing that work? Thank you, Eric

Re: Starting with spark

2014-07-24 Thread Marco Shaw
First thing... Go into the Cloudera Manager and make sure that the Spark service (master?) is started. Marco On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed sam.sayyed...@gmail.com wrote: Hello All, I am new user of spark, I am using *cloudera-quickstart-vm-5.0.0-0-vmware* for execute

Spark got stuck with a loop

2014-07-24 Thread Denis RP
Hi, I ran spark standalone mode on a cluster and it went well for approximately one hour, then the driver's output stopped with the following: 14/07/24 08:07:36 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 36 to spark@worker5.local:47416 14/07/24 08:07:36 INFO

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-24 Thread lihu
Which code did you use? Is it caused by your own code or by something in Spark itself? On Tue, Jul 22, 2014 at 8:50 AM, hsy...@gmail.com hsy...@gmail.com wrote: I have the same problem On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote: Hi, Everyone. I have a piece of

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-24 Thread Tathagata Das
You will have to define your own stream-to-iterator function and use the socketStream. The function should return custom delimited object as bytes are continuously coming in. When data is insufficient, the function should block. TD On Jul 23, 2014 6:52 PM, kytay kaiyang@gmail.com wrote: Hi
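
A minimal sketch of such a converter, assuming '|'-delimited records on the socket and an existing StreamingContext `ssc`; the read loop blocks until bytes arrive, as TD describes:

    import java.io.{BufferedReader, InputStream, InputStreamReader}
    import org.apache.spark.storage.StorageLevel

    def pipeDelimited(in: InputStream): Iterator[String] = {
      val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
      Iterator.continually {
        val sb = new StringBuilder
        var c = reader.read()                   // blocks until data is available
        while (c != -1 && c.toChar != '|') {    // accumulate one '|'-delimited record
          sb.append(c.toChar)
          c = reader.read()
        }
        sb.toString
      }.takeWhile(_.nonEmpty)                   // an empty record signals end of stream
    }

    val records = ssc.socketStream("localhost", 9999, pipeDelimited, StorageLevel.MEMORY_AND_DISK_SER)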

Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Great - thanks for the clarification Aaron. The offer stands for me to write some documentation and an example that covers this without leaving *any* room for ambiguity. -- Martin Goodson | VP Data Science (0)20 3397 1240 On Thu, Jul 24, 2014 at 6:09 PM, Aaron

Re: Configuring Spark Memory

2014-07-24 Thread Aaron Davidson
Whoops, I was mistaken in my original post last year. By default, there is one executor per node per Spark Context, as you said. spark.executor.memory is the amount of memory that the application requests for each of its executors. SPARK_WORKER_MEMORY is the amount of memory a Spark Worker is
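
A minimal sketch of the two knobs, assuming a standalone cluster whose workers were started with SPARK_WORKER_MEMORY=16g; the application then requests 8g for its single executor on each node:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MemoryExample")
      // per-application request; must fit within what each Worker offers
      .set("spark.executor.memory", "8g")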

Re: Starting with spark

2014-07-24 Thread Jerry
Hi Sameer, I think it is much easier to start using Spark in standalone mode on a single machine. Last time I tried Cloudera Manager to deploy Spark, it wasn't very straightforward and I hit a couple of obstacles along the way. However, standalone mode is a very easy way to start exploring Spark.

Re: akka 2.3.x?

2014-07-24 Thread yardena
We are also eagerly waiting for akka 2.3.4 support as we use Akka (and Spray) directly in addition to Spark. Yardena -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/akka-2-3-x-tp10513p10597.html Sent from the Apache Spark User List mailing list archive

Re: Starting with spark

2014-07-24 Thread Kostiantyn Kudriavtsev
Hi Sam, I tried Spark on Cloudera a couple of months ago, and there were a lot of issues… Fortunately, I was able to switch to Hortonworks and everything works perfectly. In general, you can try two modes: standalone and via YARN. Personally, I found using Spark via YARN more comfortable, especially for

rdd.saveAsTextFile blows up

2014-07-24 Thread Eric Friedman
I'm trying to run a simple pipeline using PySpark, version 1.0.1. I've created an RDD over a parquetFile and am mapping the contents with a transformer function, and now wish to write the data out to HDFS. All of the executors fail with the same stack trace (below). I do get a directory on HDFS,

Re: akka 2.3.x?

2014-07-24 Thread Matei Zaharia
This is being tracked here: https://issues.apache.org/jira/browse/SPARK-1812, since it will also be needed for cross-building with Scala 2.11. Maybe we can do it before that. Probably too late for 1.1, but you should open an issue for 1.2. In that JIRA I linked, there's a pull request from a

GraphX canonical conflation issues

2014-07-24 Thread e5c
Hi there, This issue has been mentioned in: http://apache-spark-user-list.1001560.n3.nabble.com/Java-IO-Stream-Corrupted-Invalid-Type-AC-td6925.html However I'm starting a new thread since the issue is distinct from the above topic's designated subject. I'm test-running canonical conflation

continuing processing when errors occur

2014-07-24 Thread Art Peel
Our system works with RDDs generated from Hadoop files. It processes each record in a Hadoop file and for a subset of those records generates output that is written to an external system via RDD.foreach. There are no dependencies between the records that are processed. If writing to the external

Getting the number of slaves

2014-07-24 Thread Nicolas Mai
Hi, Is there a way to get the number of slaves/workers during runtime? I searched online but didn't find anything :/ The application I'm working on will run on different clusters corresponding to different deployment stages (beta - prod). It would be great to get the number of slaves currently in

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, I will try to reproduce the problem. Thanks, Yin On Wed, Jul 23, 2014 at 11:32 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi Michael, Sorry for the delayed response. I'm using Spark 1.0.1 (pre-built version for hadoop 1). I'm running spark programs on

Spark Training at Scala By the Bay with Databricks, Fast Track to Scala

2014-07-24 Thread Alexy Khrabrov
Scala By the Bay (www.scalabythebay.org) is happy to confirm that our Spark training on August 11-12 will be run by Databricks and By the Bay together. It will be focused on Scala, and is the first Spark Training at a major Scala venue. Spark is written in Scala, with the unified data pipeline

Re: Getting the number of slaves

2014-07-24 Thread Evan R. Sparks
Try sc.getExecutorStorageStatus().length. SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will give you back an object per executor - the StorageStatus objects are what drive a lot of the Spark Web UI.

Emacs Setup Anyone?

2014-07-24 Thread Steve Nunez
Anyone out there have a good configuration for emacs? Scala-mode sort of works, but I'd love to see a fully-supported spark-mode with an inferior shell. Searching didn't turn up much of anything. Any emacs users out there? What setup are you using? Cheers, - SteveN -- CONFIDENTIALITY

Kmeans: set initial centers explicitly

2014-07-24 Thread SK
Hi, The mllib.clustering.KMeans implementation supports a random or parallel initialization mode to pick the initial centers. Is there a way to specify the initial centers explicitly? It would be useful to have a setCenters() method where we can explicitly specify the initial centers. (For e.g. R

Re: continuing processing when errors occur

2014-07-24 Thread Imran Rashid
Hi Art, I have some advice that isn't Spark-specific at all, so it doesn't *exactly* address your questions, but you might still find it helpful. I think using an implicit to add your retrying behavior might be useful. I can think of two options: 1. enriching RDD itself, e.g. to add a

Re: Configuring Spark Memory

2014-07-24 Thread John Omernik
So this is good information for standalone, but how is memory distributed within Mesos? There's coarse-grained mode where the executor stays active, or there's fine-grained mode where it appears each task is its own process in Mesos. How do memory allocations work in these cases? Thanks! On Thu,
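
A hedged sketch of the Mesos-side settings (the master URL is a placeholder); spark.mesos.coarse toggles between the two modes, and spark.executor.memory is the per-node request in either case:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("mesos://mesos-master:5050")
      .setAppName("MesosMemoryExample")
      .set("spark.mesos.coarse", "true")       // coarse-grained: long-lived executors
      .set("spark.executor.memory", "8g")      // memory requested from Mesos per node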

Re: continuing processing when errors occur

2014-07-24 Thread Imran Rashid
Whoops! Just realized I was retrying the function even on success. Didn't pay enough attention to the output from my calls. Slightly updated definitions: class RetryFunction[-A](nTries: Int, f: A => Unit) extends Function[A, Unit] { def apply(a: A): Unit = { var tries = 0; var success =
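
A hedged completion of the truncated definition above, retrying only until the first success (writeToExternalSystem in the usage comment is a placeholder):

    class RetryFunction[-A](nTries: Int, f: A => Unit) extends Function[A, Unit] {
      def apply(a: A): Unit = {
        var tries = 0
        var success = false
        while (!success && tries < nTries) {
          tries += 1
          try {
            f(a)
            success = true
          } catch {
            case e: Exception =>
              println(s"failed attempt $tries/$nTries on $a: $e")
          }
        }
      }
    }

    // usage, e.g. wrapping the side-effecting function passed to foreach:
    // rdd.foreach(new RetryFunction(3, writeToExternalSystem))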

KMeans: expensiveness of large vectors

2014-07-24 Thread durin
As a source, I have a text file with n rows that each contain m comma-separated integers. Each row is then converted into a feature vector with m features. I've noticed that, given the same total file size and number of features, a larger number of columns is much more expensive for training

cache changes precision

2014-07-24 Thread Ron Gonzalez
Hi, I'm doing the following: def main(args: Array[String]) = { val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]") val sc = new SparkContext(sparkConf) val conf = new Configuration() val job = new Job(conf) val path = new Path("/tmp/a.avro"); val

mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread abhiguruvayya
Can any one help me understand the key difference between mapToPair vs flatMapToPair vs flatMap functions and also when to apply these functions in particular. -- View this message in context:

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-24 Thread Tathagata Das
You can set the Java option -Dsun.io.serialization.extendedDebugInfo=true to have more information about the object printed. It will help you trace down how the SparkContext is getting included in some kind of closure. TD On Thu, Jul 24, 2014 at 9:48 AM, lihu lihu...@gmail.com wrote:
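
A minimal sketch of passing the flag through SparkConf (for the driver in client mode the JVM is already running, so --driver-java-options on spark-submit is the more reliable route):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")
      .set("spark.driver.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")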

Re: Getting the number of slaves

2014-07-24 Thread Nicolas Mai
Thanks, this is what I needed :) I should have searched more... Something I noticed though: after the SparkContext is initialized, I had to wait for a few seconds until sc.getExecutorStorageStatus.length returns the correct number of workers in my cluster (otherwise it returns 1, for the
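
A minimal sketch of working around that delay by polling until the expected number of executors has registered; expectedWorkers and the timeout are hypothetical, and the driver's own entry accounts for the "+ 1":

    import org.apache.spark.SparkContext

    def waitForExecutors(sc: SparkContext, expectedWorkers: Int, timeoutMs: Long = 30000L): Int = {
      val deadline = System.currentTimeMillis + timeoutMs
      // getExecutorStorageStatus includes the driver, hence the + 1
      while (sc.getExecutorStorageStatus.length < expectedWorkers + 1 &&
             System.currentTimeMillis < deadline) {
        Thread.sleep(500)
      }
      sc.getExecutorStorageStatus.length - 1
    }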

Re: streaming sequence files?

2014-07-24 Thread Barnaby
I have the streaming program writing sequence files. I can find one of the files and load it in the shell using: scala> val rdd = sc.sequenceFile[String, Int]("tachyon://localhost:19998/files/WordCounts/20140724-213930") 14/07/24 21:47:50 INFO storage.MemoryStore: ensureFreeSpace(32856) called

Re: NotSerializableException in Spark Streaming

2014-07-24 Thread Nicholas Chammas
Yep, here goes! Here are my environment vitals: - Spark 1.0.0 - EC2 cluster with 1 slave spun up using spark-ec2 - twitter4j 3.0.3 - spark-shell called with --jars argument to load spark-streaming-twitter_2.10-1.0.0.jar as well as all the twitter4j jars. Now, while I’m in the

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, Have you tried the current branch 1.0? If not, can you give it a try and see if the problem can be resolved? Thanks, Yin On Thu, Jul 24, 2014 at 11:17 AM, Yin Huai yh...@databricks.com wrote: Hi Sarath, I will try to reproduce the problem. Thanks, Yin On Wed, Jul 23,

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread Matei Zaharia
The Pair ones return a JavaPairRDD, which has additional operations on key-value pairs. Take a look at http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs for details. Matei On Jul 24, 2014, at 3:41 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: Can
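
In Scala the distinction collapses to map versus flatMap returning tuples; the Java API needs the *ToPair variants only to produce a JavaPairRDD. A small illustrative sketch (the data and an existing SparkContext `sc` are assumed):

    val lines = sc.parallelize(Seq("a b a", "c"))

    // mapToPair analogue: exactly one key-value pair per input element
    val lineLengths = lines.map(line => (line, line.length))

    // flatMapToPair analogue: zero or more key-value pairs per input element
    val wordCounts = lines.flatMap(line => line.split(" ").map(word => (word, 1)))

    // flatMap: zero or more plain values per input element
    val words = lines.flatMap(_.split(" "))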

Re: Spark Function setup and cleanup

2014-07-24 Thread Yanbo Liang
You can refer to this topic: http://www.mapr.com/developercentral/code/loading-hbase-tables-spark 2014-07-24 22:32 GMT+08:00 Yosi Botzer yosi.bot...@gmail.com: In my case I want to reach HBase. For every record with userId I want to get some extra information about the user and add it to result

Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-24 Thread Jianshi Huang
I can successfully run my code in local mode using spark-submit (--master local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode. Any hints on what the problem is? Is it a closure serialization problem? How can I debug it? Your answers would be very helpful. 14/07/25 11:48:14

Re: GraphX Pragel implementation

2014-07-24 Thread Ankur Dave
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar toga...@gmail.com wrote: While using the Pregel API for iterations, how can I figure out which superstep the iteration is currently in? The Pregel API doesn't currently expose this, but it's very straightforward to modify Pregel.scala
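
One workaround that avoids patching Pregel.scala is to keep a superstep counter inside the vertex attribute and bump it in vprog, so both vprog and sendMsg can read it; a hedged sketch with made-up update logic:

    import org.apache.spark.graphx._

    def run(graph: Graph[Double, Double], maxIter: Int): Graph[(Int, Double), Double] = {
      val g = graph.mapVertices((_, v) => (0, v))            // attribute = (superstep, value)
      Pregel(g, initialMsg = 0.0, maxIterations = maxIter)(
        vprog = (id, attr, msg) => {
          val (step, value) = attr
          (step + 1, math.max(value, msg))                   // step tells which superstep we're in
        },
        sendMsg = t => {
          val (srcStep, srcValue) = t.srcAttr
          Iterator((t.dstId, srcValue * srcStep))            // superstep is visible here too
        },
        mergeMsg = (a, b) => math.max(a, b)
      )
    }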