Re: Could not compute split, block not found

2014-07-01 Thread Tathagata Das
Are you by any chance using only memory in the storage level of the input streams? TD On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, let's say the processing time is t' and the window size t. Spark does not *require* t' < t. In fact, for *temporary* peaks in

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread MEETHU MATHEW
Hi, I did netstat -na | grep 192.168.125.174 and it's showing 192.168.125.174:7077 LISTEN (after starting the master). I tried to execute the following script from the slaves manually, but it ends up with the same exception and log. This script is internally executing the java command.

Re: Serialization of objects

2014-07-01 Thread Aaron Davidson
If you want to stick with Java serialization and need to serialize a non-Serializable object, your best choices are probably to either subclass it with a Serializable one or wrap it in a class of your own which implements its own writeObject/readObject methods (see here:
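The wrapper approach Aaron describes can be sketched in plain Scala. `Legacy` below is a hypothetical stand-in for the non-Serializable third-party class; the wrapper marks the field `@transient` and uses custom writeObject/readObject to persist only the state needed to rebuild it:

```scala
import java.io._

// Hypothetical stand-in for a third-party class you cannot make Serializable.
class Legacy(val value: Int)

// Serializable wrapper: writeObject/readObject persist only the state
// needed to reconstruct the wrapped object on deserialization.
class LegacyWrapper(@transient var legacy: Legacy) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    out.writeInt(legacy.value) // write just the reconstructable state
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    legacy = new Legacy(in.readInt()) // rebuild the non-serializable object
  }
}

// Round trip through Java serialization to check it works.
val bytes = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bytes)
oos.writeObject(new LegacyWrapper(new Legacy(42)))
oos.close()
val restored = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  .readObject().asInstanceOf[LegacyWrapper]
```

The same pattern applies to closures shipped to executors: the wrapper, not the wrapped class, is what Spark serializes.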

build spark assign version number myself?

2014-07-01 Thread majian
Hi, all: I'm compiling Spark by executing './make-distribution.sh --hadoop 0.20.205.0 --tgz'. After the compilation completed I found that the default version number is 1.1.0-SNAPSHOT, i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz. Does anyone know how to assign the version number myself?

issue with running example code

2014-07-01 Thread Gurvinder Singh
Hi, I am having an issue running the Scala example code. I have tested and can successfully run the Python example code, but when I run the Scala code I get this error: java.lang.ClassCastException: cannot assign instance of org.apache.spark.examples.SparkPi$$anonfun$1 to field

Questions about disk IOs

2014-07-01 Thread Charles Li
Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G. The weight size is around 1,300,000. I am quite confused that the performance is very close whether the data is cached or not. The program is simple: points = sc.hadoopFile(int,

Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
You can specify a custom name with the --name option. It will still contain 1.1.0-SNAPSHOT, but at least you can specify your company name. If you want to replace SNAPSHOT with your company name, you will have to edit make-distribution.sh and replace the following line: VERSION=$(mvn

Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
Sorry, there's a typo in my previous post, the line should read: VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep -v INFO | tail -n 1 | sed -e "s/SNAPSHOT/$COMPANYNAME/g") On Tue, Jul 1, 2014 at 10:35 AM, Guillaume Ballet gbal...@gmail.com wrote: You can specify a

RSpark installation on Windows

2014-07-01 Thread Stuti Awasthi
Hi All, Can we install RSpark on a Windows setup of R and use it to access a remote Spark cluster? Thanks Stuti Awasthi

Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi, The window size in Spark Streaming is time based, which means we have a different number of elements in each window. For example, if you have two streams (might be more) which are related to each other and you want to compare them in a specific time interval, I am not clear how it will work.

Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-01 Thread Honey Joshi
Hi, I am trying to run a project which takes data as a DStream and dumps the data in the Shark table after various operations. I am getting the following error : Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception

Window Size

2014-07-01 Thread Laeeq Ahmed
Hi, The window size in Spark Streaming is time based, which means we have a different number of elements in each window. For example, if you have two streams (might be more) which are related to each other and you want to compare them in a specific time interval, I am not clear how it will

java.io.FileNotFoundException: http://IP/broadcast_1

2014-07-01 Thread Honey Joshi
Hi All, We are using a Shark table to dump the data, and we are getting the following error: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 1.0:0 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: http://IP/broadcast_1) We don't know

Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW
Hi, I am using Spark Standalone mode with one master and 2 slaves. I am not able to start the workers and connect them to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says Exception in thread main

Re: Failed to launch Worker

2014-07-01 Thread Akhil Das
Is this command working?? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 Thanks

Re: Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW
Yes.   Thanks Regards, Meethu M On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this command working?? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m

RE: Spark 1.0 and Logistic Regression Python Example

2014-07-01 Thread Sam Jacobs
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade the numpy/python for a permanent fix. My current versions of python and numpy are 2.6 and 4.1.9 respectively. Thanks, Sam -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Tuesday,

Re: Changing log level of spark

2014-07-01 Thread Philip Limbeck
We changed the loglevel to DEBUG by replacing every INFO with DEBUG in /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the cluster. There is some DEBUG output visible in both master and worker but nothing really interesting regarding stages or scheduling. Since we expected a

difference between worker and slave nodes

2014-07-01 Thread aminn_524
Can anyone explain to me the difference between a worker and a slave? I have one master and two slaves which are connected to each other; by using the jps command I can see the master in the master node and a worker in the slave nodes, but I don't have any worker in my master by using this command

Re: Changing log level of spark

2014-07-01 Thread Surendranauth Hiraman
One thing we ran into was that there was another log4j.properties earlier in the classpath. For us, it was in our MapR/Hadoop conf. If that is the case, something like the following could help you track it down. The only thing to watch out for is that you might have to walk up the classloader
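A minimal sketch of that classpath check (an illustration, not the exact walk-up-the-classloader code Surendranauth refers to): list every log4j.properties visible on the classpath, since the first hit wins and an earlier copy shadows yours:

```scala
import java.net.URL
import scala.jdk.CollectionConverters._

// List every log4j.properties on the classpath; the first one found wins,
// so an earlier copy (e.g. from a Hadoop/MapR conf dir) shadows yours.
val urls: List[URL] =
  ClassLoader.getSystemClassLoader.getResources("log4j.properties").asScala.toList
urls.foreach(println)
```

Running this inside spark-shell on the cluster shows the resolution order the executors actually see.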

Re: Changing log level of spark

2014-07-01 Thread Yana Kadiyska
Are you looking at the driver log? (e.g. Shark?). I see a ton of information in the INFO category on what query is being started, what stage is starting and which executor stuff is sent to. So I'm not sure if you're saying you see all that and you need more, or that you're not seeing this type of

Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Uddin, Nasir M.
Dear Spark Users: Spark 1.0 has been installed as Standalone, but it can't read any compressed (CMX/Snappy) or Sequence file residing on HDFS (it can read uncompressed files from HDFS). The key notable message is: Unable to load native-hadoop library. Other related messages are -

Re: Spark Streaming question batch size

2014-07-01 Thread Yana Kadiyska
Are you saying that both streams come in at the same rate and you have the same batch interval but the batch size ends up different? i.e. two datapoints both arriving at X seconds after streaming starts end up in two different batches? How do you define real time values for both streams? I am

Re: Question about VD and ED

2014-07-01 Thread Baoxu Shi(Dash)
Hi Bin, VD and ED are ClassTags; you could treat them as placeholders, like template parameter T in C++ (not 100% clear). You do not need to convert graph[String, Double] to Graph[VD,ED]. Checking ClassTag's definition in Scala could help. Best, On Jul 1, 2014, at 4:49 AM, Bin WU bw...@connect.ust.hk wrote: Hi
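A toy illustration of the point, with `fillArray` as a hypothetical stand-in: a context-bound ClassTag keeps a method generic the same way Graph[VD, ED] stays generic, so a concrete Graph[String, Double] needs no conversion at the call site:

```scala
import scala.reflect.ClassTag

// T plays the role VD/ED play in Graph[VD, ED]: a placeholder the compiler
// fills in per call site. The ClassTag lets us construct an Array[T].
def fillArray[T: ClassTag](value: T, n: Int): Array[T] = Array.fill(n)(value)

val vertexAttrs = fillArray("vertex-attr", 3) // T = String, like VD = String
val edgeAttrs   = fillArray(1.5, 2)           // T = Double, like ED = Double
```

The compiler infers T from the arguments, just as it infers VD and ED when you build a Graph from concrete vertex and edge RDDs.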

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Xiangrui Meng
You can use either bin/run-example or bin/spark-submit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at

Re: Questions about disk IOs

2014-07-01 Thread Xiangrui Meng
Try to reduce number of partitions to match the number of cores. We will add treeAggregate to reduce the communication cost. PR: https://github.com/apache/spark/pull/1110 -Xiangrui On Tue, Jul 1, 2014 at 12:55 AM, Charles Li littlee1...@gmail.com wrote: Hi Spark, I am running LBFGS on our
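The idea behind treeAggregate can be sketched in plain Scala (a toy model of the concept, not the actual PR): partial results are combined pairwise in rounds, so no single step has to merge all partitions at once:

```scala
// Toy tree-style reduction: merge partition results pairwise per round
// instead of sending every partial result to one reducer in a single step.
def treeReduce(parts: Vector[Double])(op: (Double, Double) => Double): Double = {
  var level = parts
  while (level.size > 1)
    level = level.grouped(2).map(_.reduce(op)).toVector // one round of pairwise merges
  level.head
}

val total = treeReduce(Vector(1.0, 2.0, 3.0, 4.0))(_ + _)
```

With p partitions this takes about log2(p) rounds, which is why it cuts the communication cost at the driver.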

Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Aditya Varun Chadha
I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. help! -- Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Alexis Roos
*General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014 Track A: http://www.ustream.tv/channel/track-a1 Track B: http://www.ustream.tv/channel/track-b1

Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tobias, Your explanation makes a lot of sense. Actually, I tried to use partial data on the same program yesterday. It has been up for around 24 hours and is still running correctly. Thanks! Bill On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, let's say

Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tathagata, Yes. The input stream is from Kafka and my program reads the data, keeps all the data in memory, process the data, and generate the output. Bill On Mon, Jun 30, 2014 at 11:45 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you by any change using only memory in the

spark streaming rate limiting from kafka

2014-07-01 Thread Chen Song
In my use case, if I need to stop spark streaming for a while, data would accumulate a lot on kafka topic-partitions. After I restart spark streaming job, the worker's heap will go out of memory on the fetch of the 1st batch. I am wondering if * Is there a way to throttle reading from kafka in

Re: Improving Spark multithreaded performance?

2014-07-01 Thread Kyle Ellrott
This all seems pretty hackish and a lot of trouble to get around limitations in mllib. The big limitation is that right now, the optimization algorithms work on one large dataset at a time. We need a second set of methods to work on a large number of medium sized datasets. I've started to code

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\ 2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com: In my use case, if I need to stop spark streaming for a while, data would accumulate a lot on kafka topic-partitions. After I restart spark streaming job, the worker's heap will go

Re: Re: spark table to hive table

2014-07-01 Thread John Omernik
Michael - Does Spark SQL support rlike and like yet? I am running into that same error with a basic select * from table where field like '%foo%' using the hql() function. Thanks On Wed, May 28, 2014 at 2:22 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, May 27, 2014 at 6:08 PM,

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Are these sessions recorded? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote: *General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014 Track A: http://www.ustream.tv/channel/track-a1

Re: Failed to launch Worker

2014-07-01 Thread Aaron Davidson
Where are you running the spark-class version? Hopefully also on the workers. If you're trying to centrally start/stop all workers, you can add a slaves file to the spark conf/ directory which is just a list of your hosts, one per line. Then you can just use ./sbin/start-slaves.sh to start the

Re: Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi Yana, Yes, that is what I am saying. I need both streams to be at the same pace. I do have timestamps for each datapoint. There is a way suggested by Tathagata Das in an earlier post where you have a bigger window than required and you fetch your required data from that window based on
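The timestamp-based pairing can be sketched in plain Scala (names here are illustrative, not a Spark API): bucket each point by a fixed interval and then pair up the buckets from both streams, regardless of how many elements each batch happened to contain:

```scala
// Illustrative sketch: align two timestamped streams by fixed time bucket.
case class Point(ts: Long, value: Double)

def bucket(ts: Long, widthMs: Long): Long = ts / widthMs

def align(a: Seq[Point], b: Seq[Point], widthMs: Long): Map[Long, (Seq[Point], Seq[Point])] = {
  val ga = a.groupBy(p => bucket(p.ts, widthMs))
  val gb = b.groupBy(p => bucket(p.ts, widthMs))
  // Union of buckets, so a gap in either stream still yields an (empty) slot.
  (ga.keySet ++ gb.keySet)
    .map(k => (k, (ga.getOrElse(k, Nil), gb.getOrElse(k, Nil))))
    .toMap
}

// Stream A has points at t=0 and t=1500ms; stream B only at t=500ms.
val res = align(Seq(Point(0L, 1.0), Point(1500L, 2.0)), Seq(Point(500L, 3.0)), 1000L)
```

Inside a Spark Streaming job the same grouping can be expressed per window with a key of `bucket(ts, width)` followed by a join on that key.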

why is toBreeze private everywhere in mllib?

2014-07-01 Thread Koert Kuipers
its kind of handy to be able to convert stuff to breeze... is there some other way i am supposed to access that functionality?

[ANNOUNCE] Flambo - A Clojure DSL for Apache Spark

2014-07-01 Thread Soren Macbeth
Yieldbot is pleased to announce the release of Flambo, our Clojure DSL for Apache Spark. Flambo allows one to write spark applications in pure Clojure as an alternative to Scala, Java and Python currently available in Spark. We have already written a substantial amount of internal code in

Re: why is toBreeze private everywhere in mllib?

2014-07-01 Thread Xiangrui Meng
We were not ready to expose it as a public API in v1.0. Both breeze and MLlib are in rapid development. It would be possible to expose it as a developer API in v1.1. For now, it should be easy to define a toBreeze method in your own project. -Xiangrui On Tue, Jul 1, 2014 at 12:17 PM, Koert
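Defining your own toBreeze boils down to Scala's implicit-class ("enrich my library") pattern. The types below are illustrative stand-ins; in a real project the conversion would go from org.apache.spark.mllib.linalg.Vector to breeze.linalg.DenseVector:

```scala
// Stand-in types; swap in the real mllib and breeze vector types.
final case class MLVector(values: Array[Double])
final case class BVector(data: Array[Double])

// "Enrich my library": adds a toBreeze method without touching the original class.
implicit class VectorOps(v: MLVector) {
  def toBreeze: BVector = BVector(v.values.clone())
}

val converted = MLVector(Array(1.0, 2.0, 3.0)).toBreeze
```

Keeping the implicit in your own package object means no change is needed when (if) MLlib later exposes its own conversion.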

Re: Spark SQL : Join throws exception

2014-07-01 Thread Yin Huai
Seems it is a bug. I have opened https://issues.apache.org/jira/browse/SPARK-2339 to track it. Thank you for reporting it. Yin On Tue, Jul 1, 2014 at 12:06 PM, Subacini B subac...@gmail.com wrote: Hi All, Running this join query sql(SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE

spark-submit script and spark.files.userClassPathFirst

2014-07-01 Thread _soumya_
Hi, I'm trying to get rid of an error (NoSuchMethodError) while using Amazon's s3 client on Spark. I'm using the Spark Submit script to run my code. Reading about my options and other threads, it seemed the most logical way would be to make sure my jar is loaded first. Spark submit on debug shows

slf4j multiple bindings

2014-07-01 Thread Bill Jay
Hi all, I have an issue with multiple slf4j bindings. My program was running correctly. I just added the new dependency kryo. And when I submitted a job, the job was killed because of the following error messages: *SLF4J: Class path contains multiple SLF4J bindings.* The log said there were

Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mohammed Guller
I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3 workers). Our app is fetching data from Cassandra and doing a basic filter, map, and countByKey on that data. I have run into a strange problem. Even if the number of rows in Cassandra is just 1M, the Spark job seems

Re: multiple passes in mapPartitions

2014-07-01 Thread Chris Fregly
also, multiple calls to mapPartitions() will be pipelined by the spark execution engine into a single stage, so the overhead is minimal. On Fri, Jun 13, 2014 at 9:28 PM, zhen z...@latrobe.edu.au wrote: Thank you for your suggestion. We will try it out and see how it performs. We think the

Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Yana Kadiyska
A lot of things can get funny when you run distributed as opposed to local -- e.g. some jar not making it over. Do you see anything of interest in the log on the executor machines -- I'm guessing 192.168.222.152/192.168.222.164. From here

Re: Fw: How Spark Choose Worker Nodes for respective HDFS block

2014-07-01 Thread Chris Fregly
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL) where possible, just like MapReduce. it's a best practice to co-locate your Spark Workers on the same nodes as your HDFS Data Nodes for just this reason. this is achieved through the RDD.preferredLocations() interface

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Marco Shaw
They are recorded... For example, 2013: http://spark-summit.org/2013 I'm assuming the 2014 videos will be up in 1-2 weeks. Marco On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Are these sessions recorded ? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far Kafka Dstream can read on topic-partition (via offset for example). By setting this to a small number, it will force DStream to read less data initially. Please see the post at

Re: Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Matei Zaharia
I’d suggest asking the IBM Hadoop folks, but my guess is that the library cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this exception is happening in your driver program, the driver program’s java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Awesome. Just want to catch up on some sessions from other tracks. Learned a ton over the last two days. Thanks Soumya On Jul 1, 2014, at 8:50 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, we’re going to try to get the videos up as soon as possible. Matei On Jul 1,

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread Aaron Davidson
In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of that kind? This error suggests the worker is trying to bind a server on the master's IP, which clearly doesn't make sense. On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I did

Re: multiple passes in mapPartitions

2014-07-01 Thread Frank Austin Nothaft
Hi Zhen, The Scala iterator trait supports cloning via the duplicate method (http://www.scala-lang.org/api/current/index.html#scala.collection.Iterator@duplicate:(Iterator[A],Iterator[A])). Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 13,
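A quick self-contained illustration of duplicate: each clone can be traversed once, giving two passes over the same underlying elements (at the cost of buffering whatever one clone reads ahead of the other):

```scala
val it = Iterator(1, 2, 3, 4)
val (first, second) = it.duplicate // do not use `it` itself after this

val sum   = first.sum   // first pass consumes the first clone
val count = second.size // second pass over the same elements via the clone
```

Inside mapPartitions this lets you make two passes over a partition's iterator, though if the passes diverge widely the buffered gap lives in memory.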

Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mayur Rustagi
It could be because you are out of memory on the worker nodes and blocks are not getting registered. An older issue with 0.6.0 was dead nodes causing loss of a task, then resubmission of data in an infinite loop; it was fixed in 0.7.0 though. Are you seeing a crash log in this log, or in the