Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread MEETHU MATHEW
Hi,

I did netstat -na | grep 192.168.125.174 and it's showing 192.168.125.174:7077
LISTEN (after starting the master).

I tried to execute the following script from the slaves manually, but it ends up
with the same exception and log. This script internally executes the java
command.
 /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077
In this case netstat is not showing any connection established to master:7077.

When we manually execute the java command, the connection to the master gets
established.

Thanks & Regards, 
Meethu M


On Monday, 30 June 2014 6:38 PM, Akhil Das  wrote:
 


Are you sure the IP 192.168.125.174 is bound on that machine? (netstat
-na | grep 192.168.125.174)


Thanks
Best Regards


On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW  wrote:

Hi all,
>
>
>I reinstalled Spark and rebooted the system, but I am still not able to start the
>workers. It's throwing the following exception:
>
>
>Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to 
>bind to: master/192.168.125.174:0
>
>
>I suspect the problem is with 192.168.125.174:0. Even though the command contains
>master:7077, why is it showing 0 in the log?
>
>
>java -cp 
>::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
>org.apache.spark.deploy.worker.Worker spark://master:7077
>
>
>Can somebody tell me a solution?
> 
>Thanks & Regards, 
>Meethu M
>
>
>
>On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW  wrote:
> 
>
>
>Hi,
>Yes, I tried setting another port as well, but I get the same problem.
>master is set in /etc/hosts
> 
>Thanks & Regards, 
>Meethu M
>
>
>
>On Friday, 27 June 2014 3:23 PM, Akhil Das  wrote:
> 
>
>
>That's strange. Did you try setting the master port to something else (using
>SPARK_MASTER_PORT)?
>
>
>Also, you said you are able to start it from the java command line:
>
>
>java -cp 
>::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
>org.apache.spark.deploy.worker.Worker spark://:master:7077
>
>
>
>What is the master IP specified here? Do you have an entry for master in
>/etc/hosts?
>
>
>Thanks
>Best Regards
>
>
>On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW  wrote:
>
>Hi Akhil,
>>
>>
>>I am running it in a LAN itself. The IP of the master is given correctly.
>> 
>>Thanks & Regards, 
>>Meethu M
>>
>>
>>
>>On Friday, 27 June 2014 2:51 PM, Akhil Das  wrote:
>> 
>>
>>
>>why is it binding to port 0? 192.168.125.174:0 :/
>>
>>
>>Check the IP address of that master machine (ifconfig); it looks like the IP
>>address has been changed (assuming you are running these machines on a LAN).
>>
>>
>>Thanks
>>Best Regards
>>
>>
>>On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW  
>>wrote:
>>
>>Hi all,
>>>
>>>
>>>My Spark (standalone mode) was running fine till yesterday. But now I am
>>>getting the following exception when I run start-slaves.sh or
>>>start-all.sh:
>>>
>>>
>>>slave3: failed to launch org.apache.spark.deploy.worker.Worker:
>>>slave3:   at 
>>>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>slave3:   at java.lang.Thread.run(Thread.java:662)
>>>
>>>
>>>The log files have the following lines.
>>>
>>>
>>>14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: 
>>>org/apache/spark/log4j-defaults.properties
>>>14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
>>>14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication 
>>>disabled; ui acls disabled; users with view permissions: Set(hduser)
>>>14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
>>>14/06/27 11:06:30 INFO Remoting: Starting remoting
>>>Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed 
>>>to bind to: master/192.168.125.174:0
>>>at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
>>>...
>>>Caused by: java.net.BindException: Cannot assign requested address
>>>...
>>>I saw the same error reported before and have tried the following solutions.
>>>
>>>
>>>Set the variable SPARK_LOCAL_IP and changed SPARK_MASTER_PORT to a
>>>different number. But nothing is working.
>>>
>>>
>>>When I try to start the worker from the respective machines using the
>>>following java command, it runs without any exception:
>>>
>>>
>>>java -cp 
>>>::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
>>> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
>>>org.apache.spark.deploy.worker.Worker spark://:master:7077
>>>
>>>
>>>
>>>Somebody please suggest a solution.
>>> 
>>>Thanks & Regards, 
>>>Meethu M
>>
>>
>>
>
>
>
>
>

Re: Serialization of objects

2014-07-01 Thread Aaron Davidson
If you want to stick with Java serialization and need to serialize a
non-Serializable object, your best choices are probably to either subclass
it with a Serializable one or wrap it in a class of your own which
implements its own writeObject/readObject methods (see here:
http://stackoverflow.com/questions/6163872/how-to-serialize-a-non-serializable-in-java
)

Otherwise you can use Kryo to register custom serializers for other
people's objects.
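
(For illustration, a minimal sketch of the wrapper approach, assuming a
hypothetical non-serializable third-party class ThirdPartyThing whose state can
be rebuilt from a String; the names are illustrative, not from any real
library:)

import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

// Hypothetical non-serializable third-party class, rebuildable from a String.
class ThirdPartyThing(val payload: String)

// Serializable wrapper: the wrapped object is transient, and custom
// writeObject/readObject persist just enough state to rebuild it.
class ThirdPartyWrapper(@transient private var thing: ThirdPartyThing)
    extends Serializable {

  def get: ThirdPartyThing = thing

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    out.writeUTF(thing.payload)               // write only the rebuildable state
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    thing = new ThirdPartyThing(in.readUTF()) // rebuild on deserialization
  }
}

A Kryo custom serializer (the other option above) follows the same idea: you
tell the serializer how to write and rebuild the object's state yourself.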


On Mon, Jun 30, 2014 at 1:52 PM, Sameer Tilak  wrote:

> Hi everyone,
> I was able to solve this issue. For now I changed the library code and
> added the following to the class com.wcohen.ss.BasicStringWrapper:
>
> public class BasicStringWrapper implements  Serializable
>
> However, I am still curious to know how to get around the issue when you
> don't have access to the code and you are using a 3rd party jar.
>
>
> --
> From: ssti...@live.com
> To: u...@spark.incubator.apache.org
> Subject: Serialization of objects
> Date: Thu, 26 Jun 2014 09:30:31 -0700
>
>
> Hi everyone,
>
> Aaron, thanks for your help so far. I am trying to serialize objects that
> I instantiate from a 3rd party library, namely instances of com.wcohen.ss.Jaccard
> and com.wcohen.ss.BasicStringWrapper. However, I am having problems with
> serialization. I am (at least trying to be) using Kryo for serialization, but I
> am still facing the serialization issue. I get "org.apache.spark.SparkException:
> Job aborted due to stage failure: Task not serializable:
> java.io.NotSerializableException: com.wcohen.ss.BasicStringWrapper". Any
> help with this will be great.
> Scala code:
>
> package approxstrmatch
>
> import com.wcohen.ss.BasicStringWrapper;
> import com.wcohen.ss.Jaccard;
>
> import java.util.Iterator;
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
>
> import org.apache.spark.rdd;
> import org.apache.spark.rdd.RDD;
>
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.serializer.KryoRegistrator
>
> class MyRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo) {
> kryo.register(classOf[approxstrmatch.JaccardScore])
> kryo.register(classOf[com.wcohen.ss.BasicStringWrapper])
> kryo.register(classOf[com.wcohen.ss.Jaccard])
>
>   }
> }
>
>  class JaccardScore  {
>
>   val mjc = new Jaccard()  with Serializable
>   val conf = new
> SparkConf().setMaster("spark://pzxnvm2018:7077").setAppName("ApproxStrMatch")
>   conf.set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>   conf.set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")
>
>   val sc = new SparkContext(conf)
>
>   def calculateScoreSecond (sourcerdd: RDD[String], destrdd: RDD[String])
>  {
>   val jc_ = this.mjc
>
>   var i: Int = 0
>   for (sentence <- sourcerdd.toLocalIterator)
>{val str1 = new BasicStringWrapper (sentence)
> var scorevector = destrdd.map(x => jc_.score(str1, new
> BasicStringWrapper(x)))
> val fileName = new
> String("/apps/software/scala-approsstrmatch-sentence" + i)
> scorevector.saveAsTextFile(fileName)
> i += 1
>}
>
>   }
>
> Here is the script:
>  val distFile = sc.textFile("hdfs://serverip:54310/data/dummy/sample.txt");
>  val srcFile = sc.textFile("hdfs://serverip:54310/data/dummy/test.txt");
>  val score = new approxstrmatch.JaccardScore()
>  score.calculateScoreSecond(srcFile, distFile)
>
> O/P:
>
> 14/06/25 12:32:05 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at
> textFile at <console>:12), which has no missing parents
> 14/06/25 12:32:05 INFO DAGScheduler: Submitting 1 missing tasks from Stage
> 0 (MappedRDD[3] at textFile at <console>:12)
> 14/06/25 12:32:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/06/25 12:32:05 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/06/25 12:32:05 INFO TaskSetManager: Serialized task 0.0:0 as 1879 bytes
> in 4 ms
> 14/06/25 12:32:05 INFO Executor: Running task ID 0
> 14/06/25 12:32:05 INFO Executor: Fetching
> http://serverip:47417/jars/approxstrmatch.jar with timestamp 1403724701564
> 14/06/25 12:32:05 INFO Utils: Fetching
> http://serverip:47417/jars/approxstrmatch.jar to
> /tmp/fetchFileTemp8194323811657370518.tmp
> 14/06/25 12:32:05 INFO Executor: Adding
> file:/tmp/spark-397828b5-3e0e-4bb4-b56b-58895eb4d6df/approxstrmatch.jar to
> class loader
> 14/06/25 12:32:05 INFO Executor: Fetching
> http://serverip:47417/jars/secondstring-20140618.jar with timestamp
> 1403724701562
> 14/06/25 12:32:05 INFO Utils: Fetching
> http://serverip:47417/jars/secondstring-20140618.jar to
> /tmp/fetchFileTemp8711755318201511766.tmp
> 14/06/25 12:32:06 INFO Executor: Adding
> file:/tmp/spark-397828b5-3e0e-4bb4-b56b-58895eb4d6df/secondstring-20140618.jar
> to class loader
> 14/06/25 12:32:06 INFO BlockManager: Found block broadcast_1 locally
> 14/06/25 12:32:06 INFO HadoopRDD: Input split:
> hdfs://serveri

build spark assign version number myself?

2014-07-01 Thread majian
Hi,all:
 I'm compiling Spark by executing './make-distribution.sh --hadoop
0.20.205.0 --tgz'.
 After the compilation completes, I found that the default version
number is "1.1.0-SNAPSHOT", i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz.
 Who knows how to assign the version number myself, for example
spark-1.1.0-company-bin-0.20.205.tgz?


Thanks,
majian


issue with running example code

2014-07-01 Thread Gurvinder Singh
Hi,

I am having an issue running the Scala example code. I have tested and am able
to run the Python example code successfully, but when I run the Scala code I
get this error:

java.lang.ClassCastException: cannot assign instance of
org.apache.spark.examples.SparkPi$$anonfun$1 to field
org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
org.apache.spark.rdd.MappedRDD

I have compiled Spark directly from GitHub and am running it with the
following command:

spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
--class org.apache.spark.examples.SparkPi 5 --jars
/usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar

Any suggestions will be helpful.

Thanks,
Gurvinder


Questions about disk IOs

2014-07-01 Thread Charles Li
Hi Spark,

I am running LBFGS on our user data. The data size with Kryo serialisation is 
about 210G. The weight size is around 1,300,000. I am quite confused that the 
performance is very close whether the data is cached or not.

The program is simple:
points = sc.hadoopFile(int, SequenceFileInputFormat.class …..)
points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment it out if not cached
gradient = new LogisticGradient();
updater = new SquaredL2Updater();
initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
convergeTol, maxIter, regParam, initWeight);

I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its cluster 
mode. Below are some arguments I am using:
—executor-memory 10G
—num-executors 50
—executor-cores 2

Storage Using:
When caching:
Cached Partitions 951
Fraction Cached 100%
Size in Memory 215.7GB
Size in Tachyon 0.0B
Size on Disk 1029.7MB

The time cost by every aggregate is around 5 minutes with cache enabled. Lots 
of disk IOs can be seen on the hadoop node. I have the same result with cache 
disabled.

Should data points caching improve the performance? Should caching decrease the 
disk IO?

Thanks in advance.

Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
You can specify a custom name with the --name option. It will still contain
1.1.0-SNAPSHOT, but at least you can specify your company name.

If you want to replace SNAPSHOT with your company name, you will have to
edit make-distribution.sh and replace the following line:

VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep
-v "INFO" | tail -n 1)

with something like

COMPANYNAME="SoullessMegaCorp"
VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep
-v "INFO" | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME')

and so on for other packages that use their own version scheme.




On Tue, Jul 1, 2014 at 9:21 AM, majian  wrote:

>   Hi,all:
>
>  I'm compiling Spark by executing './make-distribution.sh --hadoop
> 0.20.205.0 --tgz'.
>  After the compilation completes, I found that the default version
> number is "1.1.0-SNAPSHOT",
>  i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz.
>  Who knows how to assign the version number myself,
> for example spark-1.1.0-company-bin-0.20.205.tgz?
>
>
> Thanks,
> majian
>
>


Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
Sorry, there's a typo in my previous post, the line should read:
VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep
-v "INFO" | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME/g')


On Tue, Jul 1, 2014 at 10:35 AM, Guillaume Ballet  wrote:

> You can specify a custom name with the --name option. It will still
> contain 1.1.0-SNAPSHOT, but at least you can specify your company name.
>
> If you want to replace SNAPSHOT with your company name, you will have to
> edit make-distribution.sh and replace the following line:
>
> VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null |
> grep -v "INFO" | tail -n 1)
>
> with something like
>
> COMPANYNAME="SoullessMegaCorp"
> VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null |
> grep -v "INFO" | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME')
>
> and so on for other packages that use their own version scheme.
>
>
>
>
> On Tue, Jul 1, 2014 at 9:21 AM, majian  wrote:
>
>>   Hi,all:
>>
>>  I'm compiling Spark by executing './make-distribution.sh --hadoop
>> 0.20.205.0 --tgz'.
>>  After the compilation completes, I found that the default version
>> number is "1.1.0-SNAPSHOT",
>>  i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz.
>>  Who knows how to assign the version number myself,
>> for example spark-1.1.0-company-bin-0.20.205.tgz?
>>
>>
>> Thanks,
>> majian
>>
>>
>
>


RSpark installation on Windows

2014-07-01 Thread Stuti Awasthi
Hi All

Can we install RSpark on a Windows setup of R and use it to access a remote
Spark cluster?

Thanks
Stuti Awasthi






Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi,

The window size in Spark Streaming is time based, which means we have a
different number of elements in each window. For example, suppose you have two
streams (there might be more) which are related to each other and you want to
compare them in a specific time interval. I am not clear how this will work.
Although they start running simultaneously, they might have a different number
of elements in each time interval.
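
(For concreteness, here is a minimal sketch of the time-based windowing I am
describing, assuming a DStream[Double] named values read from a placeholder
socket source and launched with spark-submit; the window boundary is wall-clock
time, so the element count per window simply follows the arrival rate:)

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("window-sketch"), Seconds(1))
    // Placeholder source: one numeric value per line.
    val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

    // 10-second window sliding every 10 seconds: the cut is by time, not by
    // element count, so two streams with different rates end up with
    // different counts per window.
    values.window(Seconds(10), Seconds(10)).foreachRDD { rdd =>
      val n = rdd.count()
      if (n > 0) println((n, (rdd.mean(), rdd.variance(), rdd.stdev())))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}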

The following is the output for two streams which have the same number of
elements and ran simultaneously. The leftmost value is the number of elements
in each window. If we add up the number of elements, they are the same for both
streams, but we can't compare the two streams as they differ in window size and
number of windows.

Can we somehow make windows based on real time values for both streams? Or can
we make windows based on the number of elements?

(n, (mean, variance, SD))

Stream 1

(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))


Stream 2

(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))


Regards,
Laeeq

Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-01 Thread Honey Joshi
Hi,
I am trying to run a project which takes data as a DStream and dumps the
data into a Shark table after various operations. I am getting the
following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task 0.0:0 failed 1 times (most recent failure: Exception failure:
java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot
be cast to org.apache.spark.rdd.HadoopPartition)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Can someone please explain the cause of this error? I am also using a
SparkContext alongside the existing StreamingContext.


Window Size

2014-07-01 Thread Laeeq Ahmed
Hi,

The window size in Spark Streaming is time based, which means we have a
different number of elements in each window. For example, suppose you have two
streams (there might be more) which are related to each other and you want to
compare them in a specific time interval. I am not clear how this will work.
Although they start running simultaneously, they might have a different number
of elements in each time interval.

The following is the output for two streams which have the same number of
elements and ran simultaneously. The leftmost value is the number of elements
in each window. If we add up the number of elements, they are the same for both
streams, but we can't compare the two streams as they differ in window size and
number of windows.

Can we somehow make windows based on real time values for both streams? Or can
we make windows based on the number of elements?

(n, (mean, variance, SD))

Stream 1

(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))


Stream 2

(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))


Regards,
Laeeq


java.io.FileNotFoundException: http:///broadcast_1

2014-07-01 Thread Honey Joshi
Hi All,

We are using a Shark table to dump the data, and we are getting the following
error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task 1.0:0 failed 1 times (most recent failure: Exception failure:
java.io.FileNotFoundException: http:///broadcast_1)

We don't know where the error is coming from. Can anyone please explain the
cause of this error and how to handle it? The spark.cleaner.ttl is set
to 4600, which I guess is more than enough to run the application.
Spark Version : 0.9.0-incubating
Shark : 0.9.0 - SNAPSHOT
Scala : 2.10.3

Thank You
Honey Joshi
Ideata Analytics





SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Wanda Hawk
Hello,

I have installed spark-1.0.0 with scala 2.10.3. I have built Spark with "sbt/sbt
assembly" and added
"/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
to my CLASSPATH variable.
Then I went to
"../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples", created a
new directory "classes" and compiled SparkKMeans.scala with "scalac -d classes/
SparkKMeans.scala".
Then I navigated to "classes" (I commented out this line in the scala file:
package org.apache.spark.examples ) and tried to run it with "java -cp .
SparkKMeans", and I got the following error:
"Exception in thread "main" java.lang.NoClassDefFoundError: breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 6 more
"
The jar under
"/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
contains the breeze/linalg/Vector* path. I even tried to unpack it and put it
in the CLASSPATH, but it does not seem to pick it up.


I am currently running java 1.8
"java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)"

What am I doing wrong?

Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW


 Hi ,

I am using Spark standalone mode with one master and 2 slaves. I am not able to
start the workers and connect them to the master using


./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077

The log says

Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/x.x.x.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address

When I try to start the worker from the slaves using the following java
command, it runs without any exception:

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077





Thanks & Regards, 
Meethu M

Re: Failed to launch Worker

2014-07-01 Thread Akhil Das
Is this command working??

java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/
assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
-XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077

Thanks
Best Regards


On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW 
wrote:

>
>  Hi ,
>
> I am using Spark standalone mode with one master and 2 slaves. I am not
>  able to start the workers and connect them to the master using
>
> ./bin/spark-class org.apache.spark.deploy.worker.Worker
> spark://x.x.x.174:7077
>
> The log says
>
> Exception in thread "main" org.jboss.netty.channel.ChannelException:
> Failed to bind to: master/x.x.x.174:0
>  at
> org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
>  ...
> Caused by: java.net.BindException: Cannot assign requested address
>
> When I try to start the worker from the slaves using the following java
> command, it runs without any exception:
>
> java -cp
> ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.worker.Worker spark://:master:7077
>
>
>
>
> Thanks & Regards,
> Meethu M
>


Re: Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW
Yes.
 
Thanks & Regards, 
Meethu M


On Tuesday, 1 July 2014 6:14 PM, Akhil Das  wrote:
 


Is this command working??

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077



Thanks
Best Regards


On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW  wrote:


>
> Hi ,
>
>
>I am using Spark standalone mode with one master and 2 slaves. I am not able
>to start the workers and connect them to the master using
>
>
>./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077
>
>
>The log says
>
>
>Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to 
>bind to: master/x.x.x.174:0
>at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
>...
>Caused by: java.net.BindException: Cannot assign requested address
>
>
>When I try to start the worker from the slaves using the following java
>command, it runs without any exception:
>
>
>java -cp 
>::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
>org.apache.spark.deploy.worker.Worker spark://:master:7077
>
>
>
>
>
>
>
>
>
>Thanks & Regards, 
>Meethu M

RE: Spark 1.0 and Logistic Regression Python Example

2014-07-01 Thread Sam Jacobs
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade 
the numpy/python for a permanent fix. My current versions of python and numpy 
are 2.6 and 4.1.9 respectively.

Thanks,

Sam  

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Tuesday, July 01, 2014 12:14 AM
To: user@spark.apache.org
Subject: Re: Spark 1.0 and Logistic Regression Python Example

You were using an old version of numpy, 1.4? I think this is fixed in the 
latest master. Try to replace vec.dot(target) by numpy.dot(vec, target), or use 
the latest master. -Xiangrui

On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs  wrote:
> Hi,
>
>
> I modified the example code for logistic regression to compute the 
> error in classification. Please see below. However the code is failing 
> when it makes a call to:
>
>
> labelsAndPreds.filter(lambda (v, p): v != p).count()
>
>
> with the error message (something related to numpy or dot product):
>
>
> File 
> "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py",
> line 65, in predict
>
> margin = _dot(x, self._coeff) + self._intercept
>
>   File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py", 
> line 443, in _dot
>
> return vec.dot(target)
>
> AttributeError: 'numpy.ndarray' object has no attribute 'dot'
>
>
> FYI, I am running the code using spark-submit i.e.
>
>
> ./bin/spark-submit 
> examples/src/main/python/mllib/logistic_regression2.py
>
>
>
> The code is posted below if it will be useful in any way:
>
>
> from math import exp
>
> import sys
> import time
>
> from pyspark import SparkContext
>
> from pyspark.mllib.classification import LogisticRegressionWithSGD 
> from pyspark.mllib.regression import LabeledPoint from numpy import 
> array
>
>
> # Load and parse the data
> def parsePoint(line):
> values = [float(x) for x in line.split(',')]
> if values[0] == -1:   # Convert -1 labels to 0 for MLlib
> values[0] = 0
> return LabeledPoint(values[0], values[1:])
>
> sc = SparkContext(appName="PythonLR")
> # start timing
> start = time.time()
> #start = time.clock()
>
> data = sc.textFile("sWAMSpark_train.csv")
> parsedData = data.map(parsePoint)
>
> # Build the model
> model = LogisticRegressionWithSGD.train(parsedData)
>
> #load test data
>
> testdata = sc.textFile("sWSpark_test.csv") parsedTestData = 
> testdata.map(parsePoint)
>
> # Evaluating the model on test data
> labelsAndPreds = parsedTestData.map(lambda p: (p.label,
> model.predict(p.features)))
> trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
> float(parsedData.count())
> print("Training Error = " + str(trainErr)) end = time.time() 
> print("Time is = " + str(end - start))
>
>
>
>
>
>
>


Re: Changing log level of spark

2014-07-01 Thread Philip Limbeck
We changed the loglevel to DEBUG by replacing every INFO with DEBUG in
/root/ephemeral-hdfs/conf/log4j.properties and propagating it to the
cluster. There is some DEBUG output visible in both master and worker but
nothing really interesting regarding stages or scheduling. Since we
expected a little more than that, there could be 2 possibilities:
  a) There is still some other unknown way to set the log level to debug
  b) There is not that much log output to be expected in this direction; I
looked for "logDebug" (the log wrapper in Spark) on GitHub with 84 results,
which makes me doubt that there is much more to expect.

We actually just want to have a little more insight into the system
behavior especially when using Shark since we ran into some serious
concurrency issues with blocking queries. So much for the background why
this is important to us.
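
(For what it's worth, as an alternative to editing log4j.properties on every
node, the level can also be raised programmatically through the log4j 1.x API
that Spark uses; a minimal sketch, which only affects the JVM it runs in, i.e.
the driver:)

import org.apache.log4j.{Level, Logger}

// Raise Spark's log level in this JVM; executors still follow whatever
// log4j configuration is on their own classpath.
Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.scheduler").setLevel(Level.DEBUG)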


On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson  wrote:

> If you're using the spark-ec2 scripts, you may have to change
> /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
> is added to the classpath before Spark's own conf.
>
>
> On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer  wrote:
>
>> I have a log4j.xml in src/main/resources with
>>
>> 
>> 
>> http://jakarta.apache.org/log4j/";>
>> [...]
>> 
>> 
>> 
>> 
>> 
>>
>> and that is included in the jar I package with `sbt assembly`. That
>> works fine for me, at least on the driver.
>>
>> Tobias
>>
>> On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck 
>> wrote:
>> > Hi!
>> >
>> > According to
>> >
>> https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging
>> ,
>> > changing log-level is just a matter of creating a log4j.properties
>> (which is
>> > in the classpath of spark) and changing log level there for the root
>> logger.
>> > I did this steps on every node in the cluster (master and worker nodes).
>> > However, after restart there is still no debug output as desired, but
>> only
>> > the default info log level.
>>
>
>


difference between worker and slave nodes

2014-07-01 Thread aminn_524

 
Can anyone explain to me what the difference is between a worker and a slave? I
have one master and two slaves which are connected to each other. By using the
jps command I can see Master on the master node and Worker on the slave nodes,
but I don't have any Worker on my master when using this command:
/bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077
I thought the slaves would be acting as workers.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-worker-and-slave-nodes-tp8578.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Changing log level of spark

2014-07-01 Thread Surendranauth Hiraman
One thing we ran into was that there was another log4j.properties earlier
in the classpath. For us, it was in our MapR/Hadoop conf.

If that is the case, something like the following could help you track it
down. The only thing to watch out for is that you might have to walk up the
classloader hierarchy.

ClassLoader cl = Thread.currentThread().getContextClassLoader();
URL loc = cl.getResource("log4j.properties");
System.out.println(loc);

-Suren




On Tue, Jul 1, 2014 at 9:20 AM, Philip Limbeck 
wrote:

> We changed the loglevel to DEBUG by replacing every INFO with DEBUG in
> /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the
> cluster. There is some DEBUG output visible in both master and worker but
> nothing really interesting regarding stages or scheduling. Since we
> expected a little more than that, there could be 2 possibilities:
>   a) There is still some other unknown way to set the log level to debug
>   b) There is not that much log output to be expected in this direction; I
> looked for "logDebug" (the log wrapper in Spark) on GitHub with 84 results,
> which makes me doubt that there is much more to expect.
>
> We actually just want to have a little more insight into the system
> behavior especially when using Shark since we ran into some serious
> concurrency issues with blocking queries. So much for the background why
> this is important to us.
>
>
> On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson 
> wrote:
>
>> If you're using the spark-ec2 scripts, you may have to change
>> /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
>> is added to the classpath before Spark's own conf.
>>
>>
>> On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer 
>> wrote:
>>
>>> I have a log4j.xml in src/main/resources with
>>>
>>> 
>>> 
>>> http://jakarta.apache.org/log4j/";>
>>> [...]
>>> 
>>> 
>>> 
>>> 
>>> 
>>>
>>> and that is included in the jar I package with `sbt assembly`. That
>>> works fine for me, at least on the driver.
>>>
>>> Tobias
>>>
>>> On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck 
>>> wrote:
>>> > Hi!
>>> >
>>> > According to
>>> >
>>> https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging
>>> ,
>>> > changing log-level is just a matter of creating a log4j.properties
>>> (which is
>>> > in the classpath of spark) and changing log level there for the root
>>> logger.
>>> > I did this steps on every node in the cluster (master and worker
>>> nodes).
>>> > However, after restart there is still no debug output as desired, but
>>> only
>>> > the default info log level.
>>>
>>
>>
>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Re: Changing log level of spark

2014-07-01 Thread Yana Kadiyska
Are you looking at the driver log? (e.g. Shark?). I see a ton of
information in the INFO category on what query is being started, what
stage is starting and which executor stuff is sent to. So I'm not sure
if you're saying you see all that and you need more, or that you're
not seeing this type of information. I cannot speak to the ec2 setup,
just pointing out that under 0.9.1 I see quite a bit of scheduling
information in the driver log.

On Tue, Jul 1, 2014 at 9:20 AM, Philip Limbeck  wrote:
> We changed the loglevel to DEBUG by replacing every INFO with DEBUG in
> /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the
> cluster. There is some DEBUG output visible in both master and worker but
> nothing really interesting regarding stages or scheduling. Since we expected
> a little more than that, there could be 2 possibilities:
>   a) There is still some other unknown way to set the log level to debug
>   b) There is not that much log output to be expected in this direction; I
> looked for "logDebug" (the log wrapper in Spark) on GitHub with 84 results,
> which makes me doubt that there is much more to expect.
>
> We actually just want to have a little more insight into the system behavior
> especially when using Shark since we ran into some serious concurrency
> issues with blocking queries. So much for the background why this is
> important to us.
>
>
> On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson  wrote:
>>
>> If you're using the spark-ec2 scripts, you may have to change
>> /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
>> is added to the classpath before Spark's own conf.
>>
>>
>> On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer  wrote:
>>>
>>> I have a log4j.xml in src/main/resources with
>>>
>>> 
>>> 
>>> http://jakarta.apache.org/log4j/";>
>>> [...]
>>> 
>>> 
>>> 
>>> 
>>> 
>>>
>>> and that is included in the jar I package with `sbt assembly`. That
>>> works fine for me, at least on the driver.
>>>
>>> Tobias
>>>
>>> On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck 
>>> wrote:
>>> > Hi!
>>> >
>>> > According to
>>> >
>>> > https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging,
>>> > changing log-level is just a matter of creating a log4j.properties
>>> > (which is
>>> > in the classpath of spark) and changing log level there for the root
>>> > logger.
>>> > I did this steps on every node in the cluster (master and worker
>>> > nodes).
>>> > However, after restart there is still no debug output as desired, but
>>> > only
>>> > the default info log level.
>>
>>
>


Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Uddin, Nasir M.
Dear Spark Users:

Spark 1.0 has been installed in standalone mode, but it can't read any compressed
(CMX/Snappy) or sequence files residing on HDFS (it can read uncompressed files
from HDFS). The key notable message is: "Unable to load native-hadoop
library". Other related messages are:

Caused by: java.lang.IllegalStateException: Cannot load
com.ibm.biginsights.compress.CmxDecompressor without native library! at
com.ibm.biginsights.compress.CmxDecompressor.<init>(CmxDecompressor.java:65)

Here is the key part of core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.ibm.biginsights.compress.CmxCodec</value>
</property>

Here is spark-env.sh:
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=10g
export SCALA_HOME=/opt/spark/scala-2.11.1
export JAVA_HOME=/opt/spark/jdk1.7.0_55
export SPARK_HOME=/opt/spark/spark-0.9.1-bin-hadoop2
export ADD_JARS=/opt/IHC/lib/compression.jar
export SPARK_CLASSPATH=/opt/IHC/lib/compression.jar
export SPARK_LIBRARY_PATH=/opt/IHC/lib/native/Linux-amd64-64/
export SPARK_MASTER_WEBUI_PORT=1080
export HADOOP_CONF_DIR=/opt/IHC/hadoop-conf

Note: core-site.xml and hdfs-site.xml are in hadoop-conf. CMX is an IBM branded 
splittable LZO based compression codec.

Any help to resolve the issue is appreciated.

Thanks,
Nasir


Re: Spark Streaming question batch size

2014-07-01 Thread Yana Kadiyska
Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define "real time values for both
streams"? I am trying to do something similar to you, I think -- but
I'm not clear on what your notion of time is.
My reading of your example above is that the streams just pump data in
at different rates -- first one got 7462 points in the first batch
interval, whereas stream2 saw 10493

On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed  wrote:
> Hi,
>
> The window size in Spark Streaming is time based, which means we have a
> different number of elements in each window. For example, suppose you have two
> streams (there might be more) which are related to each other and you want to
> compare them in a specific time interval. I am not clear how this will work.
> Although they start running simultaneously, they might have a different number
> of elements in each time interval.
>
> The following is the output for two streams which have the same number of
> elements and ran simultaneously. The leftmost value is the number of elements
> in each window. If we add up the number of elements, they are the same for both
> streams, but we can't compare the two streams as they differ in window size
> and number of windows.
>
> Can we somehow make windows based on real time values for both streams? Or
> can we make windows based on the number of elements?
>
> (n, (mean, variance, SD))
> Stream 1
> (7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
> (44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
> (245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
> (154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
> (156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
> (156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
> (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))
>
> Stream 2
> (10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
> (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
> (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
> (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
> (269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
> (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))
>
> Regards,
> Laeeq
>
>


Re: Question about VD and ED

2014-07-01 Thread Baoxu Shi(Dash)
Hi Bin,

VD and ED are type parameters (carrying ClassTags); you can treat them as
placeholders, similar to a template parameter T in C++ (the analogy is not 100%
exact).

You do not need to convert Graph[String, Double] to Graph[VD, ED].

Checking ClassTag's definition in Scala could help.
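
A minimal sketch of what I mean (the helper name is just for illustration):

import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// A helper written against the generic Graph[VD, ED]; VD and ED are only
// placeholders bound by ClassTag, so a concrete Graph[String, Double]
// can be passed in directly -- no conversion needed.
def describe[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): String =
  s"${g.vertices.count()} vertices, ${g.edges.count()} edges"

// usage: given val graph: Graph[String, Double] = ..., just call
// println(describe(graph))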

Best,

On Jul 1, 2014, at 4:49 AM, Bin WU  wrote:

> Hi all,
> 
> I am a newbie to graphx.
> 
> I am currently having trouble understanding the types "VD" and "ED". I
> notice that "VD" and "ED" are widely used in the graphx implementation, but I
> don't know why and how I am supposed to use them.
> 
> Specifically, say I have constructed a graph "graph : Graph[String, Double]", 
> when, why, and how should I transform it into the type Graph[VD, ED]?
> 
> Also, I don't know what package I should import. I have imported:
> 
> "
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD"
> 
> But the two types "VD" and "ED" are still not found.
> 
> Sorry for the stupid question.
> 
> Thanks in advance!
> 
> Best,
> Ben



Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Xiangrui Meng
You can use either bin/run-example or bin/spark-submit to run the example
code. "scalac -d classes/ SparkKMeans.scala" doesn't pick up the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui

On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk  wrote:
> Hello,
>
> I have installed spark-1.0.0 with scala 2.10.3. I have built Spark with
> "sbt/sbt assembly" and added
> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> to my CLASSPATH variable.
> Then I went to
> "../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples", created a
> new directory "classes" and compiled SparkKMeans.scala with "scalac -d
> classes/ SparkKMeans.scala".
> Then I navigated to "classes" (I commented out this line in the scala file:
> package org.apache.spark.examples ) and tried to run it with "java -cp .
> SparkKMeans", and I got the following error:
> "Exception in thread "main" java.lang.NoClassDefFoundError:
> breeze/linalg/Vector
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
> at java.lang.Class.getMethod0(Class.java:2774)
> at java.lang.Class.getMethod(Class.java:1663)
> at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
> at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
> Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 6 more
> "
> The jar under
> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> contains the breeze/linalg/Vector* path. I even tried to unpack it and put
> it in the CLASSPATH, but it does not seem to pick it up.
>
>
> I am currently running java 1.8
> "java version "1.8.0_05"
> Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)"
>
> What am I doing wrong?
>


Re: Questions about disk IOs

2014-07-01 Thread Xiangrui Meng
Try to reduce the number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
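
(A minimal sketch of the first suggestion, assuming the 50 executors x 2 cores
mentioned above; the helper name is illustrative:)

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Shrink the partition count of the cached training points to roughly one
// partition per executor core (50 executors * 2 cores = 100 in this setup),
// instead of the ~951 cached partitions shown on the storage page.
def repartitionToCores[T](points: RDD[T],
                          numExecutors: Int = 50,
                          coresPerExecutor: Int = 2): RDD[T] = {
  points
    .coalesce(numExecutors * coresPerExecutor) // narrows partitions, no shuffle
    .persist(StorageLevel.MEMORY_AND_DISK_SER)
}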

PR: https://github.com/apache/spark/pull/1110

-Xiangrui

On Tue, Jul 1, 2014 at 12:55 AM, Charles Li  wrote:
> Hi Spark,
>
> I am running LBFGS on our user data. The data size with Kryo serialisation is 
> about 210G. The weight size is around 1,300,000. I am quite confused that the 
> performance is very close whether the data is cached or not.
>
> The program is simple:
> points = sc.hadoopFile(int, SequenceFileInputFormat.class …..)
> points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment it out if not cached
> gradient = new LogisticGradient();
> updater = new SquaredL2Updater();
> initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
> result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
> convergeTol, maxIter, regParam, initWeight);
>
> I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its 
> cluster mode. Below are some arguments I am using:
> —executor-memory 10G
> —num-executors 50
> —executor-cores 2
>
> Storage Using:
> When caching:
> Cached Partitions 951
> Fraction Cached 100%
> Size in Memory 215.7GB
> Size in Tachyon 0.0B
> Size on Disk 1029.7MB
>
> The time cost by every aggregate is around 5 minutes with cache enabled. Lots 
> of disk IOs can be seen on the hadoop node. I have the same result with cache 
> disabled.
>
> Should data points caching improve the performance? Should caching decrease 
> the disk IO?
>
> Thanks in advance.


Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Aditya Varun Chadha
I attended yesterday on ustream.tv, but can't find the links to today's
streams anywhere. help!

-- 
Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Alexis Roos
General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
Track A: http://www.ustream.tv/channel/track-a1
Track B: http://www.ustream.tv/channel/track-b1
Track C: http://www.ustream.tv/channel/track-c1


On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha 
wrote:

> I attended yesterday on ustream.tv, but can't find the links to today's
> streams anywhere. help!
>
> --
> Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
>


Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tobias,

Your explanation makes a lot of sense. Actually, I tried to use partial
data on the same program yesterday. It has been up for around 24 hours and
is still running correctly. Thanks!

Bill


On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer  wrote:

> Bill,
>
> let's say the processing time is t' and the window size t. Spark does not
> *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
> think the way Spark handles it is very nice, in particular since 1) it does
> not mix up the order in which items arrived in the stream, so items from a
> later window will always be processed later, and 2) because an increase in
> data will not be punished with high load and unresponsive systems, but with
> disk space consumption instead.
>
> However, if all of your windows require t' > t processing time (and it's
> not because you are waiting, but because you actually do some computation),
> then you are in bad luck, because if you start processing the next window
> while the previous one is still processed, you have less resources for each
> and processing will take even longer. However, if you are only waiting
> (e.g., for network I/O), then maybe you can employ some asynchronous
> solution where your tasks return immediately and deliver their result via a
> callback later?
>
> Tobias
>
>
>
> On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay 
> wrote:
>
>> Tobias,
>>
>> Your suggestion is very helpful. I will definitely investigate it.
>>
>> Just curious. Suppose the batch size is t seconds. In practice, does
>> Spark always require the program to finish processing the data of t seconds
>> within t seconds' processing time? Can Spark begin to consume the new batch
>> before finishing processing the previous batch? If Spark can do them together,
>> it may save the processing time and solve the problem of data piling up.
>>
>> Thanks!
>>
>> Bill
>>
>>
>>
>>
>> On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer 
>> wrote:
>>
>>> If your batch size is one minute and it takes more than one minute to
>>> process, then I guess that's what causes your problem. The processing of
>>> the second batch will not start after the processing of the first is
>>> finished, which leads to more and more data being stored and waiting for
>>> processing; check the attached graph for a visualization of what I think
>>> may happen.
>>>
>>> Can you maybe do something hacky like throwing away a part of the data
>>> so that processing time gets below one minute, then check whether you still
>>> get that error?
>>>
>>> Tobias
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay 
>>> wrote:
>>>
 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile
 up. Do you have an suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer 
 wrote:

> Bill,
>
> were you able to process all information in time, or did maybe some
> unprocessed data pile up? I think when I saw this once, the reason
> seemed to be that I had received more data than would fit in memory,
> while waiting for processing, so old data was deleted. When it was
> time to process that data, it didn't exist any more. Is that a
> possible reason in your case?
>
> Tobias
>
> On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay 
> wrote:
> > Hi,
> >
> > I am running a spark streaming job with 1 minute as the batch size.
> It ran
> > around 84 minutes and was killed because of the exception with the
> following
> > information:
> >
> > java.lang.Exception: Could not compute split, block
> input-0-1403893740400
> > not found
> >
> >
> > Before it was killed, it was able to correctly generate output for
> each
> > batch.
> >
> > Any help on this will be greatly appreciated.
> >
> > Bill
> >
>


>>>
>>
>


Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tathagata,

Yes. The input stream is from Kafka, and my program reads the data, keeps
all the data in memory, processes the data, and generates the output.

Bill


On Mon, Jun 30, 2014 at 11:45 PM, Tathagata Das  wrote:

> Are you by any chance using only memory in the storage level of the input
> streams?
>
> TD
>
>
> On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer  wrote:
>
>> Bill,
>>
>> let's say the processing time is t' and the window size t. Spark does not
>> *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
>> think the way Spark handles it is very nice, in particular since 1) it does
>> not mix up the order in which items arrived in the stream, so items from a
>> later window will always be processed later, and 2) because an increase in
>> data will not be punished with high load and unresponsive systems, but with
>> disk space consumption instead.
>>
>> However, if all of your windows require t' > t processing time (and it's
>> not because you are waiting, but because you actually do some computation),
>> then you are in bad luck, because if you start processing the next window
>> while the previous one is still processed, you have less resources for each
>> and processing will take even longer. However, if you are only waiting
>> (e.g., for network I/O), then maybe you can employ some asynchronous
>> solution where your tasks return immediately and deliver their result via a
>> callback later?
>>
>> Tobias
>>
>>
>>
>> On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay 
>> wrote:
>>
>>> Tobias,
>>>
>>> Your suggestion is very helpful. I will definitely investigate it.
>>>
>>> Just curious. Suppose the batch size is t seconds. In practice, does
>>> Spark always require the program to finish processing the data of t seconds
>>> within t seconds' processing time? Can Spark begin to consume the new batch
>>> before finishing processing the previous batch? If Spark can do them together,
>>> it may save the processing time and solve the problem of data piling up.
>>>
>>> Thanks!
>>>
>>> Bill
>>>
>>>
>>>
>>>
>>> On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer 
>>> wrote:
>>>
 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start until the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data
 so that processing time gets below one minute, then check whether you still
 get that error?

 Tobias




 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay 
 wrote:

> Tobias,
>
> Thanks for your help. I think in my case, the batch size is 1 minute.
> However, it takes my program more than 1 minute to process 1 minute's
> data. I am not sure whether it is because the unprocessed data pile
> up. Do you have an suggestion on how to check it and solve it? Thanks!
>
> Bill
>
>
> On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer 
> wrote:
>
>> Bill,
>>
>> were you able to process all information in time, or did maybe some
>> unprocessed data pile up? I think when I saw this once, the reason
>> seemed to be that I had received more data than would fit in memory,
>> while waiting for processing, so old data was deleted. When it was
>> time to process that data, it didn't exist any more. Is that a
>> possible reason in your case?
>>
>> Tobias
>>
>> On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay 
>> wrote:
>> > Hi,
>> >
>> > I am running a spark streaming job with 1 minute as the batch size.
>> It ran
>> > around 84 minutes and was killed because of the exception with the
>> following
>> > information:
>> >
>> > java.lang.Exception: Could not compute split, block
>> input-0-1403893740400
>> > not found
>> >
>> >
>> > Before it was killed, it was able to correctly generate output for
>> each
>> > batch.
>> >
>> > Any help on this will be greatly appreciated.
>> >
>> > Bill
>> >
>>
>
>

>>>
>>
>


spark streaming rate limiting from kafka

2014-07-01 Thread Chen Song
In my use case, if I need to stop spark streaming for a while, data would
accumulate a lot on kafka topic-partitions. After I restart spark streaming
job, the worker's heap will go out of memory on the fetch of the 1st batch.

I am wondering if

* Is there a way to throttle reading from kafka in spark streaming jobs?
* Is there a way to control how far Kafka Dstream can read on
topic-partition (via offset for example). By setting this to a small
number, it will force DStream to read less data initially.
* Is there a way to limit the consumption rate at Kafka side? (This one is
not actually for spark streaming and doesn't seem to be a question for this
group. But I am raising it anyway here.)

I have looked at the code example below, but it doesn't seem to be supported.

KafkaUtils.createStream ...
Thanks, All
-- 
Chen Song


Re: Improving Spark multithreaded performance?

2014-07-01 Thread Kyle Ellrott
This all seems pretty hackish and a lot of trouble to get around
limitations in mllib.
The big limitation is that right now, the optimization algorithms work on
one large dataset at a time. We need a second set of methods that work on
a large number of medium-sized datasets.
I've started to code a new set of optimization methods to add into mllib.
I've started with GroupedGradientDescent (
https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala
)

GroupedGradientDescent is based on GradientDescent, but instead it takes
RDD[(Int, (Double, Vector))] as its data input rather than RDD[(Double,
Vector)]. The Int serves as key to mark which elements should be grouped
together. This lets you multiplex several dataset optimizations into the
same RDD.

I think I've gotten the GroupedGradientDescent to work correctly. I need to
go up the stack and start adding methods like SVMWithSGD.trainGroup.

Does anybody have any thoughts on this?

Kyle
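
For readers following along, here is a rough, hypothetical sketch (not the API in
Kyle's branch) of how such keyed input could be assembled, so that many small label
sets over a shared feature set end up multiplexed into one RDD[(Int, (Double, Vector))]:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object GroupedInputSketch {
  // The Int key identifies which of the many small optimization problems a
  // (label, features) pair belongs to, so they can all share one RDD.
  def buildGroupedInput(
      features: RDD[(Long, Vector)],             // row id -> feature vector
      labelSets: Map[Int, RDD[(Long, Double)]]   // group id -> (row id -> label)
    ): RDD[(Int, (Double, Vector))] = {
    labelSets
      .map { case (groupId, labels) =>
        labels.join(features).map { case (_, (label, vec)) => (groupId, (label, vec)) }
      }
      .reduce(_ union _)
  }
}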



On Fri, Jun 27, 2014 at 6:36 PM, Xiangrui Meng  wrote:

> The RDD is cached in only one or two workers. All other executors need
> to fetch its content via network. Since the dataset is not huge, could
> you try this?
>
> val features: Array[Vector] = ...
> val featuresBc = sc.broadcast(features)
>  // parallel loops
>  val labels: Array[Double] =
>  val rdd = sc.parallelize(0 until 1, 1).flatMap(i =>
> featuresBc.value.view.zip(labels))
>  val model = SVMWithSGD.train(rdd)
>  models(i) = model
>
> Using BT broadcast factory would improve the performance of broadcasting.
>
> Best,
> Xiangrui
>
> On Fri, Jun 27, 2014 at 3:06 PM, Kyle Ellrott 
> wrote:
> > 1) I'm using the static SVMWithSGD.train, with no options.
> > 2) I have about 20,000 features (~5000 samples) that are being attached
> and
> > trained against 14,000 different sets of labels (ie I'll be doing 14,000
> > different training runs against the same sets of features trying to
> figure
> > out which labels can be learned), and I would also like to do cross fold
> > validation.
> >
> > The driver doesn't seem to be using too much memory. I left it as -Xmx8g
> and
> > it never complained.
> >
> > Kyle
> >
> >
> >
> > On Fri, Jun 27, 2014 at 1:18 PM, Xiangrui Meng  wrote:
> >>
> >> Hi Kyle,
> >>
> >> A few questions:
> >>
> >> 1) Did you use `setIntercept(true)`?
> >> 2) How many features?
> >>
> >> I'm a little worried about driver's load because the final aggregation
> >> and weights update happen on the driver. Did you check driver's memory
> >> usage as well?
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Fri, Jun 27, 2014 at 8:10 AM, Kyle Ellrott 
> >> wrote:
> >> > As far as I can tell there are is no data to broadcast (unless there
> is
> >> > something internal to mllib that needs to be broadcast) I've coalesced
> >> > the
> >> > input RDDs to keep the number of partitions limited. When running,
> I've
> >> > tried to get up to 500 concurrent stages, and I've coalesced the RDDs
> >> > down
> >> > to 2 partitions, so about 1000 tasks.
> >> > Despite having over 500 threads in the threadpool working on mllib
> >> > tasks,
> >> > the total CPU usage never really goes above 150%.
> >> > I've tried increasing 'spark.akka.threads' but that doesn't seem to do
> >> > anything.
> >> >
> >> > My one thought would be that maybe because I'm using MLUtils.kFold to
> >> > generate the RDDs is that because I have so many tasks working off
> RDDs
> >> > that
> >> > are permutations of original RDDs that maybe that is creating some
> sort
> >> > of
> >> > dependency bottleneck.
> >> >
> >> > Kyle
> >> >
> >> >
> >> > On Thu, Jun 26, 2014 at 6:35 PM, Aaron Davidson 
> >> > wrote:
> >> >>
> >> >> I don't have specific solutions for you, but the general things to
> try
> >> >> are:
> >> >>
> >> >> - Decrease task size by broadcasting any non-trivial objects.
> >> >> - Increase duration of tasks by making them less fine-grained.
> >> >>
> >> >> How many tasks are you sending? I've seen in the past something like
> 25
> >> >> seconds for ~10k total medium-sized tasks.
> >> >>
> >> >>
> >> >> On Thu, Jun 26, 2014 at 12:06 PM, Kyle Ellrott <
> kellr...@soe.ucsc.edu>
> >> >> wrote:
> >> >>>
> >> >>> I'm working to set up a calculation that involves calling mllib's
> >> >>> SVMWithSGD.train several thousand times on different permutations of
> >> >>> the
> >> >>> data. I'm trying to run the separate jobs using a threadpool to
> >> >>> dispatch the
> >> >>> different requests to a spark context connected a Mesos's cluster,
> >> >>> using
> >> >>> course scheduling, and a max of 2000 cores on Spark 1.0.
> >> >>> Total utilization of the system is terrible. Most of the 'aggregate
> at
> >> >>> GradientDescent.scala:178' stages(where mllib spends most of its
> time)
> >> >>> take
> >> >>> about 3 seconds, but have ~25 seconds of scheduler delay time.
> >> >>> What kind of things can I do to improve this?
> >> >>>
> >> >>> Kyle
> >> >>
> >> >>
> >> >
> >
> >
>


Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\


2014-07-01 17:57 GMT+01:00 Chen Song :

> In my use case, if I need to stop spark streaming for a while, data would
> accumulate a lot on kafka topic-partitions. After I restart spark streaming
> job, the worker's heap will go out of memory on the fetch of the 1st batch.
>
> I am wondering if
>
> * Is there a way to throttle reading from kafka in spark streaming jobs?
> * Is there a way to control how far Kafka Dstream can read on
> topic-partition (via offset for example). By setting this to a small
> number, it will force DStream to read less data initially.
> * Is there a way to limit the consumption rate at Kafka side? (This one is
> not actually for spark streaming and doesn't seem to be question in this
> group. But I am raising it anyway here.)
>
> I have looked at code example below but doesn't seem it is supported.
>
> KafkaUtils.createStream ...
> Thanks, All
> --
> Chen Song
>
>


Re: Re: spark table to hive table

2014-07-01 Thread John Omernik
Michael -

Does Spark SQL support rlike and like yet? I am running into that same
error with a basic select * from table where field like '%foo%' using the
hql() function.

Thanks




On Wed, May 28, 2014 at 2:22 PM, Michael Armbrust 
wrote:

> On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung 
> wrote:
>
>>  I already tried HiveContext as well as SqlContext.
>>
>> But it seems that Spark's HiveContext is not completely same as Apache
>> Hive.
>>
>> For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST
>> LIMIT 10' works fine in Apache Hive,
>>
> Spark SQL doesn't support window functions yet (SPARK-1442). Sorry for the
> non-obvious error message!
>


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Are these sessions recorded ?


On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos  wrote:

>
>
>
>
>
>
> General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
> Track A: http://www.ustream.tv/channel/track-a1
> Track B: http://www.ustream.tv/channel/track-b1
> Track C: http://www.ustream.tv/channel/track-c1
>
>
> On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha 
> wrote:
>
>> I attended yesterday on ustream.tv, but can't find the links to today's
>> streams anywhere. help!
>>
>> --
>> Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
>>
>
>


Re: Failed to launch Worker

2014-07-01 Thread Aaron Davidson
Where are you running the spark-class version? Hopefully also on the
workers.

If you're trying to centrally start/stop all workers, you can add a
"slaves" file to the spark conf/ directory which is just a list of your
hosts, one per line. Then you can just use "./sbin/start-slaves.sh" to
start the worker on all of your machines.

Note that this is already setup correctly if you're using the spark-ec2
scripts.


On Tue, Jul 1, 2014 at 5:53 AM, MEETHU MATHEW 
wrote:

> Yes.
>
> Thanks & Regards,
> Meethu M
>
>
>   On Tuesday, 1 July 2014 6:14 PM, Akhil Das 
> wrote:
>
>
>  Is this command working??
>
> java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/
> assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m
> -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077
>
> Thanks
> Best Regards
>
>
> On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW 
> wrote:
>
>
>  Hi ,
>
> I am using Spark Standalone mode with one master and 2 slaves.I am not
>  able to start the workers and connect it to the master using
>
> ./bin/spark-class org.apache.spark.deploy.worker.Worker
> spark://x.x.x.174:7077
>
> The log says
>
>  Exception in thread "main" org.jboss.netty.channel.ChannelException:
> Failed to bind to: master/x.x.x.174:0
>  at
> org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
>  ...
>  Caused by: java.net.BindException: Cannot assign requested address
>
> When I try to start the worker from the slaves using the following java
> command,its running without any exception
>
> java -cp
> ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.worker.Worker spark://:master:7077
>
>
>
>
> Thanks & Regards,
> Meethu M
>
>
>
>
>


Re: Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi Yana,

Yes, that is what I am saying. I need both streams to be at the same pace. I do
have timestamps for each datapoint. There is a way suggested by Tathagata Das
in an earlier post where you use a bigger window than required and you
fetch your required data from that window based on your timestamps. I was just
checking whether there are other, cleaner ways to do it.

Regards
Laeeq
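
A rough sketch of that window-then-trim idea, assuming each record carries its own
epoch-millisecond timestamp (the batch and window sizes below are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, Time}
import org.apache.spark.streaming.dstream.DStream

object TimestampAlignedWindows {
  // Window more data than needed, then keep only the records whose own
  // timestamps fall into the batch interval that ends at `time`. Applying
  // the same trim to both streams gives comparable, timestamp-aligned slices.
  def alignToBatch(stream: DStream[(Long, Double)],
                   batchMillis: Long = 10000L): DStream[(Long, Double)] = {
    stream
      .window(Seconds(30), Seconds(10))
      .transform { (rdd: RDD[(Long, Double)], time: Time) =>
        val end = time.milliseconds
        val start = end - batchMillis
        rdd.filter { case (ts, _) => ts >= start && ts < end }
      }
  }
}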
 


On Tuesday, July 1, 2014 4:23 PM, Yana Kadiyska  wrote:
 


Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define "real time values for both
streams"? I am trying to do something similar to you, I think -- but
I'm not clear on what your notion of time is.
My reading of your example above is that the streams just pump data in
at different rates -- first one got 7462 points in the first batch
interval, whereas stream2 saw 10493


On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed  wrote:
> Hi,
>
> The window size in a spark streaming is time based which means we have
> different number of elements in each window. For example if you have two
> streams (might be more) which are related to each other and you want to
> compare them in a specific time interval. I am not clear how it will work.
> Although they start running simultaneously, they might have different number
> of elements in each time interval.
>
> The following is output for two streams which have same number of elements
> and ran simultaneously. The left most value is the number of elements in
> each window. If we add the number of elements them, they are same for both
> streams but we can't compare both streams as they are different in window
> size and number of windows.
>
> Can we somehow make windows based on real time values for both streams? or
> Can we make windows based on number of elements?
>
> (n, (mean, varience, SD))
> Stream 1
> (7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
> (44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
> (245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
> (154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
> (156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
> (156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
> (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))
>
> Stream 2
> (10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
> (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
> (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
> (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
> (269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
> (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))
>
> Regards,
> Laeeq
>
>

PySpark Driver from Jython

2014-07-01 Thread Surendranauth Hiraman
Has anyone tried running pyspark driver code in Jython, preferably by
calling python code within Java code?

I know CPython is the only interpreter tested because of the need to
support C extensions.

But in my case, C extensions would be called on the worker, not in the
driver.

And being able to execute the python driver from within my JVM is an
advantage in my current use case.

-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


Spark SQL : Join throws exception

2014-07-01 Thread Subacini B
Hi All,

Running this join query
 sql("SELECT * FROM  A_TABLE A JOIN  B_TABLE B WHERE
A.status=1").collect().foreach(println)

throws

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 1.0:3 failed 4 times, most recent failure: Exception
failure in TID 12 on host X.X.X.X:
*org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
No function to evaluate expression. type: UnresolvedAttribute, tree:
'A.status*

org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)

org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147)

org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100)

org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)

org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.Task.run(Task.scala:51)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
java.lang.Thread.run(Thread.java:695)
Driver stacktrace:

Can someone help me.

Thanks in advance.


why is toBreeze private everywhere in mllib?

2014-07-01 Thread Koert Kuipers
its kind of handy to be able to convert stuff to breeze... is there some
other way i am supposed to access that functionality?


[ANNOUNCE] Flambo - A Clojure DSL for Apache Spark

2014-07-01 Thread Soren Macbeth
Yieldbot is pleased to announce the release of Flambo, our Clojure DSL for
Apache Spark.

Flambo allows one to write spark applications in pure Clojure as an
alternative to Scala, Java and Python currently available in Spark.

We have already written a substantial amount of internal code in clojure
using flambo and we are excited to hear and see what others will come up
with.

As ever, Pull Request and/or Issues on Github are greatly appreciated!

You can find links to source, api docs and literate source code here:

http://bit.ly/V8FmzC

-- @sorenmacbeth


Re: why is toBreeze private everywhere in mllib?

2014-07-01 Thread Xiangrui Meng
We were not ready to expose it as a public API in v1.0. Both breeze
and MLlib are in rapid development. It would be possible to expose it
as a developer API in v1.1. For now, it should be easy to define a
toBreeze method in your own project. -Xiangrui
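
A minimal sketch of such a project-local helper (it mirrors, but does not expose,
MLlib's internal conversion; check it against the breeze version you build with):

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

object BreezeConversions {
  implicit class RichMLlibVector(v: Vector) {
    def toBreeze: BV[Double] = v match {
      case dv: DenseVector  => new BDV[Double](dv.values)
      case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
    }
  }
}

After import BreezeConversions._, any MLlib Vector picks up a toBreeze method in your own code.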

On Tue, Jul 1, 2014 at 12:17 PM, Koert Kuipers  wrote:
> its kind of handy to be able to convert stuff to breeze... is there some
> other way i am supposed to access that functionality?


Re: Spark SQL : Join throws exception

2014-07-01 Thread Yin Huai
Seems it is a bug. I have opened
https://issues.apache.org/jira/browse/SPARK-2339 to track it.

Thank you for reporting it.

Yin


On Tue, Jul 1, 2014 at 12:06 PM, Subacini B  wrote:

> Hi All,
>
> Running this join query
>  sql("SELECT * FROM  A_TABLE A JOIN  B_TABLE B WHERE
> A.status=1").collect().foreach(println)
>
> throws
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 1.0:3 failed 4 times, most recent failure:
> Exception failure in TID 12 on host X.X.X.X: 
> *org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
> No function to evaluate expression. type: UnresolvedAttribute, tree:
> 'A.status*
>
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)
>
> org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147)
>
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100)
>
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
>
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137)
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
> org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
> org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> java.lang.Thread.run(Thread.java:695)
> Driver stacktrace:
>
> Can someone help me.
>
> Thanks in advance.
>
>


spark-submit script and spark.files.userClassPathFirst

2014-07-01 Thread _soumya_
Hi,
 I'm trying to get rid of an error (NoSuchMethodError) while using Amazon's
s3 client on Spark. I'm using the Spark Submit script to run my code.
Reading about my options and other threads, it seemed the most logical way
would be to make sure my jar is loaded first. Spark submit on debug shows the
same: 

spark.files.userClassPathFirst -> true

However, I can't seem to get rid of the error at runtime. The issue I
believe is happening because Spark has the older httpclient library in its
classpath. I'm using:


<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.3</version>
</dependency>


Any clues what might be happening?

Stack trace below:


Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:138)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:112)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:101)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:87)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:95)
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:158)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:357)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:339)
at com.evocalize.rickshaw.commons.s3.S3Util.getS3Client(S3Util.java:35)
at com.evocalize.rickshaw.commons.s3.S3Util.putFileFromLocalFS(S3Util.java:40)
at com.evocalize.rickshaw.spark.actions.PackageFilesFunction.call(PackageFilesFunction.java:48)
at com.evocalize.rickshaw.spark.applications.GenerateSEOContent.main(GenerateSEOContent.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-script-and-spark-files-userClassPathFirst-tp8604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Re: spark table to hive table

2014-07-01 Thread Michael Armbrust
We do support LIKE and RLIKE, though up until recently the keywords were
incorrectly case sensitive.
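
For reference, a minimal sketch of the kind of query John describes, run through
hql() on a HiveContext (the table and column names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object LikeQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LikeQuerySketch"))
    val hiveContext = new HiveContext(sc)

    // Upper-case LIKE/RLIKE keywords avoid the old case-sensitivity issue.
    hiveContext.hql("SELECT * FROM my_table WHERE field LIKE '%foo%'")
      .collect()
      .foreach(println)

    hiveContext.hql("SELECT * FROM my_table WHERE field RLIKE 'foo[0-9]+'")
      .collect()
      .foreach(println)
  }
}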


On Tue, Jul 1, 2014 at 11:16 AM, John Omernik  wrote:

> Michael -
>
> Does Spark SQL support rlike and like yet? I am running into that same
> error with a basic select * from table where field like '%foo%' using the
> hql() funciton.
>
> Thanks
>
>
>
>
> On Wed, May 28, 2014 at 2:22 PM, Michael Armbrust 
> wrote:
>
>> On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung 
>> wrote:
>>
>>>  I already tried HiveContext as well as SqlContext.
>>>
>>> But it seems that Spark's HiveContext is not completely same as Apache
>>> Hive.
>>>
>>> For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST
>>> LIMIT 10' works fine in Apache Hive,
>>>
>> Spark SQL doesn't support window functions yet (SPARK-1442). Sorry for the
>> non-obvious error message!
>>
>
>


slf4j multiple bindings

2014-07-01 Thread Bill Jay
Hi all,

I have an issue with multiple slf4j bindings. My program was running
correctly. I just added the new dependency kryo. And when I submitted a
job, the job was killed because of the following error messages:

*SLF4J: Class path contains multiple SLF4J bindings.*


The log said there were three slf4j bindings:
spark-assembly-0.9.1-hadoop2.3.0.jar, hadoop lib, and my own jar file.

However, I did not explicitly add slf4j in my pom.xml file. I added
exclusions in the dependency of kryo, but it did not work. Does anyone have
an idea how to fix this issue? Thanks!

Regards,

Bill


Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mohammed Guller
I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3 
worker). Our app is fetching data from Cassandra and doing a basic filter, map, 
and countByKey on that data. I have run into a strange problem. Even if the 
number of rows in Cassandra is just 1M, the Spark job seems to go into an
infinite loop and runs for hours. With a small amount of data (less than 100 
rows), the job does finish, but takes almost 30-40 seconds and we frequently 
see the messages shown below. If we run the same application on a single node 
Spark (--master local[4]), then we don't see these warnings and the task 
finishes in less than 6-7 seconds. Any idea what could be the cause for these 
problems when we run our application on a standalone 4-node spark cluster?

14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1

Re: multiple passes in mapPartitions

2014-07-01 Thread Chris Fregly
also, multiple calls to mapPartitions() will be pipelined by the spark
execution engine into a single stage, so the overhead is minimal.


On Fri, Jun 13, 2014 at 9:28 PM, zhen  wrote:

> Thank you for your suggestion. We will try it out and see how it performs.
> We
> think the single call to mapPartitions will be faster but we could be
> wrong.
> It would be nice to have a "clone method" on the iterator.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/multiple-passes-in-mapPartitions-tp7555p7616.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Yana Kadiyska
A lot of things can get funny when you run distributed as opposed to
local -- e.g. some jar not making it over. Do you see anything of
interest in the log on the executor machines -- I'm guessing
192.168.222.152/192.168.222.164. From here
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
seems like the warning message is logged after the task fails -- but I
wonder if you might see something more useful as to why it failed to
begin with. As an example we've had cases in Hdfs where a small
example would work, but on a larger example we'd hit a bad file. But
the executor log is usually pretty explicit as to what happened...

On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller  wrote:
> I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3
> worker). Our app is fetching data from Cassandra and doing a basic filter,
> map, and countByKey on that data. I have run into a strange problem. Even if
> the number of rows in Cassandra is just 1M, the Spark job goes seems to go
> into an infinite loop and runs for hours. With a small amount of data (less
> than 100 rows), the job does finish, but takes almost 30-40 seconds and we
> frequently see the messages shown below. If we run the same application on a
> single node Spark (--master local[4]), then we don’t see these warnings and
> the task finishes in less than 6-7 seconds. Any idea what could be the cause
> for these problems when we run our application on a standalone 4-node spark
> cluster?
>
>
>
> 14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
>
> 14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
>
> 14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
>
> 14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
>
> 14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(0, 192.168.222.142, 39342, 0)
>
> 14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
>
> 14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(0, 192.168.222.142, 39342, 0)
>
> 14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
>
> 14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
>
> 14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
>
> 14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
>
> 14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
>
> 14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
>
> 14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
>
> 14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
>
> 14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
>
> 14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
>
> 14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
>
> 14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
>
> 14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
>
> 14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 

Re: Fw: How Spark Choose Worker Nodes for respective HDFS block

2014-07-01 Thread Chris Fregly
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL)
where possible just like MapReduce.  it's a best practice to co-locate your
Spark Workers on the same nodes as your HDFS Name Nodes for just this
reason.

this is achieved through the RDD.preferredLocations() interface method:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

on a related note, you can configure spark.locality.wait as the number of
millis to wait before falling back to a less-local data node (RACK_LOCAL):
  http://spark.apache.org/docs/latest/configuration.html

-chris
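
A rough sketch of the locality knob mentioned above (the values are made up; in
Spark 1.0 spark.locality.wait is given in milliseconds):

import org.apache.spark.{SparkConf, SparkContext}

object LocalityTuningSketch {
  def main(args: Array[String]): Unit = {
    // Wait up to 10 seconds for a node-local slot before the scheduler
    // falls back to a less local one.
    val conf = new SparkConf()
      .setAppName("LocalityTuningSketch")
      .set("spark.locality.wait", "10000")
    val sc = new SparkContext(conf)

    // HDFS-backed RDDs report their block locations to the scheduler via
    // preferredLocations, which is what the wait above applies to.
    val logs = sc.textFile("hdfs:///data/events")   // assumed path
    println(logs.count())
  }
}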


On Fri, Jun 13, 2014 at 11:06 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:

> Hi All
>
> Is there any communication between Spark MASTER node and Hadoop NameNode
> while distributing work to WORKER nodes, like we have in MapReduce.
>
> Please suggest
>
> TIA
>
> --
> Anish Sneh
> "Experience is the best teacher."
> http://in.linkedin.com/in/anishsneh
>
>
>  --
> From: anishs...@yahoo.co.in
> To: u...@spark.incubator.apache.org
> Subject: How Spark Choose Worker Nodes for respective HDFS block
> Sent: Fri, Jun 13, 2014 9:17:50 PM
>
>   Hi All
>
> I am new to Spark, workin on 3 node test cluster. I am trying to explore
> Spark scope in analytics, my Spark codes interacts with HDFS mostly.
>
> I have a confusion that how Spark choose on which node it will distribute
> its work.
>
> Since we assume that it can be an alternative to Hadoop MapReduce. In
> MapReduce we know that internally framework will distribute code (or logic)
> to the nearest TaskTracker which are co-located with DataNode or in same
> rack or probably nearest to the data blocks.
>
> My confusion is when I give HDFS path inside a Spark program how it choose
> which node is nearest (if it does).
>
> If it does not then how it will work when I have TBs of data where high
> network latency will be involved.
>
> My apologies for asking basic question, please suggest.
>
> TIA
> --
> Anish Sneh
> "Experience is the best teacher."
> http://www.anishsneh.com
>


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Marco Shaw
They are recorded...  For example, 2013: http://spark-summit.org/2013

I'm assuming the 2014 videos will be up in 1-2 weeks.

Marco


On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta 
wrote:

> Are these sessions recorded ?
>
>
> On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos  wrote:
>
>>
>>
>>
>>
>>
>>
>> General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
>> Track A: http://www.ustream.tv/channel/track-a1
>> Track B: http://www.ustream.tv/channel/track-b1
>> Track C: http://www.ustream.tv/channel/track-c1
>>
>>
>> On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha 
>> wrote:
>>
>>> I attended yesterday on ustream.tv, but can't find the links to today's
>>> streams anywhere. help!
>>>
>>> --
>>> Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
>>>
>>
>>
>


Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi,

On Wed, Jul 2, 2014 at 1:57 AM, Chen Song  wrote:
>
> * Is there a way to control how far Kafka Dstream can read on
> topic-partition (via offset for example). By setting this to a small
> number, it will force DStream to read less data initially.
>

Please see the post at
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
Kafka's auto.offset.reset parameter may be what you are looking for.

Tobias
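
As a rough sketch of wiring that parameter through (the ZooKeeper address, group id
and topic below are placeholders, and auto.offset.reset only kicks in when the
consumer group has no usable committed offset):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaOffsetResetSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaOffsetResetSketch"), Seconds(10))

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181",          // assumed ZooKeeper quorum
      "group.id"          -> "my-consumer-group", // assumed consumer group
      // Start from the latest offset instead of replaying the whole backlog
      // (only applies when no valid committed offset exists for the group):
      "auto.offset.reset" -> "largest",
      // Kafka consumer knob that caps how much a single fetch can return:
      "fetch.message.max.bytes" -> "1048576")

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)

    stream.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}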


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Matei Zaharia
Yup, we’re going to try to get the videos up as soon as possible.

Matei

On Jul 1, 2014, at 7:47 PM, Marco Shaw  wrote:

> They are recorded...  For example, 2013: http://spark-summit.org/2013
> 
> I'm assuming the 2014 videos will be up in 1-2 weeks.
> 
> Marco
> 
> 
> On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta  
> wrote:
> Are these sessions recorded ? 
> 
> 
> On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos  wrote:
> General Session / Keynotes :  http://www.ustream.tv/channel/spark-summit-2014
> 
> Track A : http://www.ustream.tv/channel/track-a1
> 
> Track B: http://www.ustream.tv/channel/track-b1
> 
> Track C: http://www.ustream.tv/channel/track-c1
> 
> 
> On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha  wrote:
> I attended yesterday on ustream.tv, but can't find the links to today's 
> streams anywhere. help!
> 
> -- 
> Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
> 
> 
> 



Re: [ANNOUNCE] Flambo - A Clojure DSL for Apache Spark

2014-07-01 Thread Matei Zaharia
Very cool, Soren, thanks for announcing this! It looks like it didn’t actually 
require a huge amount of new code either, which is great.

Matei

On Jul 1, 2014, at 12:31 PM, Soren Macbeth  wrote:

> Yieldbot is pleased to announce the release of Flambo, our Clojure DSL for 
> Apache Spark.
> 
> Flambo allows one to write spark applications in pure Clojure as an 
> alternative to Scala, Java and Python currently available in Spark.
> 
> We have already written a substantial amount of internal code in clojure 
> using flambo and we are excited to hear and see what other will come up with.
> 
> As ever, Pull Request and/or Issues on Github are greatly appreciated!
> 
> You can find links to source, api docs and literate source code here:
> 
> http://bit.ly/V8FmzC
> 
> -- @sorenmacbeth



Re: Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Matei Zaharia
I’d suggest asking the IBM Hadoop folks, but my guess is that the library 
cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this 
exception is happening in your driver program, the driver program’s 
java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh 
only applies to stuff launched on the clusters).

Matei
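
As a loosely sketched illustration (property names as documented for Spark 1.0, so
verify them for your build; note the driver's own java.library.path has to be set
when its JVM is launched, e.g. via spark-submit's --driver-library-path, rather than
from inside the program):

import org.apache.spark.{SparkConf, SparkContext}

object CmxReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CmxReadSketch")
      // Executor-side settings can be passed programmatically; the paths below
      // are taken from the spark-env.sh quoted in this thread.
      .set("spark.executor.extraClassPath", "/opt/IHC/lib/compression.jar")
      .set("spark.executor.extraLibraryPath", "/opt/IHC/lib/native/Linux-amd64-64")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs:///data/sample.cmx")   // assumed path to a CMX-compressed file
      .take(5)
      .foreach(println)
  }
}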

On Jul 1, 2014, at 7:15 AM, Uddin, Nasir M.  wrote:

> Dear Spark Users:
>  
> Spark 1.0 has been installed as Standalone – But it can’t read any compressed 
> (CMX/Snappy) and Sequence file residing on HDFS (it can read uncompressed 
> files from HDFS). The key notable message is: “Unable to load native-hadoop 
> library…..”. Other related messages are –
>  
> Caused by: java.lang.IllegalStateException: Cannot load 
> com.ibm.biginsights.compress.CmxDecompressor without native library! at 
> com.ibm.biginsights.compress.CmxDecompressor.(CmxDecompressor.java:65)
>  
> Here is the core-site.xml’s key part:
> io.compression.codecs
> org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.ibm.biginsights.compress.CmxCodec
>   
>  
> Here is the spark.env.sh:
> export SPARK_WORKER_CORES=4
> export SPARK_WORKER_MEMORY=10g
> export SCALA_HOME=/opt/spark/scala-2.11.1
> export JAVA_HOME=/opt/spark/jdk1.7.0_55
> export SPARK_HOME=/opt/spark/spark-0.9.1-bin-hadoop2
> export ADD_JARS=/opt/IHC/lib/compression.jar
> export SPARK_CLASSPATH=/opt/IHC/lib/compression.jar
> export SPARK_LIBRARY_PATH=/opt/IHC/lib/native/Linux-amd64-64/
> export SPARK_MASTER_WEBUI_PORT=1080
> export HADOOP_CONF_DIR=/opt/IHC/hadoop-conf
>  
> Note: core-site.xml and hdfs-site.xml are in hadoop-conf. CMX is an IBM 
> branded splittable LZO based compression codec.
>  
> Any help to resolve the issue is appreciated.
>  
> Thanks,
> Nasir
> 
> DTCC DISCLAIMER: This email and any files transmitted with it are 
> confidential and intended solely for the use of the individual or entity to 
> whom they are addressed. If you have received this email in error, please 
> notify us immediately and delete the email and any attachments from your 
> system. The recipient should check this email and any attachments for the 
> presence of viruses.  The company accepts no liability for any damage caused 
> by any virus transmitted by this email.



Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Awesome. 

Just want to catch up on some sessions from other tracks. 

Learned a ton over the last two days. 

Thanks 
Soumya 






> On Jul 1, 2014, at 8:50 PM, Matei Zaharia  wrote:
> 
> Yup, we’re going to try to get the videos up as soon as possible.
> 
> Matei
> 
>> On Jul 1, 2014, at 7:47 PM, Marco Shaw  wrote:
>> 
>> They are recorded...  For example, 2013: http://spark-summit.org/2013
>> 
>> I'm assuming the 2014 videos will be up in 1-2 weeks.
>> 
>> Marco
>> 
>> 
>>> On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta  
>>> wrote:
>>> Are these sessions recorded ? 
>>> 
>>> 
 On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos  wrote:
 General Session / Keynotes :  
 http://www.ustream.tv/channel/spark-summit-2014
 
 Track A : http://www.ustream.tv/channel/track-a1
 
 Track B: http://www.ustream.tv/channel/track-b1
 
 Track C: http://www.ustream.tv/channel/track-c1
 
 
> On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha  
> wrote:
> I attended yesterday on ustream.tv, but can't find the links to today's 
> streams anywhere. help!
> 
> -- 
> Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
> 


Re: Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW
I am running the ./bin/spark-class from the workers.

I have added my slaves in the conf/slaves file. Both ./sbin/start-all.sh and 
./sbin/start-slaves.sh are returning the "Failed to launch Worker" exception with 
the log shown in my first mail.
 
I am using standalone spark cluster with hadoop 1.2.1


Thanks & Regards, 
Meethu M


On Tuesday, 1 July 2014 11:52 PM, Aaron Davidson  wrote:
 


Where are you running the spark-class version? Hopefully also on the workers.

If you're trying to centrally start/stop all workers, you can add a "slaves" 
file to the spark conf/ directory which is just a list of your hosts, one per 
line. Then you can just use "./sbin/start-slaves.sh" to start the worker on all 
of your machines.

Note that this is already setup correctly if you're using the spark-ec2 scripts.



On Tue, Jul 1, 2014 at 5:53 AM, MEETHU MATHEW  wrote:

Yes.
> 
>Thanks & Regards, 
>Meethu M
>
>
>
>On Tuesday, 1 July 2014 6:14 PM, Akhil Das  wrote:
> 
>
>
>Is this command working??
>
>
>java -cp 
>::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
>org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077
>
>
>
>Thanks
>Best Regards
>
>
>On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW  wrote:
>
>
>>
>> Hi ,
>>
>>
>>I am using Spark Standalone mode with one master and 2 slaves.I am not  able 
>>to start the workers and connect it to the master using 
>>
>>
>>./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077
>>
>>
>>The log says
>>
>>
>>Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed 
>>to bind to: master/x.x.x.174:0
>>at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
>>...
>>Caused by: java.net.BindException: Cannot assign requested address
>>
>>
>>When I try to start the worker from the slaves using the following java 
>>command,its running without any exception
>>
>>
>>java -cp 
>>::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
>> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
>>org.apache.spark.deploy.worker.Worker spark://:master:7077
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>Thanks & Regards, 
>>Meethu M
>
>
>

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread Aaron Davidson
In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of
that kind? This error suggests the worker is trying to bind a server on the
master's IP, which clearly doesn't make sense.



On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW 
wrote:

> Hi,
>
> I did netstat -na | grep 192.168.125.174 and its showing
> 192.168.125.174:7077 LISTEN(after starting master)
>
> I tried to execute the following script from the slaves manually but it
> ends up with the same exception and log.This script is internally executing
> the java command.
>  /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077
> In this case netstat is showing any connection established to master:7077.
>
> When we manually execute the java command,the connection is getting
> established to master.
>
> Thanks & Regards,
> Meethu M
>
>
>   On Monday, 30 June 2014 6:38 PM, Akhil Das 
> wrote:
>
>
>  Are you sure you have this ip 192.168.125.174 bind for that machine? (netstat -na | grep 192.168.125.174)
>
> Thanks
> Best Regards
>
>
> On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW 
> wrote:
>
> Hi all,
>
> I reinstalled spark,reboot the system,but still I am not able to start the
> workers.Its throwing the following exception:
>
> Exception in thread "main" org.jboss.netty.channel.ChannelException:
> Failed to bind to: master/192.168.125.174:0
>
> I doubt the problem is with 192.168.125.174:0. Eventhough the command
> contains master:7077,why its showing 0 in the log.
>
> java -cp
> ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.deploy.worker.Worker spark://master:7077
>
> Can somebody tell me  a solution.
>
> Thanks & Regards,
> Meethu M
>
>
>   On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW 
> wrote:
>
>
>  Hi,
> ya I tried setting another PORT also,but the same problem..
> master is set in etc/hosts
>
> Thanks & Regards,
> Meethu M
>
>
>   On Friday, 27 June 2014 3:23 PM, Akhil Das 
> wrote:
>
>
> tha's strange, did you try setting the master port to something else (use
> SPARK_MASTER_PORT).
>
> Also you said you are able to start it from the java commandline
>
> java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/
> assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m
> -Xmx512m org.apache.spark.deploy.worker.Worker spark://:*master*:7077
>
> What is the master ip specified here? is it like you have entry for
> *master* in the /etc/hosts?
>
> Thanks
> Best Regards
>
>
> On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW 
> wrote:
>
> Hi Akhil,
>
> I am running it in a LAN itself..The IP of the master is given correctly.
>
> Thanks & Regards,
> Meethu M
>
>
>   On Friday, 27 June 2014 2:51 PM, Akhil Das 
> wrote:
>
>
>  why is it binding to port 0? 192.168.125.174:0 :/
>
> Check the ip address of that master machine (ifconfig) looks like the ip
> address has been changed (hoping you are running this machines on a LAN)
>
> Thanks
> Best Regards
>
>
> On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW 
> wrote:
>
> Hi all,
>
> My Spark(Standalone mode) was running fine till yesterday.But now I am
> getting  the following exeception when I am running start-slaves.sh or
> start-all.sh
>
> slave3: failed to launch org.apache.spark.deploy.worker.Worker:
> slave3:   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> slave3:   at java.lang.Thread.run(Thread.java:662)
>
> The log files has the following lines.
>
> 14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j
> profile: org/apache/spark/log4j-defaults.properties
> 14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
> 14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hduser)
> 14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
> 14/06/27 11:06:30 INFO Remoting: Starting remoting
> Exception in thread "main" org.jboss.netty.channel.ChannelException:
> Failed to bind to: master/192.168.125.174:0
>  at
> org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> ...
> Caused by: java.net.BindException: Cannot assign requested address
>  ...
> I saw the same error reported before and have tried the following
> solutions.
>
> Set the variable SPARK_LOCAL_IP ,Changed the SPARK_MASTER_PORT to a
> different number..But nothing is working.
>
> When I try to start the worker from the respective machines using the
> following java command,its running without any exception
>
> java -cp
> ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
> -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
> org.apache.spark.d

Re: multiple passes in mapPartitions

2014-07-01 Thread Frank Austin Nothaft
Hi Zhen,

The Scala iterator trait supports cloning via the duplicate method 
(http://www.scala-lang.org/api/current/index.html#scala.collection.Iterator@duplicate:(Iterator[A],Iterator[A])).

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
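
A small sketch of that inside mapPartitions (note that duplicate buffers whatever one
copy has consumed ahead of the other, so a full first pass keeps the partition's data
in memory):

import org.apache.spark.{SparkConf, SparkContext}

object TwoPassMapPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TwoPassMapPartitions"))
    val data = sc.parallelize(1 to 1000, 4)

    val normalized = data.mapPartitions { iter =>
      val (first, second) = iter.duplicate
      val partitionSum = first.map(_.toDouble).sum   // pass 1: a per-partition statistic
      second.map(_ / partitionSum)                   // pass 2: use it on every element
    }
    normalized.take(5).foreach(println)
  }
}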

On Jun 13, 2014, at 9:28 PM, zhen  wrote:

> Thank you for your suggestion. We will try it out and see how it performs. We
> think the single call to mapPartitions will be faster but we could be wrong.
> It would be nice to have a "clone method" on the iterator.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/multiple-passes-in-mapPartitions-tp7555p7616.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mayur Rustagi
It could be because you are out of memory on the worker nodes & blocks are
not getting registered.
An older issue with 0.6.0 was with dead nodes causing loss of tasks & then
resubmission of data in an infinite loop... It was fixed in 0.7.0 though.
Are you seeing a crash log in this log, or in the worker log @ 192.168.222.164
or on any of the machines where the crash log is displayed?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska 
wrote:

> A lot of things can get funny when you run distributed as opposed to
> local -- e.g. some jar not making it over. Do you see anything of
> interest in the log on the executor machines -- I'm guessing
> 192.168.222.152/192.168.222.164. From here
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
> seems like the warning message is logged after the task fails -- but I
> wonder if you might see something more useful as to why it failed to
> begin with. As an example we've had cases in Hdfs where a small
> example would work, but on a larger example we'd hit a bad file. But
> the executor log is usually pretty explicit as to what happened...
>
> On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller 
> wrote:
> > I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3
> > workers). Our app is fetching data from Cassandra and doing a basic
> > filter, map, and countByKey on that data. I have run into a strange
> > problem. Even if the number of rows in Cassandra is just 1M, the Spark
> > job seems to go into an infinite loop and runs for hours. With a small
> > amount of data (less than 100 rows), the job does finish, but takes
> > almost 30-40 seconds and we frequently see the messages shown below. If
> > we run the same application on a single node Spark (--master local[4]),
> > then we don't see these warnings and the task finishes in less than 6-7
> > seconds. Any idea what could be the cause for these problems when we run
> > our application on a standalone 4-node spark cluster?
> >
> >
> >
> > 14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
> >
> > 14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
> >
> > 14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
> >
> > 14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
> >
> > 14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(0, 192.168.222.142, 39342, 0)
> >
> > 14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
> >
> > 14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(0, 192.168.222.142, 39342, 0)
> >
> > 14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
> >
> > 14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
> >
> > 14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
> >
> > 14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
> >
> > 14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
> >
> > 14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
> >
> > 14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
> >
> > 14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
> >
> > 14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from
> > BlockManagerId(2, 192.168.222.164, 57185, 0)
> >
> > 14/
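
For reference, the filter/map/countByKey pipeline described in the message above has roughly the shape sketched below; fetchRowsFromCassandra and the (String, Int) layout are hypothetical placeholders, since the actual connector code is not shown in the thread. The countByKey step is the shuffle whose map output the reducers are failing to fetch in the warnings.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in pair-RDD functions such as countByKey
import org.apache.spark.rdd.RDD

// Placeholder for the Cassandra read; substitute whatever connector call you use.
def fetchRowsFromCassandra(sc: SparkContext): RDD[(String, Int)] = ???

def runJob(sc: SparkContext): scala.collection.Map[String, Long] =
  fetchRowsFromCassandra(sc)
    .filter { case (_, value) => value > 0 }   // basic filter
    .map { case (key, _) => (key, 1) }         // basic map
    .countByKey()                              // the shuffle behind the fetch-failure warnings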

Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-01 Thread Mayur Rustagi
Ideally you should be converting the RDD to a SchemaRDD?
Are you creating a UnionRDD to join across DStream RDDs?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi  wrote:

> Hi,
> I am trying to run a project which takes data as a DStream and dumps the
> data in the Shark table after various operations. I am getting the
> following error :
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
> Task 0.0:0 failed 1 times (most recent failure: Exception failure:
> java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot
> be cast to org.apache.spark.rdd.HadoopPartition)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Can someone please explain the cause of this error? I am also using a
> SparkContext along with the existing StreamingContext.
>


Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Wanda Hawk
I want to make some minor modifications in SparkKMeans.scala, so running the
basic example won't do.
I have also packed my code into a jar file with sbt. It completes
successfully, but when I try to run it with "java -jar myjar.jar" I get the same
error:
"Exception in thread "main" java.lang.NoClassDefFoundError: breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
"

If "scalac -d classes/ SparkKMeans.scala" can't see my classpath, why does it 
succeeds in compiling and does not give the same error ? 
The error itself "NoClassDefFoundError" means that the files are available at 
compile time, but for some reason I cannot figure out they are not available at 
run time. Does anyone know why ?

Thank you


On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng  wrote:
 


You can use either bin/run-example or bin/spark-submit to run example
code. "scalac -d classes/ SparkKMeans.scala" doesn't recognize the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui


On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk  wrote:
> Hello,
>
> I have installed spark-1.0.0 with scala2.10.3. I have built spark with
> "sbt/sbt assembly" and added
> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> to my CLASSPATH variable.
> Then I went here
> "../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples" created a
> new directory "classes" and compiled SparkKMeans.scala with "scalac -d
> classes/ SparkKMeans.scala"
> Then I navigated to "classes" (I commented this line in the scala file :
> package org.apache.spark.examples ) and tried to run it with "java -cp .
> SparkKMeans" and I get the following error:
> "Exception in thread "main" java.lang.NoClassDefFoundError:
> breeze/linalg/Vector
>         at java.lang.Class.getDeclaredMethods0(Native Method)
>         at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>         at java.lang.Class.getMethod0(Class.java:2774)
>         at java.lang.Class.getMethod(Class.java:1663)
>         at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>         at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
> Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         ... 6 more
> "
> The jar under
> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> contains the breeze/linalg/Vector* path. I even tried to unpack it and put
> it in CLASSPATH but it does not seem to pick it up.
>
>
> I am currently running java 1.8
> "java version "1.8.0_05"
> Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)"
>
> What am I doing wrong?
>

Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-01 Thread Honey Joshi
On Wed, July 2, 2014 1:11 am, Mayur Rustagi wrote:
> Ideally you should be converting RDD to schemardd ?
> You are creating UnionRDD to join across dstream rdd?
>
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
>
> On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi  wrote:
>
>> Hi,
>> I am trying to run a project which takes data as a DStream and dumps the
>>  data in the Shark table after various operations. I am getting the
>> following error :
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
>> Task 0.0:0 failed 1 times (most recent failure: Exception failure:
>> java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot
>> be cast to org.apache.spark.rdd.HadoopPartition)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>> at scala.Option.foreach(Option.scala:236)
>> at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>>
>> Can someone please explain the cause of this error? I am also using a
>> SparkContext along with the existing StreamingContext.
>>
>>
>

I am using Spark 0.9.0-incubating, so it doesn't have anything to do with
SchemaRDD. This error is probably coming from my trying to use one Spark
context and one Shark context in the same job. Is there any way to
incorporate two contexts in one job?
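
For what it's worth, the usual way around this is to build the StreamingContext on top of the one context you already have instead of constructing a second, independent one. A minimal sketch, assuming your SharkContext can stand in for the SparkContext below (SharkContext extends SparkContext in the Shark releases I have seen):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-to-shark")
val sc = new SparkContext(conf)                  // or your SharkContext, reused everywhere
val ssc = new StreamingContext(sc, Seconds(10))  // built on the same context, not a second one

// ... define the DStream operations here; inside foreachRDD the same sc is in scope ...

ssc.start()
ssc.awaitTermination()
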
Regards

Honey Joshi
Ideata-Analytics



Re: Help understanding spark.task.maxFailures

2014-07-01 Thread Mayur Rustagi
stragglers?
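
Successful attempts can exceed the original task count when stages are resubmitted after fetch failures or when speculative execution launches duplicate attempts; spark.task.maxFailures only caps failures of a given task, not re-runs of tasks that already succeeded. A quick, hedged check on the speculation side, assuming you control the SparkConf used to build the context:

import org.apache.spark.SparkConf

val conf = new SparkConf()
// "false" is the documented default; an explicit true here would explain duplicate successes.
println("spark.speculation = " + conf.get("spark.speculation", "false"))
// While debugging, pin it off so extra attempts can only come from stage resubmission.
conf.set("spark.speculation", "false")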

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Tue, Jul 1, 2014 at 12:40 AM, Yana Kadiyska 
wrote:

> Hi community, this one should be an easy one:
>
> I have left spark.task.maxFailures at its default (which should be
> 4). I see a job that shows the following statistics for Tasks:
> Succeeded/Total
>
> 7109/819 (1 failed)
>
> So there were 819 tasks to start with. I have 2 executors in that
> cluster. The Spark docs say spark.task.maxFailures is the number
> of times to try a task before giving up on the job. So I was imagining
> that 819*4 (i.e. 3276) would be the maximum number ever to appear in the
> succeeded count (accounting for retries of every possible task). Even
> 3276*2 (6552, if it's per task per executor) does not account for 7109
> successful tasks.
>
> Could anyone help explain why I'm seeing such a high number of succeeded
> tasks?
>


Re: spark streaming counter metrics

2014-07-01 Thread Mayur Rustagi
You may be able to mix StreamingListener & SparkListener to get meaningful
information about your tasks. However, you need to connect a lot of pieces to
make sense of the flow.
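
For example, a rough sketch of the SparkListener half; per-task host, duration and shuffle-read bytes are usually enough to spot skew. Field names below match Spark ~1.0 and may differ in other releases, so treat them as assumptions to verify:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class SkewListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    val info = taskEnd.taskInfo
    // taskMetrics can be null for failed tasks; shuffleReadMetrics is an Option.
    val shuffleRead = Option(taskEnd.taskMetrics)
      .flatMap(_.shuffleReadMetrics)
      .map(_.remoteBytesRead)
      .getOrElse(0L)
    println(s"stage=${taskEnd.stageId} task=${info.index} host=${info.host} " +
      s"durationMs=${info.duration} remoteShuffleBytes=$shuffleRead")
  }
}

// Register it on the SparkContext underlying the StreamingContext:
// ssc.sparkContext.addSparkListener(new SkewListener)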

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Jun 30, 2014 at 9:58 PM, Chen Song  wrote:

> I am new to spark streaming and wondering if spark streaming tracks
> counters (e.g., how many rows in each consumer, how many rows routed to an
> individual reduce task, etc.) in any form so I can get an idea of how data
> is skewed? I checked spark job page but don't seem to find any.
>
>
>
> --
> Chen Song
>
>


Re: Help alleviating OOM errors

2014-07-01 Thread Mayur Rustagi
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska 
wrote:

> Hi,
>
> our cluster seems to have a really hard time with OOM errors on the
> executor. Periodically we'd see a task that gets sent to a few
> executors, one would OOM, and then the job just stays active for hours
> (sometimes 30+ whereas normally it completes sub-minute).
>
> So I have a few questions:
>
> 1. Why am I seeing OOMs to begin with?
>
> I'm running with defaults for
> spark.storage.memoryFraction
> spark.shuffle.memoryFraction
>
> so my understanding is that if Spark exceeds 60% of available memory,
> data will be spilled to disk? Am I misunderstanding this? In the
> attached screenshot, I see a single stage with 2 tasks on the same
> executor -- no disk spills but OOM.
>
You need to set spark.shuffle.spill to true in the config.
As for what is causing you to OOM: it could be that you are simply doing a
sortByKey and the keys are bigger than the executor memory, causing the OOM.
Can you post the stack trace?
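
For reference, a sketch of the knobs mentioned above; the values are illustrative, not recommendations -- check the configuration page for your release (0.9.1 here).

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")           // let shuffle data spill to disk instead of OOMing
  .set("spark.shuffle.memoryFraction", "0.3")   // memory available to in-flight shuffle buffers
  .set("spark.storage.memoryFraction", "0.6")   // the ~60% reserved for cached blocks mentioned above
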

>
> 2. How can I reduce the likelihood of seeing OOMs -- I am a bit
> concerned that I don't see a spill at all so not sure if decreasing
> spark.storage.memoryFraction is what needs to be done
>


>
> 3. Why does an OOM seem to break the executor so hopelessly? I am
> seeing times upwards of 30hrs once an OOM occurs. Why is that -- the
> task *should* take under a minute, so even if the whole RDD was
> recomputed from scratch, 30hrs is very mysterious to me. Hadoop can
> process this in about 10-15 minutes, so I imagine even if the whole
> job went to disk it should still not take more than an hour
>
When an OOM occurs it could cause the RDD to spill to disk; the repeated task
may be forced to read data from disk & cause the overall slowdown, not to
mention the RDD may be sent to a different executor to be processed. Are you
seeing the slow tasks as PROCESS_LOCAL, or at least NODE_LOCAL?

>
> Any insight into this would be much appreciated.
> Running Spark 0.9.1
>


Re: Callbacks on freeing up of RDDs

2014-07-01 Thread Mayur Rustagi
A lot of the RDDs that you create in code may not even be constructed, as the
task layer is optimized in the DAG scheduler. The closest is onUnpersistRDD
in SparkListener.
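
A minimal sketch of that hook (it exists from Spark 1.0 on, as far as I know); the rddId-to-table map and the dropTempTable body are placeholders for however Grill tracks and drops its temp tables:

import org.apache.spark.scheduler.{SparkListener, SparkListenerUnpersistRDD}

class TempTableCleaner(tablesByRddId: collection.mutable.Map[Int, String])
    extends SparkListener {

  // Fires only when unpersist is actually called; context shutdown or GC of the
  // RDD reference will not trigger it, so keep a separate cleanup path for those.
  override def onUnpersistRDD(unpersistRDD: SparkListenerUnpersistRDD) {
    tablesByRddId.remove(unpersistRDD.rddId).foreach(dropTempTable)
  }

  private def dropTempTable(table: String): Unit = {
    // Placeholder: issue the DROP TABLE against Hive/HCatalog here.
    println(s"dropping temp table $table")
  }
}

// Usage sketch: remember which temp table backs which RDD, then register the listener.
// val cleaner = new TempTableCleaner(collection.mutable.Map(rdd.id -> "grill_tmp_1"))
// sc.addSparkListener(cleaner)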

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Jun 30, 2014 at 4:48 PM, Jaideep Dhok 
wrote:

> Hi all,
> I am trying to create a custom RDD class for result set of queries
> supported in InMobi Grill (http://inmobi.github.io/grill/)
>
> Each result set has a schema (similar to Hive's TableSchema) and a path in
> HDFS containing the result set data.
>
> An easy way of doing this would be to create a temp table in Hive, and use
> HCatInputFormat to create an RDD using the newAPIHadoopRDD call. I've
> already done this and it works.
>
> However, I also want to *delete* the temp table when the RDD is
> unpersisted, or when the SparkContext is gone. How could I do that in Spark?
>
> Does Spark allow users to register code to be executed when an RDD is
> freed? Something like the OutputCommitter in Hadoop?
>
> Thanks,
> Jaideep
>
> _
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify
> us immediately by responding to this email and then delete it from your
> system. The firm is neither liable for the proper and complete transmission
> of the information contained in this communication nor for any delay in its
> receipt.


Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Wanda Hawk
Got it! Ran the jar with spark-submit. Thanks!


On Wednesday, July 2, 2014 9:16 AM, Wanda Hawk  wrote:
 


I want to make some minor modifications in SparkKMeans.scala, so running the
basic example won't do.
I have also packed my code into a jar file with sbt. It completes
successfully, but when I try to run it with "java -jar myjar.jar" I get the same
error:
"Exception in thread "main" java.lang.NoClassDefFoundError: breeze/linalg/Vector
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
        at java.lang.Class.getMethod0(Class.java:2774)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
"

If "scalac -d classes/ SparkKMeans.scala" can't see my classpath, why does it 
succeeds in compiling and does not give the same error ? 
The error itself "NoClassDefFoundError" means that the files are available at 
compile time, but for some reason I cannot figure out they are not available at 
run time. Does anyone know why ?

Thank you


On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng  wrote:
 


You can use either bin/run-example or bin/spark-submit to run example
code. "scalac -d classes/ SparkKMeans.scala" doesn't recognize the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui


On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk  wrote:
> Hello,
>
> I have installed spark-1.0.0 with scala2.10.3. I have built spark with
> "sbt/sbt assembly" and added
> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> to my CLASSPATH variable.
> Then I went here
> "../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples" created a
> new directory "classes" and compiled SparkKMeans.scala with "scalac -d
> classes/ SparkKMeans.scala"
> Then I navigated to "classes" (I commented this line in the scala file :
> package org.apache.spark.examples ) and tried to run it with "java -cp .
> SparkKMeans" and I get the following error:
> "Exception in thread "main" java.lang.NoClassDefFoundError:
> breeze/linalg/Vector
>         at java.lang.Class.getDeclaredMethods0(Native Method)
>         at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>         at java.lang.Class.getMethod0(Class.java:2774)
>         at java.lang.Class.getMethod(Class.java:1663)
>         at
> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>         at
> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
> Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         ... 6 more
> "
> The jar under
> "/home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> contains the breeze/linalg/Vector* path. I even tried to unpack it and put
> it in CLASSPATH but it does not seem to pick it up.
>
>
> I am currently running java 1.8
> "java version "1.8.0_05"
> Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)"
>
> What am I doing wrong?
>