Re: Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Ritesh Kumar Singh
Load the text file as an RDD, something like this:
>
> > val file = sc.textFile("/path/to/file")


After this you can manipulate the RDD to filter the lines the way you want:

> > val a1 = file.filter( line => line.contains("[ERROR]") )
> > val a2 = file.filter( line => line.contains("[WARN]") )
> > val a3 = file.filter( line => line.contains("[INFO]") )


You can print the matching lines using foreach with println, like this:

> > a1.foreach(println)


You can also count the number of such lines using the count function like
this:

> > val b1 = file.filter( line => line.contains("[ERROR]") ).count()
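
If you want counts for every category in a single pass, a minimal sketch along
these lines should also work (the tag list below is just taken from the
examples above):

val tags = Seq("[ERROR]", "[WARN]", "[INFO]")
val countsByTag = file
  .flatMap(line => tags.filter(tag => line.contains(tag)).map(tag => (tag, 1)))
  .reduceByKey(_ + _)

// collect() pulls the small result back to the driver before printing
countsByTag.collect().foreach(println)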


Regards,


> Ritesh Kumar Singh,
> https://riteshtoday.wordpress.com/


Re: Unsupported major.minor version 51.0

2015-08-11 Thread Ritesh Kumar Singh
Can you please share the output of the following commands:

java -version

javac -version


Re: Local spark jars not being detected

2015-06-20 Thread Ritesh Kumar Singh
Yes, finally solved. It was right in front of my eyes the whole time.

Thanks a lot Pete.


Local spark jars not being detected

2015-06-20 Thread Ritesh Kumar Singh
Hi,

I'm using the IntelliJ IDE for my Spark project.
I've compiled Spark 1.3.0 for Scala 2.11.4, and here's one of the
compiled jars installed in my m2 folder:

~/.m2/repository/org/apache/spark/spark-core_2.11/1.3.0/spark-core_2.11-1.3.0.jar

But when I add this dependency to my pom file for the project:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_$(scala.version)</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>


I'm getting Dependency "org.apache.spark:spark-core_$(scala.version):1.3.0"
not found.
Why is this happening and what's the workaround ?
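
A guess from the error text, worth double-checking: Maven only expands
properties written as ${...}, so $(scala.version) is passed through literally,
which is exactly the artifact name showing up in the "not found" message. With
the usual syntax, and with the property resolving to the Scala binary version
(2.11, matching the installed spark-core_2.11 jar), the dependency would look
like:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.version}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>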


akka configuration not found

2015-06-15 Thread Ritesh Kumar Singh
Hi,

Though my project has nothing to do with akka, I'm getting this error :

Exception in thread "main" com.typesafe.config.ConfigException$Missing: No
configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:168)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1832)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1823)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:270)
at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at Test.main(Test.java:9)

There is no reference to akka anywhere in the code / pom file.
Any fixes?
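
For reference, a common cause of this particular error when the application is
packaged as a single fat/assembly jar is that the jars' resource files get
merged and Akka's reference.conf is overwritten, so the akka.version key
disappears. That's only an assumption about this setup, but one frequently
suggested workaround is to build the fat jar with the maven-shade-plugin and
append the reference.conf files instead, roughly like this:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- concatenate all reference.conf files so 'akka.version' survives the merge -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>reference.conf</resource>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>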

Thanks,
Ritesh


Error using spark 1.3.0 with maven

2015-06-15 Thread Ritesh Kumar Singh
Hi,

I'm getting this error while running spark as a java project using maven :

15/06/15 17:11:38 INFO SparkContext: Running Spark version 1.3.0
15/06/15 17:11:38 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/06/15 17:11:38 INFO SecurityManager: Changing view acls to: root
15/06/15 17:11:38 INFO SecurityManager: Changing modify acls to: root
15/06/15 17:11:38 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No
configuration setting found for key 'akka.version'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:168)
at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1832)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1823)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:270)
at
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at Test.main(Test.java:9)


==

My Test.java file contains the following:

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class Test {
  public static void main(String[] args) {
String logFile = "/code/data.txt";
SparkConf conf = new
SparkConf().setMaster("local[4]").setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logData = sc.textFile(logFile);

long numAs = logData.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("a"); }
}).count();

long numBs = logData.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("b"); }
}).count();

System.out.println("Lines with a: " + numAs + ", lines with b: " +
numBs);
  }
}


==

My pom file contains the following :


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>dexample</groupId>
  <artifactId>sparktest</artifactId>
  <name>Testing spark with maven</name>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <finalName>sparktest</finalName>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>Test</mainClass>
              <classpathPrefix>dependency-jars/</classpathPrefix>
            </manifest>
          </archive>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>attached</goal>
            </goals>
            <phase>package</phase>
            <configuration>
              <finalName>sparktest</finalName>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
              <archive>
                <manifest>
                  <mainClass>Test</mainClass>
                </manifest>
              </archive>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>


Re: Can't build Spark 1.3

2015-06-02 Thread Ritesh Kumar Singh
It did hang for me too; high RAM consumption during the build. I had to free a
lot of RAM and add swap space just to get it built on my 3rd attempt.
Everything else looks fine. You can also download the prebuilt versions from
the Spark homepage to save yourself all this trouble.
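
For reference, the build docs of that era suggested giving Maven more heap
before compiling, roughly as below; the exact values are quoted from memory,
so treat them as an assumption:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -DskipTests clean package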

Thanks,
Ritesh


Re: Recommended Scala version

2015-05-26 Thread Ritesh Kumar Singh
Yes, the recommended version is 2.10, as not all features are supported on the
2.11 build yet. The Kafka libraries and the JDBC component are yet to be
ported to 2.11, so if your project doesn't depend on those components you can
give 2.11 a try.

Here's a link for building with the 2.11 version.

You won't run into any issues with 2.10 as of now, but future releases will
have to shift to 2.11 once support for 2.10 ends in the long run.


On Tue, May 26, 2015 at 8:21 PM, Punyashloka Biswal 
wrote:

> Dear Spark developers and users,
>
> Am I correct in believing that the recommended version of Scala to use
> with Spark is currently 2.10? Is there any plan to switch to 2.11 in
> future? Are there any advantages to using 2.11 today?
>
> Regards,
> Punya


Re: Official Docker container for Spark

2015-05-22 Thread Ritesh Kumar Singh
Use this:
sequenceiq/docker

Here's a link to their github repo:
docker-spark 


They have repos for other big data tools too, which are again really nice.
They're being properly maintained by their devs.


Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
I found this jira <https://jira.codehaus.org/browse/MSHADE-128> when
googling for fixes. Wonder if it can fix anything here.
But anyways, thanks for the help :)
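
For the beanutils case specifically, the kind of exclusion Sean mentions would
look roughly like this (the outer coordinates are placeholders for whichever
dependency actually drags in beanutils-core):

<dependency>
  <groupId>some.group</groupId>
  <artifactId>some-artifact</artifactId>
  <version>x.y</version>
  <exclusions>
    <exclusion>
      <groupId>commons-beanutils</groupId>
      <artifactId>commons-beanutils-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>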

On Fri, Apr 10, 2015 at 2:46 AM, Sean Owen  wrote:

> I agree, but as I say, most are out of the control of Spark. They
> aren't because of unnecessary dependencies.
>
> On Thu, Apr 9, 2015 at 5:14 PM, Ritesh Kumar Singh
>  wrote:
> > Though the warnings can be ignored, they add up in the log files while
> > compiling other projects too. And there are a lot of those warnings. Any
> > workaround? How do we modify the pom.xml file to exclude these
> unnecessary
> > dependencies?
> >
> > On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:
> >>
> >> Generally, you can ignore these things. They mean some artifacts
> >> packaged other artifacts, and so two copies show up when all the JAR
> >> contents are merged.
> >>
> >> But here you do show a small dependency convergence problem; beanutils
> >> 1.7 is present but beanutills-core 1.8 is too even though these should
> >> be harmonized. I imagine one could be excluded; I imagine we could
> >> harmonize the version manually. In practice, I also imagine it doesn't
> >> cause any problem but feel free to propose a fix along those lines.
> >>
> >> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
> >>  wrote:
> >> > Hi,
> >> >
> >> > During compilation I get a lot of these:
> >> >
> >> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
> >> >  23 overlappping classes:
> >> >
> >> > [WARNING] commons-beanutils-1.7.0.jar,
> commons-beanutils-core-1.8.0.jar
> >> > define
> >> >  82 overlappping classes:
> >> >
> >> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
> >> > commons-beanutils-core-1.8.0.jar define 10 overlappping classes:
> >> >
> >> >
> >> > And a lot of others. How do I fix these?
> >
> >
>


Re: Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Though the warnings can be ignored, they add up in the log files while
compiling other projects too. And there are a lot of those warnings. Any
workaround? How do we modify the pom.xml file to exclude these unnecessary
dependencies?

On Fri, Apr 10, 2015 at 2:29 AM, Sean Owen  wrote:

> Generally, you can ignore these things. They mean some artifacts
> packaged other artifacts, and so two copies show up when all the JAR
> contents are merged.
>
> But here you do show a small dependency convergence problem; beanutils
> 1.7 is present but beanutills-core 1.8 is too even though these should
> be harmonized. I imagine one could be excluded; I imagine we could
> harmonize the version manually. In practice, I also imagine it doesn't
> cause any problem but feel free to propose a fix along those lines.
>
> On Thu, Apr 9, 2015 at 4:54 PM, Ritesh Kumar Singh
>  wrote:
> > Hi,
> >
> > During compilation I get a lot of these:
> >
> > [WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
> >  23 overlappping classes:
> >
> > [WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar
> > define
> >  82 overlappping classes:
> >
> > [WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
> > commons-beanutils-core-1.8.0.jar define 10 overlappping classes:
> >
> >
> > And a lot of others. How do I fix these?
>


Overlapping classes warnings

2015-04-09 Thread Ritesh Kumar Singh
Hi,

During compilation I get a lot of these:

[WARNING] kryo-2.21.jar, reflectasm-1.07-shaded.jar define
 23 overlappping classes:

[WARNING] commons-beanutils-1.7.0.jar, commons-beanutils-core-1.8.0.jar define
 82 overlappping classes:

[WARNING] commons-beanutils-1.7.0.jar, commons-collections-3.2.1.jar,
commons-beanutils-core-1.8.0.jar define 10 overlappping classes:


And a lot of others. How do I fix these?


Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-03 Thread Ritesh Kumar Singh
Hi,

Are there any tutorials that explain all the changelogs between Spark
0.8.0 and Spark 1.3.0, and how should we approach this migration?


Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread Ritesh Kumar Singh
Try using Breeze (a Scala linear algebra library).
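
For example, a quick sketch with Breeze's compressed sparse column matrix
(the dimensions and values below are made up):

import breeze.linalg.CSCMatrix

val builder = new CSCMatrix.Builder[Double](rows = 1000, cols = 1000)
builder.add(0, 3, 1.5)        // (row, col, value) for each non-zero entry
builder.add(42, 7, 2.0)
val sparse = builder.result() // a CSCMatrix[Double]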

On Fri, Feb 27, 2015 at 5:56 PM, shahab  wrote:

> Thanks a lot Vijay, let me see how it performs.
>
> Best
> Shahab
>
>
> On Friday, February 27, 2015, Vijay Saraswat  wrote:
>
>> Available in GML --
>>
>> http://x10-lang.org/x10-community/applications/global-matrix-library.html
>>
>> We are exploring how to make it available within Spark. Any ideas would
>> be much appreciated.
>>
>> On 2/27/15 7:01 AM, shahab wrote:
>>
>>> Hi,
>>>
>>> I just wonder if there is any Sparse Matrix implementation available  in
>>> Spark, so it can be used in spark application?
>>>
>>> best,
>>> /Shahab
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Mllib error

2014-12-10 Thread Ritesh Kumar Singh
How did you build your spark 1.1.1 ?
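
If it turns out MLlib simply isn't on the classpath, the missing piece for an
sbt build would be a dependency like the following (an assumption, since the
build definition isn't shown):

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"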

On Wed, Dec 10, 2014 at 10:41 AM, amin mohebbi 
wrote:

> I'm trying to build a very simple scala standalone app using the Mllib,
> but I get the following error when trying to bulid the program:
>
> Object mllib is not a member of package org.apache.spark
>
>
> please note I just migrated from 1.0.2 to 1.1.1
>
>
>
> Best Regards
>
> ...
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
>  at university of Malaysia
>
> Tel : +60 18 2040 017
>
>
>
> E-Mail : tp025...@ex.apiit.edu.my
>
>   amin_...@me.com
>


Re: Install Apache Spark on a Cluster

2014-12-08 Thread Ritesh Kumar Singh
On a rough note:

Step 1: Install Hadoop 2.x on all the machines in the cluster.
Step 2: Check that the Hadoop cluster is working.
Step 3: Set up Apache Spark for the cluster as described on the documentation
page.
Check the status of the cluster on the master UI.

As it is a data mining project, configure Hive too.
You can use Spark SQL or AMPLab Shark as a database store.

On Mon, Dec 8, 2014 at 11:01 PM, riginos  wrote:

> My thesis is related to big data mining and I have a cluster in the
> laboratory of my university. My task is to install apache spark on it and
> use it for extraction purposes. Is there any understandable guidance on how
> to do this ?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Install-Apache-Spark-on-a-Cluster-tp20580.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Ritesh Kumar Singh
For converting an Array or any List to an RDD, we can try using:

>sc.parallelize(groupedScore)   // or whatever the name of the list variable is
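
If the goal is specifically "top N back as an RDD", a minimal sketch (assuming
scoresRdd is an RDD[StudentScore]) would be:

val top10 = scoresRdd.top(10)(Ordering.by[StudentScore, Int](_.score))  // Array on the driver
val top10Rdd = sc.parallelize(top10)                                    // back into an RDD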

On Mon, Dec 1, 2014 at 8:14 PM, Xuefeng Wu  wrote:

> Hi, I have a problem, it is easy in Scala code, but I can not take the top
> N from RDD as RDD.
>
>
> There are 1 Student Score, ask take top 10 age, and then take top 10
> from each age, the result is 100 records.
>
> The Scala code is here, but how can I do it in RDD,  *for RDD.take return
> is Array, but other RDD.*
>
> example Scala code:
>
> import scala.util.Random
>
> case class StudentScore(age: Int, num: Int, score: Int, name: Int)
>
> val scores = for {
>   i <- 1 to 1
> } yield {
>   StudentScore(Random.nextInt(100), Random.nextInt(100), Random.nextInt(), 
> Random.nextInt())
> }
>
>
> def takeTop(scores: Seq[StudentScore], byKey: StudentScore => Int): Seq[(Int, 
> Seq[StudentScore])] = {
>   val groupedScore = scores.groupBy(byKey)
>.map{case (_, _scores) => 
> (_scores.foldLeft(0)((acc, v) => acc + v.score), _scores)}.toSeq
>   groupedScore.sortBy(_._1).take(10)
> }
>
> val topScores = for {
>   (_, ageScores) <- takeTop(scores, _.age)
>   (_, numScores) <- takeTop(ageScores, _.num)
> } yield {
>   numScores
> }
>
> topScores.size
>
>
> --
>
> ~Yours, Xuefeng Wu/吴雪峰  敬上
>
>


Re: Setting network variables in spark-shell

2014-11-30 Thread Ritesh Kumar Singh
Spark configuration settings can be found here


Hope it helps :)
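
One thing that might also work (untested, so treat it as a guess): pass the
property as a driver JVM option so SparkConf picks it up as a system property,
e.g.

spark-shell --driver-java-options "-Dspark.akka.frameSize=10000" --executor-memory 4g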

On Sun, Nov 30, 2014 at 9:55 PM, Brian Dolan 
wrote:

> Howdy Folks,
>
> What is the correct syntax in 1.0.0 to set networking variables in spark
> shell?  Specifically, I'd like to set the spark.akka.frameSize
>
> I'm attempting this:
>
> spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g
>
>
> Only to get this within the session:
>
> System.getProperty("spark.executor.memory")
> res0: String = 4g
> System.getProperty("spark.akka.frameSize")
> res1: String = null
>
>
> I don't believe I am violating protocol, but I have also posted this to
> SO:
> http://stackoverflow.com/questions/27215288/how-do-i-set-spark-akka-framesize-in-spark-shell
>
> ~~
> May All Your Sequences Converge
>
>
>
>


Re: spark-shell giving me error of unread block data

2014-11-19 Thread Ritesh Kumar Singh
As Marcelo mentioned, the issue mostly occurs when incompatible classes are
used by the executors or the driver. Check whether the output comes through in
spark-shell. If it does, then most probably there is some issue with your
configuration files. It would be helpful if you could paste the contents of
the config files you edited.

On Thu, Nov 20, 2014 at 5:45 AM, Anson Abraham 
wrote:

> Sorry meant cdh 5.2 w/ spark 1.1.
>
> On Wed, Nov 19, 2014, 17:41 Anson Abraham  wrote:
>
>> yeah CDH distribution (1.1).
>>
>> On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin 
>> wrote:
>>
>>> On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham 
>>> wrote:
>>> > yeah but in this case i'm not building any files.  just deployed out
>>> config
>>> > files in CDH5.2 and initiated a spark-shell to just read and output a
>>> file.
>>>
>>> In that case it is a little bit weird. Just to be sure, you are using
>>> CDH's version of Spark, not trying to run an Apache Spark release on
>>> top of CDH, right? (If that's the case, then we could probably move
>>> this conversation to cdh-us...@cloudera.org, since it would be
>>> CDH-specific.)
>>>
>>>
>>> > On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin 
>>> wrote:
>>> >>
>>> >> Hi Anson,
>>> >>
>>> >> We've seen this error when incompatible classes are used in the driver
>>> >> and executors (e.g., same class name, but the classes are different
>>> >> and thus the serialized data is different). This can happen for
>>> >> example if you're including some 3rd party libraries in your app's
>>> >> jar, or changing the driver/executor class paths to include these
>>> >> conflicting libraries.
>>> >>
>>> >> Can you clarify whether any of the above apply to your case?
>>> >>
>>> >> (For example, one easy way to trigger this is to add the
>>> >> spark-examples jar shipped with CDH5.2 in the classpath of your
>>> >> driver. That's one of the reasons I filed SPARK-4048, but I digress.)
>>> >>
>>> >>
>>> >> On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham <
>>> anson.abra...@gmail.com>
>>> >> wrote:
>>> >> > I'm essentially loading a file and saving output to another
>>> location:
>>> >> >
>>> >> > val source = sc.textFile("/tmp/testfile.txt")
>>> >> > source.saveAsTextFile("/tmp/testsparkoutput")
>>> >> >
>>> >> > when i do so, i'm hitting this error:
>>> >> > 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
>>> >> > :15
>>> >> > org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task
>>> >> > 0 in
>>> >> > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
>>> stage
>>> >> > 0.0
>>> >> > (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateExceptio
>>> n:
>>> >> > unread
>>> >> > block data
>>> >> >
>>> >> >
>>> >> > java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(
>>> ObjectInputStream.java:2421)
>>> >> >
>>> >> > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
>>> >> >
>>> >> > java.io.ObjectInputStream.defaultReadFields(ObjectInputStrea
>>> m.java:1990)
>>> >> >
>>> >> > java.io.ObjectInputStream.readSerialData(ObjectInputStream.
>>> java:1915)
>>> >> >
>>> >> >
>>> >> > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStre
>>> am.java:1798)
>>> >> >
>>> >> > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>> >> > java.io.ObjectInputStream.readObject(ObjectInputStream.java
>>> :370)
>>> >> >
>>> >> >
>>> >> > org.apache.spark.serializer.JavaDeserializationStream.readOb
>>> ject(JavaSerializer.scala:62)
>>> >> >
>>> >> >
>>> >> > org.apache.spark.serializer.JavaSerializerInstance.deseriali
>>> ze(JavaSerializer.scala:87)
>>> >> >
>>> >> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.
>>> scala:162)
>>> >> >
>>> >> >
>>> >> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>> Executor.java:1145)
>>> >> >
>>> >> >
>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>> lExecutor.java:615)
>>> >> > java.lang.Thread.run(Thread.java:744)
>>> >> > Driver stacktrace:
>>> >> > at
>>> >> >
>>> >> > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$sch
>>> eduler$DAGScheduler$$failJobAndIndependentStages(DAGSchedule
>>> r.scala:1185)
>>> >> > at
>>> >> >
>>> >> > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$
>>> 1.apply(DAGScheduler.scala:1174)
>>> >> > at
>>> >> >
>>> >> > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$
>>> 1.apply(DAGScheduler.scala:1173)
>>> >> > at
>>> >> >
>>> >> > scala.collection.mutable.ResizableArray$class.foreach(Resiza
>>> bleArray.scala:59)
>>> >> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.
>>> scala:47)
>>> >> > at
>>> >> >
>>> >> > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGSchedu
>>> ler.scala:1173)
>>> >> > at
>>> >> >
>>> >> > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskS
>>> etFailed$1.apply(DAGScheduler.scala:688)
>>> >> > at
>>> >> >
>>> >> > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskS
>>> etFaile

Re: spark-shell giving me error of unread block data

2014-11-18 Thread Ritesh Kumar Singh
It can be a serialization issue.
Happens when there are different versions installed on the same system.
What do you mean by the first time you installed and tested it out?

On Wed, Nov 19, 2014 at 3:29 AM, Anson Abraham 
wrote:

> I'm essentially loading a file and saving output to another location:
>
> val source = sc.textFile("/tmp/testfile.txt")
> source.saveAsTextFile("/tmp/testsparkoutput")
>
> when i do so, i'm hitting this error:
> 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
> :15
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
> unread block data
>
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
>
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Cant figure out what the issue is.  I'm running in CDH5.2 w/ version of
> spark being 1.1.  The file i'm loading is literally just 7 MB.  I thought
> it was jar files mismatch, but i did a compare and see they're all
> identical.  But seeing as how they were all installed through CDH parcels,
> not sure how there would be version mismatch on the nodes and master.  Oh
> yeah 1 master node w/ 2 worker nodes and running in standalone not through
> yarn.  So as a just in case, i copied the jars from the master to the 2
> worker nodes as just in case, and still same issue.
> Weird thing is, first time i installed and tested it out, it worked, but
> now it doesn't.
>
> Any help here would be greatly appreciated.
>


Re: Spark On Yarn Issue: Initial job has not accepted any resources

2014-11-18 Thread Ritesh Kumar Singh
Not sure how to solve this, but spotted these lines in the logs:

14/11/18 14:28:23 INFO YarnAllocationHandler: Container marked as
*failed*: container_1415961020140_0325_01_02

14/11/18 14:28:38 INFO YarnAllocationHandler: Container marked as
*failed*: container_1415961020140_0325_01_03

And the lines following them say it's trying to allocate containers with
1408 MB of memory each, but it's failing to do so. You might want to look into that.


On Tue, Nov 18, 2014 at 1:23 PM, LinCharlie  wrote:

> Hi All:
> I was submitting a spark_program.jar to `spark on yarn cluster` on a
> driver machine with yarn-client mode. Here is the spark-submit command I
> used:
>
> ./spark-submit --master yarn-client --class
> com.charlie.spark.grax.OldFollowersExample --queue dt_spark
> ~/script/spark-flume-test-0.1-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.1.jar
>
> The queue `dt_spark` was free, and the program was submitted succesfully
> and running on the cluster.  But on console, it showed repeatedly that:
>
> 14/11/18 15:11:48 WARN YarnClientClusterScheduler: Initial job has not
> accepted any resources; check your cluster UI to ensure that workers are
> registered and have sufficient memory
>
> Checked the cluster UI logs, I find no errors:
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/disk5/yarn/usercache/linqili/filecache/6957209742046754908/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/hadoop-2.0.0-cdh4.2.1/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 14/11/18 14:28:16 INFO SecurityManager: Changing view acls to: hadoop,linqili
> 14/11/18 14:28:16 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hadoop, linqili)
> 14/11/18 14:28:17 INFO Slf4jLogger: Slf4jLogger started
> 14/11/18 14:28:17 INFO Remoting: Starting remoting
> 14/11/18 14:28:17 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187]
> 14/11/18 14:28:17 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187]
> 14/11/18 14:28:17 INFO ExecutorLauncher: ApplicationAttemptId: 
> appattempt_1415961020140_0325_01
> 14/11/18 14:28:17 INFO ExecutorLauncher: Connecting to ResourceManager at 
> longzhou-hdpnn.lz.dscc/192.168.19.107:12032
> 14/11/18 14:28:17 INFO ExecutorLauncher: Registering the ApplicationMaster
> 14/11/18 14:28:18 INFO ExecutorLauncher: Waiting for spark driver to be 
> reachable.
> 14/11/18 14:28:18 INFO ExecutorLauncher: Master now available: 
> 192.168.59.90:36691
> 14/11/18 14:28:18 INFO ExecutorLauncher: Listen to driver: 
> akka.tcp://spark@192.168.59.90:36691/user/CoarseGrainedScheduler
> 14/11/18 
>  
> 14:28:18 INFO ExecutorLauncher: Allocating 1 executors.
> 14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor 
> containers with 1408 of memory each.
> 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
> containers: 1, priority = 1 , capability : memory: 1408)
> 14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor 
> containers with 1408 of memory each.
> 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
> containers: 1, priority = 1 , capability : memory: 1408)
> 14/11/18 14:28:18 INFO RackResolver: Resolved longzhou-hdp3.lz.dscc to /rack1
> 14/11/18 14:28:18 INFO YarnAllocationHandler: launching container on 
> container_1415961020140_0325_01_02 host longzhou-hdp3.lz.dscc
> 14/11/18 14:28:18 INFO ExecutorRunnable: Starting Executor Container
> 14/11/18 14:28:18 INFO ExecutorRunnable: Connecting to ContainerManager at 
> longzhou-hdp3.lz.dscc:12040
> 14/11/18 14:28:18 INFO ExecutorRunnable: Setting up ContainerLaunchContext
> 14/11/18 14:28:18 INFO ExecutorRunnable: Preparing Local resources
> 14/11/18 14:28:18 INFO ExecutorLauncher: All executors have launched.
> 14/11/18 14:28:18 INFO ExecutorLauncher: Started progress reporter thread - 
> sleep time : 5000
> 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
> containers: 0, priority = 1 , capability : memory: 1408)
> 14/11/18 14:28:18 INFO ExecutorRunnable: Prepared Local resources 
> Map(__spark__.jar -> resource {, scheme: "hdfs", host: 
> "longzhou-hdpnn.lz.dscc", port: 11000, file: 
> "/user/linqili/.sparkStaging/application_1415961020140_0325/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar",
>  }, size: 134859131, timestamp: 1416292093988, type: FILE, visibility: 
> PRIVATE, )
> 14/11/18 14:28:18 INFO ExecutorRunnable: Setting up executor with commands: 
> List($JAVA_HOME/bin/java, -server, -XX:OnO

Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread Ritesh Kumar Singh
Yeah, it works.

Although when I try to define a var of type DenseMatrix, like this:

var mat1: DenseMatrix[Double]

It gives an error saying we need to initialise the matrix mat1 at the time
of declaration.
Had to initialise it as :
var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1,1)

Anyways, it works now
Thanks for helping :)
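
For completeness, a small self-contained version of what ended up working (the
method body is just a placeholder):

import breeze.linalg.DenseMatrix

def func(str: String): DenseMatrix[Double] = {
  // placeholder: build a matrix whose row count depends on the input
  DenseMatrix.zeros[Double](str.length max 1, 1)
}

var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1, 1)
mat1 = func("hello")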

On Mon, Nov 17, 2014 at 4:56 PM, tribhuvan...@gmail.com <
tribhuvan...@gmail.com> wrote:

> This should fix it --
>
> def func(str: String): DenseMatrix*[Double]* = {
> ...
> ...
> }
>
> So, why is this required?
> Think of it like this -- If you hadn't explicitly mentioned Double, it
> might have been that the calling function expected a
> DenseMatrix[SomeOtherType], and performed a SomeOtherType-specific
> operation which may have not been supported by the returned
> DenseMatrix[Double]. (I'm also assuming that SomeOtherType has no subtype
> relations with Double).
>
> On 17 November 2014 00:14, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a method that returns DenseMatrix:
>> def func(str: String): DenseMatrix = {
>> ...
>> ...
>> }
>>
>> But I keep getting this error:
>> *class DenseMatrix takes type parameters*
>>
>> I tried this too:
>> def func(str: String): DenseMatrix(Int, Int, Array[Double]) = {
>> ...
>> ...
>> }
>> But this gives me this error:
>> *'=' expected but '(' found*
>>
>> Any possible fixes?
>>
>
>
>
> --
> *Tribhuvanesh Orekondy*
>


Re: RandomGenerator class not found exception

2014-11-17 Thread Ritesh Kumar Singh
It's still not working. Keep getting the same error.

I even deleted the commons-math3/* folder containing the jar. And then
under the directory "org/apache/commons/" made a folder called 'math3' and
put the commons-math3-3.3.jar in it.
Still it's not working.

I also tried sc.addJar("/path/to/jar") within spark-shell and in my project
source file. It still didn't import the jar in either location.

Any fixes? Please help
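
One more thing worth trying (just a guess, not verified here): ship the jar
explicitly with spark-submit so the executors get it too, e.g.

spark-submit --jars ~/.m2/repository/org/apache/commons/commons-math3/3.3/commons-math3-3.3.jar \
  --class MainClass target/scala-2.10/myproject.jar

where MainClass and the application jar path are placeholders.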

On Mon, Nov 17, 2014 at 2:14 PM, Chitturi Padma <
learnings.chitt...@gmail.com> wrote:

> Include the commons-math3/3.3 in class path while submitting jar to spark
> cluster. Like..
> spark-submit --driver-class-path maths3.3jar --class MainClass --master
> spark cluster url appjar
>
> On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User
> List] <[hidden email] <http://user/SendEmail.jtp?type=node&node=19057&i=0>
> > wrote:
>
>> My sbt file for the project includes this:
>>
>> libraryDependencies ++= Seq(
>> "org.apache.spark"  %% "spark-core"  % "1.1.0",
>> "org.apache.spark"  %% "spark-mllib" % "1.1.0",
>> "org.apache.commons" % "commons-math3" % "3.3"
>> )
>> =
>>
>> Still I am getting this error:
>>
>> >java.lang.NoClassDefFoundError:
>> org/apache/commons/math3/random/RandomGenerator
>>
>> =
>>
>> The jar at location:
>> ~/.m2/repository/org/apache/commons/commons-math3/3.3 contains the random
>> generator class:
>>
>>  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
>> >org/apache/commons/math3/random/RandomGenerator.class
>> >org/apache/commons/math3/random/UniformRandomGenerator.class
>> >org/apache/commons/math3/random/SynchronizedRandomGenerator.class
>> >org/apache/commons/math3/random/AbstractRandomGenerator.class
>> >org/apache/commons/math3/random/RandomGeneratorFactory$1.class
>> >org/apache/commons/math3/random/RandomGeneratorFactory.class
>> >org/apache/commons/math3/random/StableRandomGenerator.class
>> >org/apache/commons/math3/random/NormalizedRandomGenerator.class
>> >org/apache/commons/math3/random/JDKRandomGenerator.class
>> >org/apache/commons/math3/random/GaussianRandomGenerator.class
>>
>>
>> Please help
>>
>>
>> --
>>  If you reply to this email, your message will be added to the
>> discussion below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/RandomGenerator-class-not-found-exception-tp19055.html
>>  To start a new topic under Apache Spark User List, email [hidden email]
>> <http://user/SendEmail.jtp?type=node&node=19057&i=1>
>>
>
>
> --
> View this message in context: Re: RandomGenerator class not found
> exception
> <http://apache-spark-user-list.1001560.n3.nabble.com/RandomGenerator-class-not-found-exception-tp19055p19057.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


RandomGenerator class not found exception

2014-11-17 Thread Ritesh Kumar Singh
My sbt file for the project includes this:

libraryDependencies ++= Seq(
"org.apache.spark"  %% "spark-core"  % "1.1.0",
"org.apache.spark"  %% "spark-mllib" % "1.1.0",
"org.apache.commons" % "commons-math3" % "3.3"
)
=

Still I am getting this error:

>java.lang.NoClassDefFoundError:
org/apache/commons/math3/random/RandomGenerator

=

The jar at location: ~/.m2/repository/org/apache/commons/commons-math3/3.3
contains the random generator class:

 $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
>org/apache/commons/math3/random/RandomGenerator.class
>org/apache/commons/math3/random/UniformRandomGenerator.class
>org/apache/commons/math3/random/SynchronizedRandomGenerator.class
>org/apache/commons/math3/random/AbstractRandomGenerator.class
>org/apache/commons/math3/random/RandomGeneratorFactory$1.class
>org/apache/commons/math3/random/RandomGeneratorFactory.class
>org/apache/commons/math3/random/StableRandomGenerator.class
>org/apache/commons/math3/random/NormalizedRandomGenerator.class
>org/apache/commons/math3/random/JDKRandomGenerator.class
>org/apache/commons/math3/random/GaussianRandomGenerator.class


Please help


Returning breeze.linalg.DenseMatrix from method

2014-11-16 Thread Ritesh Kumar Singh
Hi,

I have a method that returns DenseMatrix:
def func(str: String): DenseMatrix = {
...
...
}

But I keep getting this error:
*class DenseMatrix takes type parameters*

I tried this too:
def func(str: String): DenseMatrix(Int, Int, Array[Double]) = {
...
...
}
But this gives me this error:
*'=' expected but '(' found*

Any possible fixes?


Re: How to kill a Spark job running in cluster mode ?

2014-11-12 Thread Ritesh Kumar Singh
I remember there was some issue with the above command in previous
versions of Spark. It's nice that it's working now :)

On Wed, Nov 12, 2014 at 5:50 PM, Tao Xiao  wrote:

> Thanks for your replies.
>
> Actually we can kill a driver by the command "bin/spark-class
> org.apache.spark.deploy.Client kill <master-url> <driver-id>" if you
> know the driver id.
>
> 2014-11-11 22:35 GMT+08:00 Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com>:
>
>> There is a property :
>>spark.ui.killEnabled
>> which needs to be set true for killing applications directly from the
>> webUI.
>> Check the link:
>> Kill Enable spark job
>> <http://spark.apache.org/docs/latest/configuration.html#spark-ui>
>>
>> Thanks
>>
>> On Tue, Nov 11, 2014 at 7:42 PM, Sonal Goyal 
>> wrote:
>>
>>> The web interface has a kill link. You can try using that.
>>>
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies <http://www.nubetech.co>
>>>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>>
>>>
>>> On Tue, Nov 11, 2014 at 7:28 PM, Tao Xiao 
>>> wrote:
>>>
>>>> I'm using Spark 1.0.0 and I'd like to kill a job running in cluster
>>>> mode, which means the driver is not running on local node.
>>>>
>>>> So how can I kill such a job? Is there a command like "hadoop job
>>>> -kill <jobId>" which kills a running MapReduce job ?
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>


Re: Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ritesh Kumar Singh
Never tried this form, but just guessing:

What's the output when you submit this jar:
\\shares\publish\Spark\app1\someJar.jar
using spark-submit.cmd?


Re: How to kill a Spark job running in cluster mode ?

2014-11-11 Thread Ritesh Kumar Singh
There is a property :
   spark.ui.killEnabled
which needs to be set true for killing applications directly from the webUI.
Check the link:
Kill Enable spark job
<http://spark.apache.org/docs/latest/configuration.html#spark-ui>

Thanks

On Tue, Nov 11, 2014 at 7:42 PM, Sonal Goyal  wrote:

> The web interface has a kill link. You can try using that.
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
>
> 
>
>
>
> On Tue, Nov 11, 2014 at 7:28 PM, Tao Xiao 
> wrote:
>
>> I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode,
>> which means the driver is not running on local node.
>>
>> So how can I kill such a job? Is there a command like "hadoop job -kill
>> <jobId>" which kills a running MapReduce job ?
>>
>> Thanks
>>
>
>


Re: save as file

2014-11-11 Thread Ritesh Kumar Singh
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output
to any location specified. The param to be provided is:
>path of the storage location
(the number of output part files follows the RDD's number of partitions)
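
For the JSON part, a minimal sketch of one way to do it (the case class, field
names and HDFS URI below are made up):

case class Instrument(id: Int, price: Double)

val rdd = sc.parallelize(Seq(Instrument(1, 10.5), Instrument(2, 20.0)))
rdd.map(i => s"""{"id":${i.id},"price":${i.price}}""")   // hand-rolled JSON per record
   .saveAsTextFile("hdfs://namenode:8020/user/naveen/instruments_json")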

For giving an HDFS path we use the following format:
"/user/<username>/<folder>/<filename>"

On Tue, Nov 11, 2014 at 6:28 PM, Naveen Kumar Pokala <
npok...@spcapitaliq.com> wrote:

> Hi,
>
>
>
> I am spark 1.1.0. I need a help regarding saving rdd in a JSON file?
>
>
>
> How to do that? And how to mentions hdfs path in the program.
>
>
>
>
>
> -Naveen
>
>
>
>
>


Fwd: disable log4j for spark-shell

2014-11-11 Thread Ritesh Kumar Singh
-- Forwarded message --
From: Ritesh Kumar Singh 
Date: Tue, Nov 11, 2014 at 2:18 PM
Subject: Re: disable log4j for spark-shell
To: lordjoe 
Cc: u...@spark.incubator.apache.org


Go to your Spark home, then into the conf/ directory, and edit the
log4j.properties file, i.e.:

>gedit $SPARK_HOME/conf/log4j.properties

and set the root logger to:
   log4j.rootCategory=WARN, console

You don't need to rebuild Spark for the changes to take place. Whenever you
open spark-shell, it looks into the conf directory by default and loads
all the properties.

Thanks

On Tue, Nov 11, 2014 at 6:34 AM, lordjoe  wrote:

> public static void main(String[] args) throws Exception {
>  System.out.println("Set Log to Warn");
> Logger rootLogger = Logger.getRootLogger();
> rootLogger.setLevel(Level.WARN);
> ...
>  works for me
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18535.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: disable log4j for spark-shell

2014-11-11 Thread Ritesh Kumar Singh
Go to your Spark home, then into the conf/ directory, and edit the
log4j.properties file, i.e.:

>gedit $SPARK_HOME/conf/log4j.properties

and set the root logger to:
   log4j.rootCategory=WARN, console

You don't need to rebuild Spark for the changes to take place. Whenever you
open spark-shell, it looks into the conf directory by default and loads
all the properties.

Thanks

On Tue, Nov 11, 2014 at 6:34 AM, lordjoe  wrote:

> public static void main(String[] args) throws Exception {
>  System.out.println("Set Log to Warn");
> Logger rootLogger = Logger.getRootLogger();
> rootLogger.setLevel(Level.WARN);
> ...
>  works for me
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18535.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Fwd: Executor Lost Failure

2014-11-11 Thread Ritesh Kumar Singh
Yes... found the output on the web UI of the slave.

Thanks :)
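
Side note for anyone else hitting this: to see the data at the driver instead,
pull a small sample back first, e.g.

sc.textFile("/path/to/file").take(20).foreach(println)   // prints on the driver console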

On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave  wrote:

> At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
> > Tasks are now getting submitted, but many tasks don't happen.
> > Like, after opening the spark-shell, I load a text file from disk and try
> > printing its contentsas:
> >
> >>sc.textFile("/path/to/file").foreach(println)
> >
> > It does not give me any output.
>
> That's because foreach launches tasks on the slaves. When each task tries
> to print its lines, they go to the stdout file on the slave rather than to
> your console at the driver. You should see the file's contents in each of
> the slaves' stdout files in the web UI.
>
> This only happens when running on a cluster. In local mode, all the tasks
> are running locally and can output to the driver, so foreach(println) is
> more useful.
>
> Ankur
>


Fwd: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
-- Forwarded message --
From: Ritesh Kumar Singh 
Date: Mon, Nov 10, 2014 at 10:52 PM
Subject: Re: Executor Lost Failure
To: Akhil Das 


Tasks are now getting submitted, but some of them don't seem to do anything.
For example, after opening the spark-shell, I load a text file from disk and
try printing its contents as:

>sc.textFile("/path/to/file").foreach(println)

It does not give me any output. While running this:

>sc.textFile("/path/to/file").count

gives me the right number of lines in the text file.
Not sure what the error is. But here is the output on the console for print
case:

14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(215230) called with
curMem=709528, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6 stored as values in
memory (estimated size 210.2 KB, free 441.5 MB)
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(17239) called with
curMem=924758, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6_piece0 stored as
bytes in memory (estimated size 16.8 KB, free 441.5 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory
on gonephishing.local:42648 (size: 16.8 KB, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
broadcast_6_piece0
14/11/10 22:48:02 INFO FileInputFormat: Total input paths to process : 1
14/11/10 22:48:02 INFO SparkContext: Starting job: foreach at <console>:13
14/11/10 22:48:02 INFO DAGScheduler: Got job 3 (foreach at <console>:13)
with 2 output partitions (allowLocal=false)
14/11/10 22:48:02 INFO DAGScheduler: Final stage: Stage 3(foreach at
<console>:13)
14/11/10 22:48:02 INFO DAGScheduler: Parents of final stage: List()
14/11/10 22:48:02 INFO DAGScheduler: Missing parents: List()
14/11/10 22:48:02 INFO DAGScheduler: Submitting Stage 3 (Desktop/mnd.txt
MappedRDD[7] at textFile at <console>:13), which has no missing parents
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(2504) called with
curMem=941997, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7 stored as values in
memory (estimated size 2.4 KB, free 441.4 MB)
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(1602) called with
curMem=944501, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7_piece0 stored as
bytes in memory (estimated size 1602.0 B, free 441.4 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory
on gonephishing.local:42648 (size: 1602.0 B, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
broadcast_7_piece0
14/11/10 22:48:02 INFO DAGScheduler: Submitting 2 missing tasks from Stage
3 (Desktop/mnd.txt MappedRDD[7] at textFile at <console>:13)
14/11/10 22:48:02 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
14/11/10 22:48:02 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
14/11/10 22:48:02 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
7, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory
on gonephishing.local:48857 (size: 1602.0 B, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory
on gonephishing.local:48857 (size: 16.8 KB, free: 442.3 MB)
14/11/10 22:48:02 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
6) in 308 ms on gonephishing.local (1/2)
14/11/10 22:48:02 INFO DAGScheduler: Stage 3 (foreach at <console>:13)
finished in 0.321 s
14/11/10 22:48:02 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
7) in 315 ms on gonephishing.local (2/2)
14/11/10 22:48:02 INFO SparkContext: Job finished: foreach at <console>:13,
took 0.376602079 s
14/11/10 22:48:02 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks
have all completed, from pool

===



On Mon, Nov 10, 2014 at 8:01 PM, Akhil Das 
wrote:

> ​Try adding the following configurations also, might work.
>
>  spark.rdd.compress true
>
>   spark.storage.memoryFraction 1
>   spark.core.connection.ack.wait.timeout 600
>   spark.akka.frameSize 50
>
> Thanks
> Best Regards
>
> On Mon, Nov 10, 2014 at 6:51 PM, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to submit my application using spark-submit, using following
>> spark-default.conf params:
>>
>> spark.master spark://:7077
>> spark.eventLog.enabled   true
>> spark.serializer
>> org.apache.spark.serializer.KryoSerializer
>> spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
>> -Dnumbers="one two three"
>>
>> ===
>> But every time I am getting this error:
>>
>> 14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local:
>> remote Akka cli

Re: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
On Mon, Nov 10, 2014 at 10:52 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> Tasks are now getting submitted, but many tasks don't happen.
> Like, after opening the spark-shell, I load a text file from disk and try
> printing its contents as:
>
> >sc.textFile("/path/to/file").foreach(println)
>
> It does not give me any output. While running this:
>
> >sc.textFile("/path/to/file").count
>
> gives me the right number of lines in the text file.
> Not sure what the error is. But here is the output on the console for
> print case:
>
> 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(215230) called with
> curMem=709528, maxMem=463837593
> 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6 stored as values in
> memory (estimated size 210.2 KB, free 441.5 MB)
> 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(17239) called with
> curMem=924758, maxMem=463837593
> 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6_piece0 stored as
> bytes in memory (estimated size 16.8 KB, free 441.5 MB)
> 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in
> memory on gonephishing.local:42648 (size: 16.8 KB, free: 442.3 MB)
> 14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
> broadcast_6_piece0
> 14/11/10 22:48:02 INFO FileInputFormat: Total input paths to process : 1
> 14/11/10 22:48:02 INFO SparkContext: Starting job: foreach at :13
> 14/11/10 22:48:02 INFO DAGScheduler: Got job 3 (foreach at :13)
> with 2 output partitions (allowLocal=false)
> 14/11/10 22:48:02 INFO DAGScheduler: Final stage: Stage 3(foreach at
> :13)
> 14/11/10 22:48:02 INFO DAGScheduler: Parents of final stage: List()
> 14/11/10 22:48:02 INFO DAGScheduler: Missing parents: List()
> 14/11/10 22:48:02 INFO DAGScheduler: Submitting Stage 3 (Desktop/mnd.txt
> MappedRDD[7] at textFile at :13), which has no missing parents
> 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(2504) called with
> curMem=941997, maxMem=463837593
> 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7 stored as values in
> memory (estimated size 2.4 KB, free 441.4 MB)
> 14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(1602) called with
> curMem=944501, maxMem=463837593
> 14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7_piece0 stored as
> bytes in memory (estimated size 1602.0 B, free 441.4 MB)
> 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in
> memory on gonephishing.local:42648 (size: 1602.0 B, free: 442.3 MB)
> 14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
> broadcast_7_piece0
> 14/11/10 22:48:02 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 3 (Desktop/mnd.txt MappedRDD[7] at textFile at :13)
> 14/11/10 22:48:02 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
> 14/11/10 22:48:02 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
> 6, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
> 14/11/10 22:48:02 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
> 7, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
> 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in
> memory on gonephishing.local:48857 (size: 1602.0 B, free: 442.3 MB)
> 14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in
> memory on gonephishing.local:48857 (size: 16.8 KB, free: 442.3 MB)
> 14/11/10 22:48:02 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
> 6) in 308 ms on gonephishing.local (1/2)
> 14/11/10 22:48:02 INFO DAGScheduler: Stage 3 (foreach at :13)
> finished in 0.321 s
> 14/11/10 22:48:02 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
> 7) in 315 ms on gonephishing.local (2/2)
> 14/11/10 22:48:02 INFO SparkContext: Job finished: foreach at
> :13, took 0.376602079 s
> 14/11/10 22:48:02 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks
> have all completed, from pool
>
> ===
>
>
>
> On Mon, Nov 10, 2014 at 8:01 PM, Akhil Das 
> wrote:
>
>> ​Try adding the following configurations also, might work.
>>
>>  spark.rdd.compress true
>>
>>   spark.storage.memoryFraction 1
>>   spark.core.connection.ack.wait.timeout 600
>>   spark.akka.frameSize 50
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Nov 10, 2014 at 6:51 PM, Ritesh Kumar Singh <
>> riteshoneinamill...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to submit my application using spark-submit, using following
>>> spark-default.conf params:
>>>
>>> spark.master spark://:7077
>>> spark.eventLog.enabled   true
>>> spark.serializer
>>

Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
Hi,

I am trying to submit my application using spark-submit, using the following
spark-defaults.conf params:

spark.master spark://:7077
spark.eventLog.enabled   true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
-Dnumbers="one two three"

===
But every time I am getting this error:

14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local:
remote Akka client disassociated
14/11/10 18:39:17 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:17 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:20 ERROR TaskSchedulerImpl: Lost executor 2 on aa.local:
remote Akka client disassociated
14/11/10 18:39:20 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:20 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:26 ERROR TaskSchedulerImpl: Lost executor 4 on aa.local:
remote Akka client disassociated
14/11/10 18:39:26 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:26 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:29 ERROR TaskSchedulerImpl: Lost executor 5 on aa.local:
remote Akka client disassociated
14/11/10 18:39:29 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:29 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
14/11/10 18:39:29 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6,
aa.local): ExecutorLostFailure (executor lost)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 7, gonephishing.local): ExecutorLostFailure
(executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

=
Any fixes?


Re: Removing INFO logs

2014-11-10 Thread Ritesh Kumar Singh
It works.

Thanks

On Mon, Nov 10, 2014 at 6:32 PM, YANG Fan  wrote:

> Hi,
>
> In conf/log4j.properties, change the following
>
> log4j.rootCategory=INFO, console
>
> to
>  log4j.rootCategory=WARN, console
>
> This works for me.
>
> Best,
> Fan
>
> On Mon, Nov 10, 2014 at 8:21 PM, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
>
>> How can I remove all the INFO logs that appear on the console when I
>> submit an application using spark-submit?
>>
>
>


Removing INFO logs

2014-11-10 Thread Ritesh Kumar Singh
How can I remove all the INFO logs that appear on the console when I submit
an application using spark-submit?