Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
You are quite right. I am getting this error while profiling my module to
see what the minimum resources are that I can use and still meet my SLA.
My point is that if a resource constraint creates this problem, then the
issue is just waiting to happen in a larger scenario (though the probability
of it happening will be lower).

I hope to get some guidance as to which parameters I can use in order to
avoid this issue entirely.
I am guessing spark.shuffle.io.preferDirectBufs = false, but I am not sure.
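For reference, this is what I am considering putting in spark-defaults.conf (a
sketch of my guess above plus a longer network timeout; I am not sure these are
the right knobs):

  # Guess from above: avoid direct (off-heap) buffers in the shuffle transfer service
  spark.shuffle.io.preferDirectBufs   false
  # Give RPCs such as the BlockManager remove call more headroom than the default 120s
  spark.network.timeout               300s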

..Manas

On Tue, Mar 15, 2016 at 2:30 PM, Iain Cundy <iain.cu...@amdocs.com> wrote:

> Hi Manas
>
>
>
> I saw a very similar problem while using mapWithState. Timeout on
> BlockManager remove leading to a stall.
>
>
>
> In my case it only occurred when there was a big backlog of micro-batches,
> combined with a shortage of memory. The adding and removing of blocks
> between new and old tasks was interleaved. I don’t really know what caused
> it. Once I fixed the problems that were causing the backlog – in my case
> state compaction not working with Kryo in 1.6.0 (with the Kryo workaround
> rather than the patch) – I’ve never seen it again.
>
>
>
> So if you’ve got a backlog or other issue to fix, maybe you’ll get lucky
> too. :)
>
>
>
> Cheers
>
> Iain
>
>
>
> *From:* manas kar [mailto:poorinsp...@gmail.com]
> *Sent:* 15 March 2016 14:49
> *To:* Ted Yu
> *Cc:* user
> *Subject:* [MARKETING] Re: mapwithstate Hangs with Error cleaning
> broadcast
>
>
>
> I am using Spark 1.6.
>
> I am not using any broadcast variables myself.
>
> This broadcast variable is probably used internally by the state management
> of mapWithState.
>
>
>
> ...Manas
>
>
>
> On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Which version of Spark are you using ?
>
>
>
> Can you show the code snippet w.r.t. broadcast variable ?
>
>
>
> Thanks
>
>
>
> On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar <poorinsp...@gmail.com>
> wrote:
>
> Hi,
>  I have a streaming application that takes data from a Kafka topic and uses
> mapWithState.
>  After a couple of hours of smooth running, I see a problem that seems to
> have stalled my application.
> The batch appears to be stuck after the following error popped up.
> Has anyone seen this error or know what causes it?
> 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
> cleaning broadcast 7456
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at
>
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
> at
>
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
> at
>
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
> at
>
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
> at
> org.apache.spark.ContextCleaner.org
> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
> at
> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [120 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.resu

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
I am using Spark 1.6.
I am not using any broadcast variables myself.
This broadcast variable is probably used internally by the state management of
mapWithState.

...Manas

On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu  wrote:

> Which version of Spark are you using ?
>
> Can you show the code snippet w.r.t. broadcast variable ?
>
> Thanks
>
> On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar 
> wrote:
>
>> Hi,
>>  I have a streaming application that takes data from a Kafka topic and
>> uses mapWithState.
>>  After a couple of hours of smooth running, I see a problem that seems to
>> have stalled my application.
>> The batch appears to be stuck after the following error popped up.
>> Has anyone seen this error or know what causes it?
>> 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
>> cleaning broadcast 7456
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>> at
>> org.apache.spark.rpc.RpcTimeout.org
>> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>> at
>>
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>> at
>>
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at
>>
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>> at
>>
>> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
>> at
>>
>> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
>> at
>>
>> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>> at
>>
>> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
>> at
>>
>> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
>> at
>>
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
>> at
>>
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
>> at scala.Option.foreach(Option.scala:236)
>> at
>>
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
>> at
>> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
>> at
>> org.apache.spark.ContextCleaner.org
>> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
>> at
>> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [120 seconds]
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>> at
>>
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:107)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>> ... 12 more
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


How to handle Option[Int] in dataframe

2015-11-02 Thread manas kar
Hi,
 I have a case class with many columns that are Option[Int],
Option[Array[Byte]] and the like.
 I would like to save it to a Parquet file and later read it back into my case
class.
 I found that when an Option[Int] field is null, reading the value back gives
me 0 rather than a None.
 My question:
 Is there a way to get an Option[Int] from a row instead of an Int from a
DataFrame?

...Manas

Some more description

/*My case class*/
case class Student(name: String, age: Option[Int])

val s = Student("Manas", Some(35))
val s1 = Student("Manas1", None)
val student = sc.makeRDD(List(s, s1)).toDF

/*Now writing the dataframe*/
student.write.parquet("/tmp/t1")

/*Lets read it back*/
val st1 = sqlContext.read.parquet("/tmp/t1")
st1.show

+------+----+
|  name| age|
+------+----+
| Manas|  35|
|Manas1|null|
+------+----+

But now I want to turn my DataFrame back into Student objects. What is the
easiest way to do it?
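For what it is worth, this is what I am doing today (a hand-rolled sketch that
assumes the column order shown above); I am hoping there is something simpler:

/* Rebuild the case class by hand, wrapping the nullable age column back
   into an Option (column positions assumed from the schema above). */
val students = st1.map { row =>
  Student(
    row.getString(0),
    if (row.isNullAt(1)) None else Some(row.getInt(1)))
}
students.collect().foreach(println)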

..Manas


newAPIHadoopRDD file name

2015-04-18 Thread Manas Kar
I would like to get the file name along with the associated objects so that
I can do further mapping on it.

My code below gives me (AvroKey[myObject], NullWritable) pairs, but I don't
know how to get the file that produced those objects.

 sc.newAPIHadoopRDD(job.getConfiguration,
classOf[AvroKeyInputFormat[myObject]],
classOf[AvroKey[myObject]],
classOf[NullWritable])

Basically I would like to end up with tuples of
(fileName, (AvroKey[myObject], NullWritable)). One approach I am experimenting
with is sketched below, but I am not sure it is the recommended way.
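The sketch (it drops down to the developer API NewHadoopRDD and casts, which
feels fragile; job.getConfiguration and myObject are from my code above):

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.spark.rdd.NewHadoopRDD

// The RDD returned by newAPIHadoopRDD is (at runtime) a NewHadoopRDD, so the
// cast below exposes mapPartitionsWithInputSplit, which hands us the split.
val avroRdd = sc.newAPIHadoopRDD(job.getConfiguration,
    classOf[AvroKeyInputFormat[myObject]],
    classOf[AvroKey[myObject]],
    classOf[NullWritable])
  .asInstanceOf[NewHadoopRDD[AvroKey[myObject], NullWritable]]

val withFileName = avroRdd.mapPartitionsWithInputSplit { (split, iter) =>
  val fileName = split.asInstanceOf[FileSplit].getPath.getName
  iter.map { case (key, value) => (fileName, (key, value)) }
}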

Any help is appreciated.

.Manas


Re: Spark unit test fails

2015-04-06 Thread Manas Kar
Trying to bump up this question.
Can someone point to an example on GitHub?

..Manas

On Fri, Apr 3, 2015 at 9:39 AM, manasdebashiskar manasdebashis...@gmail.com
 wrote:

 Hi experts,
  I am trying to write unit tests for my Spark application, but they fail with
 a javax.servlet.FilterRegistration error.

 I am using CDH5.3.2 Spark and below is my dependency list.
 val spark           = "1.2.0-cdh5.3.2"
 val esriGeometryAPI = "1.2"
 val csvWriter       = "1.0.0"
 val hadoopClient    = "2.3.0"
 val scalaTest       = "2.2.1"
 val jodaTime        = "1.6.0"
 val scalajHTTP      = "1.0.1"
 val avro            = "1.7.7"
 val scopt           = "3.2.0"
 val config          = "1.2.1"
 val jobserver       = "0.4.1"
 val excludeJBossNetty     = ExclusionRule(organization = "org.jboss.netty")
 val excludeIONetty        = ExclusionRule(organization = "io.netty")
 val excludeEclipseJetty   = ExclusionRule(organization = "org.eclipse.jetty")
 val excludeMortbayJetty   = ExclusionRule(organization = "org.mortbay.jetty")
 val excludeAsm            = ExclusionRule(organization = "org.ow2.asm")
 val excludeOldAsm         = ExclusionRule(organization = "asm")
 val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
 val excludeSLF4J          = ExclusionRule(organization = "org.slf4j")
 val excludeScalap         = ExclusionRule(organization = "org.scala-lang", artifact = "scalap")
 val excludeHadoop         = ExclusionRule(organization = "org.apache.hadoop")
 val excludeCurator        = ExclusionRule(organization = "org.apache.curator")
 val excludePowermock      = ExclusionRule(organization = "org.powermock")
 val excludeFastutil       = ExclusionRule(organization = "it.unimi.dsi")
 val excludeJruby          = ExclusionRule(organization = "org.jruby")
 val excludeThrift         = ExclusionRule(organization = "org.apache.thrift")
 val excludeServletApi     = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
 val excludeJUnit          = ExclusionRule(organization = "junit")

 I found the link (
 http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749
 ) discussing the issue and a workaround for it.
 But that workaround does not get rid of the problem for me.
 I am using an SBT build which can't be changed to Maven.

 What am I missing?


 Stack trace
 -
 [info] FiltersRDDSpec:
 [info] - Spark Filter *** FAILED ***
 [info]   java.lang.SecurityException: class
 javax.servlet.FilterRegistration's signer information does not match
 signer information of other classes in the same package
 [info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
 [info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
 [info]   at java.lang.ClassLoader.defineClass(Unknown Source)
 [info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
 [info]   at java.net.URLClassLoader.defineClass(Unknown Source)
 [info]   at java.net.URLClassLoader.access$100(Unknown Source)
 [info]   at java.net.URLClassLoader$1.run(Unknown Source)
 [info]   at java.net.URLClassLoader$1.run(Unknown Source)
 [info]   at java.security.AccessController.doPrivileged(Native Method)
 [info]   at java.net.URLClassLoader.findClass(Unknown Source)

 Thanks
 Manas
  Manas Kar

 --
 View this message in context: Spark unit test fails
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-unit-test-fails-tp22368.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Spark unit test fails

2015-04-03 Thread Manas Kar
Hi experts,
 I am trying to write unit tests for my Spark application, but they fail with
a javax.servlet.FilterRegistration error.

I am using CDH5.3.2 Spark and below is my dependency list.
val spark           = "1.2.0-cdh5.3.2"
val esriGeometryAPI = "1.2"
val csvWriter       = "1.0.0"
val hadoopClient    = "2.3.0"
val scalaTest       = "2.2.1"
val jodaTime        = "1.6.0"
val scalajHTTP      = "1.0.1"
val avro            = "1.7.7"
val scopt           = "3.2.0"
val config          = "1.2.1"
val jobserver       = "0.4.1"
val excludeJBossNetty     = ExclusionRule(organization = "org.jboss.netty")
val excludeIONetty        = ExclusionRule(organization = "io.netty")
val excludeEclipseJetty   = ExclusionRule(organization = "org.eclipse.jetty")
val excludeMortbayJetty   = ExclusionRule(organization = "org.mortbay.jetty")
val excludeAsm            = ExclusionRule(organization = "org.ow2.asm")
val excludeOldAsm         = ExclusionRule(organization = "asm")
val excludeCommonsLogging = ExclusionRule(organization = "commons-logging")
val excludeSLF4J          = ExclusionRule(organization = "org.slf4j")
val excludeScalap         = ExclusionRule(organization = "org.scala-lang", artifact = "scalap")
val excludeHadoop         = ExclusionRule(organization = "org.apache.hadoop")
val excludeCurator        = ExclusionRule(organization = "org.apache.curator")
val excludePowermock      = ExclusionRule(organization = "org.powermock")
val excludeFastutil       = ExclusionRule(organization = "it.unimi.dsi")
val excludeJruby          = ExclusionRule(organization = "org.jruby")
val excludeThrift         = ExclusionRule(organization = "org.apache.thrift")
val excludeServletApi     = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeJUnit          = ExclusionRule(organization = "junit")

I found the link (
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-SecurityException-when-running-tests-with-Spark-1-0-0-td6747.html#a6749
) discussing the issue and a workaround for it.
But that workaround does not get rid of the problem for me.
I am using an SBT build which can't be changed to Maven.

What am I missing?
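For reference, this is roughly how I wire those exclusion rules into the Spark
and Hadoop dependencies (a simplified sketch of my build, not the full file):

// Simplified sketch: the rules above applied to the heavyweight dependencies
// so that only one servlet API / Jetty ends up on the test classpath.
val commonExclusions = Seq(
  excludeServletApi, excludeEclipseJetty, excludeMortbayJetty,
  excludeIONetty, excludeJBossNetty)

libraryDependencies ++= Seq(
  ("org.apache.spark"  %% "spark-core"    % spark        % "provided")
    .excludeAll(commonExclusions: _*),
  ("org.apache.hadoop" %  "hadoop-client" % hadoopClient % "provided")
    .excludeAll(commonExclusions: _*),
  "org.scalatest"      %% "scalatest"     % scalaTest    % "test")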


Stack trace
-
[info] FiltersRDDSpec:
[info] - Spark Filter *** FAILED ***
[info]   java.lang.SecurityException: class
javax.servlet.FilterRegistration's signer information does not match
signer information of other classes in the same package
[info]   at java.lang.ClassLoader.checkCerts(Unknown Source)
[info]   at java.lang.ClassLoader.preDefineClass(Unknown Source)
[info]   at java.lang.ClassLoader.defineClass(Unknown Source)
[info]   at java.security.SecureClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.defineClass(Unknown Source)
[info]   at java.net.URLClassLoader.access$100(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.net.URLClassLoader$1.run(Unknown Source)
[info]   at java.security.AccessController.doPrivileged(Native Method)
[info]   at java.net.URLClassLoader.findClass(Unknown Source)

Thanks
Manas


Re: Cannot run spark-shell command not found.

2015-03-30 Thread Manas Kar
If you are only interested in getting hands-on with Spark, and not in
building it against a specific version of Hadoop, use one of the bundle
providers like Cloudera.

It will give you a very easy way to install and monitor your services. (I
find installing via Cloudera Manager,
http://www.cloudera.com/content/cloudera/en/downloads.html, super easy.)
They are currently on Spark 1.2.

..Manas

On Mon, Mar 30, 2015 at 1:34 PM, vance46 wang2...@purdue.edu wrote:

 Hi all,

 I'm a newbie trying to set up Spark for my research project on a RedHat
 system. I've downloaded spark-1.3.0.tgz and untarred it, and installed
 Python, Java and Scala. I've set JAVA_HOME and SCALA_HOME and then tried to
 run "sudo sbt/sbt assembly" according to
 https://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_CentOS6.
 It popped up with "sbt command not found". I then tried to start spark-shell
 in ./bin directly using "sudo ./bin/spark-shell" and still got "command not
 found". I appreciate your help in advance.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-spark-shell-command-not-found-tp22299.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: PairRDD serialization exception

2015-03-11 Thread Manas Kar
Hi Sean,
Below are the sbt dependencies that I am using.

I gave it another try after removing the "provided" keyword, and it failed with
the same error.
What confuses me is that the stack trace appears after a few of the stages
have already run to completion.

  object V {
    val spark           = "1.2.0-cdh5.3.0"
    val esriGeometryAPI = "1.2"
    val csvWriter       = "1.0.0"
    val hadoopClient    = "2.5.0"
    val scalaTest       = "2.2.1"
    val jodaTime        = "1.6.0"
    val scalajHTTP      = "1.0.1"
    val avro            = "1.7.7"
    val scopt           = "3.2.0"
    val breeze          = "0.8.1"
    val config          = "1.2.1"
  }
  object Libraries {
    val EEAMessage      = "com.waterloopublic" %% "eeaformat" % "1.0-SNAPSHOT"
    val avro            = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2"
    val spark           = "org.apache.spark" % "spark-core_2.10" % V.spark % "provided"
    val hadoopClient    = "org.apache.hadoop" % "hadoop-client" % V.hadoopClient % "provided"
    val esriGeometryAPI = "com.esri.geometry" % "esri-geometry-api" % V.esriGeometryAPI
    val scalaTest       = "org.scalatest" %% "scalatest" % V.scalaTest % "test"
    val csvWriter       = "com.github.tototoshi" %% "scala-csv" % V.csvWriter
    val jodaTime        = "com.github.nscala-time" %% "nscala-time" % V.jodaTime % "provided"
    val scalajHTTP      = "org.scalaj" %% "scalaj-http" % V.scalajHTTP
    val scopt           = "com.github.scopt" %% "scopt" % V.scopt
    val breeze          = "org.scalanlp" %% "breeze" % V.breeze
    val breezeNatives   = "org.scalanlp" %% "breeze-natives" % V.breeze
    val config          = "com.typesafe" % "config" % V.config
  }

There are only a few more things left to try (like reverting to Spark 1.1)
before I run out of ideas completely.
Please share your insights.

..Manas

On Wed, Mar 11, 2015 at 9:44 AM, Sean Owen so...@cloudera.com wrote:

 This usually means you are mixing different versions of code. Here it
 is complaining about a Spark class. Are you sure you built vs the
 exact same Spark binaries, and are not including them in your app?

 On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar
 manasdebashis...@gmail.com wrote:
  (This is a repost. Maybe a simpler subject will fetch more attention among
  experts.)
 
  Hi,
   I have a CDH 5.3.2 (Spark 1.2) cluster.
   I am getting a "local class incompatible" exception from my Spark
  application during an action.
  All my classes are case classes (to the best of my knowledge).
 
  Appreciate any help.
 
  Exception in thread main org.apache.spark.SparkException: Job aborted
 due
  to stage failure: Task 0 in stage 3.0 failed 4 times, most recent
 failure:
  Lost task 0.3 in stage 3.0 (TID 346, datanode02):
  java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions;
 local
  class incompatible:stream classdesc serialVersionUID =
 8789839749593513237,
  local class serialVersionUID = -4145741279224749316
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
  at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
  at
 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
 
 
  Thanks
  Manas
  Manas Kar
 
  
  View this message in context: PairRDD serialization exception
  Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Joining data using Latitude, Longitude

2015-03-11 Thread Manas Kar
There are a few techniques currently available.
GeoMesa, which uses GeoHash, can also prove useful (
https://github.com/locationtech/geomesa).

Another potential candidate is
https://github.com/Esri/gis-tools-for-hadoop, especially
https://github.com/Esri/geometry-api-java for inner customization.

If you want to answer questions like "what is near me?", these are the basic
steps (a rough sketch follows below):
1) Index your geometry data, for example with an R-tree or a geohash.
2) Write your joiner logic so that it takes advantage of the index to give
you faster access.
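A minimal sketch of the idea in plain Spark, with geohashing replaced by a
simple lat/lon grid cell for brevity (the cell size, the handling of adjacent
cells and the names here are illustrative, not a ready-made solution):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Place(id: String, lat: Double, lon: Double)

// Bucket key: trim coordinates onto a coarse grid (~0.1 degree cells here).
def cell(lat: Double, lon: Double): (Int, Int) =
  (math.floor(lat * 10).toInt, math.floor(lon * 10).toInt)

// Great-circle distance in km, used to filter the candidate pairs.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * 6371.0 * math.asin(math.sqrt(a))
}

// Join only places that fall in the same cell, then keep the truly close
// pairs. A complete version would also check the 8 neighbouring cells (the
// border problem mentioned further down in this thread).
def nearbyPairs(left: RDD[Place], right: RDD[Place], maxKm: Double): RDD[(Place, Place)] =
  left.map(p => (cell(p.lat, p.lon), p))
    .join(right.map(p => (cell(p.lat, p.lon), p)))
    .values
    .filter { case (a, b) => haversineKm(a.lat, a.lon, b.lat, b.lon) <= maxKm }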

Thanks
Manas


On Wed, Mar 11, 2015 at 5:55 AM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:

 Ted Dunning and Ellen Friedman's Time Series Databases has a section on
 this with some approaches to geo-encoding:

 https://www.mapr.com/time-series-databases-new-ways-store-and-access-data
 http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf

 On Tue, Mar 10, 2015 at 3:53 PM, John Meehan jnmee...@gmail.com wrote:

 There are some techniques you can use if you geohash
 http://en.wikipedia.org/wiki/Geohash the lat-lngs.  They will
 naturally be sorted by proximity (with some edge cases so watch out).  If
 you go the join route, either by trimming the lat-lngs or geohashing them,
 you’re essentially grouping nearby locations into buckets — but you have to
 consider the borders of the buckets since the nearest location may actually
 be in an adjacent bucket.  Here’s a paper that discusses an implementation:
 http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf

 On Mar 9, 2015, at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Are you using SparkSQL for the join? In that case I'm not quite sure you
 have a lot of options to join on the nearest co-ordinate. If you are using
 the normal Spark code (by creating key-pair on lat,lon) you can apply
 certain logic like trimming the lat,lon etc. If you want more specific
 computing then you are better off using haversine formula.
 http://www.movable-type.co.uk/scripts/latlong.html






java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc

2015-03-10 Thread Manas Kar
Hi,
 I have a CDH 5.3.2 (Spark 1.2) cluster.
 I am getting a "local class incompatible" exception from my Spark
application during an action.
All my classes are case classes (to the best of my knowledge).

Appreciate any help.

Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 3.0 (TID 346, datanode02):
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local
class incompatible: stream classdesc serialVersionUID =
8789839749593513237, local class serialVersionUID = -4145741279224749316
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Thanks
Manas


Re: RDDs

2015-03-03 Thread Manas Kar
The above is a great example using threads.
Does anyone have an example using a Scala/Akka Future to do the same?
I am looking for a similar example that uses a Future and does
something if the Future times out.
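To be concrete, something like the following sketch is what I am after (plain
scala.concurrent Futures here, I would like the Akka-flavoured equivalent;
logData is the cached RDD from the quoted code below):

import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Launch two jobs on the same cached RDD from two Futures and react if
// either of them does not finish within the timeout.
val futureA = Future { logData.filter(_.contains("a")).count() }
val futureB = Future { logData.filter(_.contains("b")).count() }

try {
  val numAs = Await.result(futureA, 60.seconds)
  val numBs = Await.result(futureB, 60.seconds)
  println("Lines with a: %s, lines with b: %s".format(numAs, numBs))
} catch {
  case _: TimeoutException => println("One of the jobs did not finish in time")
}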

On Tue, Mar 3, 2015 at 7:00 AM, Kartheek.R kartheek.m...@gmail.com wrote:

 Hi TD,
 You can always run two jobs on the same cached RDD, and they can run in
 parallel (assuming you launch the 2 jobs from two different threads)

 Is this a correct way to launch jobs from two different threads?

 val threadA = new Thread(new Runnable {
   def run() {
     for (i <- 0 until end) {
       val numAs = logData.filter(line => line.contains("a"))
       println("Lines with a: %s".format(numAs.count))
     }
   }
 })

 val threadB = new Thread(new Runnable {
   def run() {
     for (i <- 0 until end) {
       val numBs = logData.filter(line => line.contains("b"))
       println("Lines with b: %s".format(numBs.count))
     }
   }
 })




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p21892.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: RDDs

2015-03-03 Thread Manas Kar
The above is a great example using threads.
Does anyone have an example using a Scala/Akka Future to do the same?
I am looking for a similar example that uses a Future and does
something if the Future times out.

On Tue, Mar 3, 2015 at 9:16 AM, Manas Kar manasdebashis...@gmail.com
wrote:

 The above is a great example using threads.
 Does anyone have an example using a Scala/Akka Future to do the same?
 I am looking for a similar example that uses a Future and does
 something if the Future times out.

 On Tue, Mar 3, 2015 at 7:00 AM, Kartheek.R kartheek.m...@gmail.com
 wrote:

 Hi TD,
 You can always run two jobs on the same cached RDD, and they can run in
 parallel (assuming you launch the 2 jobs from two different threads)

 Is this a correct way to launch jobs from two different threads?

 val threadA = new Thread(new Runnable {
   def run() {
     for (i <- 0 until end) {
       val numAs = logData.filter(line => line.contains("a"))
       println("Lines with a: %s".format(numAs.count))
     }
   }
 })

 val threadB = new Thread(new Runnable {
   def run() {
     for (i <- 0 until end) {
       val numBs = logData.filter(line => line.contains("b"))
       println("Lines with b: %s".format(numBs.count))
     }
   }
 })




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p21892.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





How to debug a Hung task

2015-02-27 Thread Manas Kar
Hi,
 I have a Spark application that hangs on just one task (the remaining 200-300
tasks complete in a reasonable time).
I can see in the thread dump which function gets stuck, however I don't
have a clue as to what input value is causing that behaviour.
Also, logging the inputs before the function is executed does not help, as
the actual message gets buried in the logs.

How does one go about debugging such a case?
Also, is there a way I can wrap my function inside some sort of timer-based
environment so that, if it takes too long, I can dump a stack trace or
something of the sort?
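To illustrate the second question, this is the kind of wrapper I have in mind
(a rough sketch with hypothetical names; here it only logs slow inputs rather
than dumping a stack trace):

// Time the suspect function per record and log the input whenever it
// exceeds a threshold, so only the problematic cases surface in the logs.
def timed[A, B](thresholdMs: Long)(f: A => B)(input: A): B = {
  val start = System.nanoTime()
  val result = f(input)
  val elapsedMs = (System.nanoTime() - start) / 1000000
  if (elapsedMs > thresholdMs)
    System.err.println("Slow input (%d ms): %s".format(elapsedMs, input))
  result
}

// Usage inside the job would be something like:
// rdd.map(timed(5000)(mySuspectFunction))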

Thanks
Manas


How to print more lines in spark-shell

2015-02-23 Thread Manas Kar
Hi experts,
 I am using Spark 1.2 from CDH 5.3.
 When I issue commands like
 myRDD.take(10), the printed result gets truncated after 4-5 records.

Is there a way to configure the shell to show more items?
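For context, the workaround I am using for now (printing the elements
explicitly instead of relying on the REPL's rendering of the returned Array);
I am still hoping there is a proper setting:

// Print each taken element on its own line, bypassing the REPL's
// truncated toString of the Array:
myRDD.take(10).foreach(println)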

..Manas


Master dies after program finishes normally

2015-02-12 Thread Manas Kar
Hi,
 I have a Hidden Markov Model running on 200 MB of data.
 Once the program finishes (i.e. all stages/jobs are done), it
hangs for 20 minutes or so before the master is killed.

In the Spark master the following log appears.

2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal
error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting
down ActorSystem [sparkMaster]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List$.newBuilder(List.scala:396)
at
scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
at
scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
at
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
at
scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
at
org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.json4s.MonadicJValue.org
$json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
at
org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
at
org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
at
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
at
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
at
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
at
org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
at
org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
at
org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
at
org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)

Can anyone help?

..Manas


Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
I have 5 workers, each with 8 GB of executor memory. My driver memory is 8
GB as well. They are all 8-core machines.

To answer Imran's question, my configuration is as follows:
executor_total_max_heapsize = 18GB
This problem happens at the end of my program.

I don't have to run a lot of jobs to see this behaviour.
I can see my output correctly in HDFS and all.
I will give it one more try after increasing the master's memory (from the
default 296 MB to 512 MB).
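For reference, the change I plan to make (assuming SPARK_DAEMON_MEMORY is the
right knob for the standalone master daemon in my setup; on CDH this may be
managed differently):

  # conf/spark-env.sh on the master node
  export SPARK_DAEMON_MEMORY=512m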

..manas

On Thu, Feb 12, 2015 at 2:14 PM, Arush Kharbanda ar...@sigmoidanalytics.com
 wrote:

 How many nodes do you have in your cluster, how many cores, what is the
 size of the memory?

 On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar manasdebashis...@gmail.com
 wrote:

 Hi Arush,
  Mine is a CDH5.3 with Spark 1.2.
 The only change to my spark programs are
 -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000.

 ..Manas

 On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 What is your cluster configuration? Did you try looking at the Web UI?
 There are many tips here

 http://spark.apache.org/docs/1.2.0/tuning.html

 Did you try these?

 On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com
 wrote:

 Hi,
  I have a Hidden Markov Model running with 200MB data.
  Once the program finishes (i.e. all stages/jobs are done) the program
 hangs for 20 minutes or so before killing master.

 In the spark master the following log appears.

 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught
 fatal error from thread [sparkMaster-akka.actor.default-dispatcher-31]
 shutting down ActorSystem [sparkMaster]
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 at scala.collection.immutable.List$.newBuilder(List.scala:396)
 at
 scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
 at
 scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
 at
 scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
 at
 scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
 at
 scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
 at
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at
 org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
 at
 org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
 at
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 at
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 at org.json4s.MonadicJValue.org
 $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
 at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
 at
 org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
 at
 org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
 at
 org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
 at
 org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
 at
 org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
 at
 org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
 at
 org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)

 Can anyone help?

 ..Manas




 --


 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com

Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
I have 5 workers, each with 8 GB of executor memory. My driver memory is 8
GB as well. They are all 8-core machines.

To answer Imran's question, my configuration is as follows:

executor_total_max_heapsize = 18GB

This problem happens at the end of my program.

I don't have to run a lot of jobs to see this behaviour.

I can see my output correctly in HDFS and all.

I will give it one more try after increasing the master's memory (from the
default 296 MB to 512 MB).


Spark 1.2 + Avro file does not work in HDP2.2

2014-12-12 Thread Manas Kar
Hi Experts,
 I have recently installed HDP 2.2 (which depends on Hadoop 2.6).
 My Spark 1.2 is built with the hadoop-2.4 profile.

 My program has the following dependencies:
val avro  = "org.apache.avro" % "avro-mapred" % "1.7.7"
val spark = "org.apache.spark" % "spark-core_2.10" % "1.2.0" % "provided"
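One thing I am considering trying (a sketch, not verified on HDP 2.2): pulling
in the hadoop2 flavour of avro-mapred explicitly, since the default artifact
seems to target the old mapred API:

// Sketch: request the Hadoop 2 classifier of avro-mapred instead of the default.
val avro = "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"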

My program to read avro files fails with the following error. What am I
doing wrong?


java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Asymmetric spark cluster memory utilization

2014-10-25 Thread Manas Kar
Hi,
 I have a Spark cluster that has 5 machines with 32 GB of memory each and 2
machines with 24 GB each.

I believe spark.executor.memory assigns the same executor memory to all
executors.

How can I use 32 GB of memory on the first 5 machines and 24 GB on the
other 2?
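Is capping each worker via its own conf/spark-env.sh the intended approach
(assuming a standalone cluster)? Something like the sketch below is what I am
wondering about:

  # conf/spark-env.sh on each of the 32 GB machines
  export SPARK_WORKER_MEMORY=28g

  # conf/spark-env.sh on each of the 24 GB machines
  export SPARK_WORKER_MEMORY=20g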

Thanks
..Manas


How to create Track per vehicle using spark RDD

2014-10-14 Thread Manas Kar
Hi,
 I have an RDD containing (vehicle number, timestamp, position).
 I want the equivalent of a lag function over my RDD, so that I can create the
track segments of each vehicle.

Any help?

PS: I have tried reduceByKey and then splitting the list of positions into
tuples (roughly the sketch below). It runs out of memory every time because of
the volume of data.
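Roughly what I tried (a simplified sketch with hypothetical field names), which
is where I suspect the memory blows up:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Ping(vehicle: String, timestamp: Long, lat: Double, lon: Double)

// Gather every position per vehicle, sort by time and pair consecutive
// points into segments. groupByKey pulls a vehicle's whole history onto
// one executor, which does not scale for my data volume.
def segments(pings: RDD[Ping]): RDD[(String, (Ping, Ping))] =
  pings.map(p => (p.vehicle, p))
    .groupByKey()
    .flatMapValues { ps =>
      val sorted = ps.toSeq.sortBy(_.timestamp)
      sorted.zip(sorted.drop(1))   // consecutive (from, to) pairs
    }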

...Manas

*For some reason I have never got any reply to my emails to the user group.
I am hoping to break that trend this time. :)*


Null values in Date field only when RDD is saved as File.

2014-10-03 Thread Manas Kar
Hi,
 I am using a library that parses AIS messages. My code, which follows the
simple steps below (sketched after the list), gives me null values in the Date
field.

1) Get the messages from a file.
2) Parse the messages.
3) Map the message RDD to keep only (Date, SomeInfo).
4) Take the top 100 elements.
Result: the Date field appears fine on the screen.

5) Save the tuple RDD (created at step 3) to HDFS using saveAsTextFile.
Result: when I check the saved file, all my Date fields are null.
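A simplified sketch of those steps (AisParser, the field names and the paths
are placeholders for the actual library and data I use):

// 1) read raw messages, 2) parse, 3) keep (Date, someInfo)
val parsed = sc.textFile("hdfs:///data/ais")
  .map(line => AisParser.parse(line))
  .map(m => (m.date, m.mmsi))

// 4) dates print fine here
parsed.take(100).foreach(println)

// 5) but in the files written here the Date fields come out null
parsed.saveAsTextFile("hdfs:///data/ais-out")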

Please guide me in the right direction.


Spark Streaming Example with CDH5

2014-06-17 Thread manas Kar
Hi Spark Gurus, 
 I am trying to compile a Spark Streaming example against CDH5 and am having
problems compiling it.
Has anyone created an example Spark Streaming project using CDH5 (preferably
Spark 0.9.1) and would be kind enough to share the build.sbt(.scala) file, or
point to their example on GitHub? I know there is a streaming example here
https://github.com/apache/spark/tree/master/examples but I am looking
for something that runs with CDH5.


My build.scala file looks like the following.

 object Dependency {
 // Versions
 object V {
 val Akka = "2.3.0"
 val scala = "2.10.4"
 val cloudera = "0.9.0-cdh5.0.0"
 }

 val sparkCore      = "org.apache.spark" %% "spark-core"      % V.cloudera
 val sparkStreaming = "org.apache.spark" %% "spark-streaming" % V.cloudera

 resolvers ++= Seq(
   "cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
   "hadoop repo" at "https://repository.cloudera.com/content/repositories/releases/")

I have also attached the complete build.scala file for the sake of completeness.
"sbt dist" gives the following error:
 object SecurityManager is not a member of package org.apache.spark
[error] import org.apache.spark.{SparkConf, SecurityManager}


build.scala
http://apache-spark-user-list.1001560.n3.nabble.com/file/n7796/build.scala  


Appreciate the great work the spark community is doing. It is by far the
best thing I have worked on.

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Example-with-CDH5-tp7796.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Shark on cloudera CDH5 error

2014-05-05 Thread manas Kar
No replies yet. I guess everyone who had this problem knew the obvious reason
why the error occurred.
It took me some time to figure out the workaround, though.

It seems Shark depends on
/var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core.jar
for client-server communication.

CDH5 should rely on hadoop-core-2.3.0-mr1-cdh5.0.0.jar instead.

1) Grab that jar from another CDH module's library (I chose Hadoop).
2) Remove the jar in
/var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core.
3) Place the jar from step 1 in the hadoop-core folder of step 2.

Hope this saves some time for someone who has a similar problem.

..Manas




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shark-on-cloudera-CDH5-error-tp5226p5374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Can I share RDD between a pyspark and spark API

2014-05-05 Thread manas Kar
Hi experts.
 I have some pre-built Python parsers that I am planning to use, simply
because I don't want to rewrite them in Scala. However, after the data is
parsed I would like to take the RDD and use it in a Scala program. (Yes, I
like Scala more than Python and am more comfortable in Scala. :)

In doing so I don't want to push the parsed data to disk and then re-read
it from the Scala class. Is there a way I can achieve what I want
efficiently?

..Manas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-I-share-RDD-between-a-pyspark-and-spark-API-tp5415.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


ETL for postgres to hadoop

2014-04-08 Thread Manas Kar
Hi All,
I have some spatial data on a Postgres machine. I want to be able
to move that data to Hadoop and do some geo-processing.
I tried using Sqoop to move the data to Hadoop, but it complained about the
position data (which it says it can't recognize).
Does anyone have any idea how to do this easily?

Thanks
Manas


   Manas Kar
Intermediate Software Developer, Product Development | exactEarth Ltd.

60 Struck Ct. Cambridge, Ontario N1R 8L2
office. +1.519.622.4445 ext. 5869 | direct: +1.519.620.5869
email. manas@exactearth.com

web. www.exactearth.com





