Re: s3 vfs on Mesos Slaves

2015-05-13 Thread jay vyas
Might I ask why VFS? I'm new to VFS and not sure whether or not it predates
the Hadoop file system interfaces (HCFS).

After all, Spark natively supports any HCFS by leveraging the Hadoop
FileSystem API, class loaders, and so on.

So simply putting those resources on your classpath and using the
sc.hadoopFile(...) commands should be sufficient to connect directly to S3.
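To make that concrete, here is a minimal sketch of what I mean (assuming the
S3 filesystem classes, i.e. the Hadoop s3n connector and jets3t, plus your AWS
credentials are available on the driver and executors; the bucket and prefix
are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object S3nReadSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3n-read-sketch"))

        // Credentials for the s3n:// scheme live in the Hadoop configuration.
        // This sketch assumes the standard AWS environment variables are set.
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Once the implementing jars are on the classpath, any HCFS URI works;
        // sc.textFile is just sc.hadoopFile with TextInputFormat under the hood.
        val lines = sc.textFile("s3n://my-bucket/some/prefix/")
        println(s"line count: ${lines.count()}")

        sc.stop()
      }
    }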
On May 13, 2015 12:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

 Did you happen to have a look at this? https://github.com/abashev/vfs-s3

 Thanks
 Best Regards

 On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com
 wrote:

  We have a small Mesos cluster, and these slaves need to have a VFS set up
  on them so that the slaves can pull down the data they need from S3 when
  Spark runs.

  There doesn’t seem to be any obvious way online on how to do this or how to
  easily accomplish this. Does anyone have some best practices or some ideas
  about how to accomplish this?

  An example stack trace when a job is run on the Mesos cluster…

  Any idea how to get this going? Like somehow bootstrapping Spark at run time
  or something?

  Thanks,
  Steve
 
 
  java.io.IOException: Unsupported scheme s3n for URI s3n://removed
      at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
      at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
      at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
      at com.coldlight.neuron.data.ClquetPartitionedData$Iter.init(ClquetPartitionedData.java:330)
      at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      at org.apache.spark.scheduler.Task.run(Task.scala:64)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID 1)
  java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
      at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      at org.apache.spark.scheduler.Task.run(Task.scala:64)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
  Caused by: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
      at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
      at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
      at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
      at com.coldlight.neuron.data.ClquetPartitionedData$Iter.init(ClquetPartitionedData.java:330)
      at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
      ... 8 more
 
 



Re: Keep or remove Debian packaging in Spark?

2015-02-10 Thread jay vyas
  
   and in recent conversations I didn't hear dissent to the idea of
   removing this.
  
   Is this still useful enough to fix up? All else equal I'd like to
   start to walk back some of the complexity of the build, but I
   don't know how all-else-equal it is. Certainly, it sounds like
   nobody intends these to be used to actually deploy Spark.
  
   I don't doubt it's useful to someone, but can they maintain the
   packaging logic elsewhere?
  



-- 
jay vyas


spark akka fork : is the source anywhere?

2015-01-28 Thread jay vyas
Hi Spark. Where is Akka coming from in Spark?

I see the distribution referenced is a Spark artifact... but not in the
Apache namespace.

 <akka.group>org.spark-project.akka</akka.group>
 <akka.version>2.3.4-spark</akka.version>

Clearly this is a deliberate, thought-out change (see SPARK-1812), but it's
not clear where 2.3.4-spark is coming from, or who is maintaining its
release.
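For anyone else digging: those coordinates resolve like any other Maven
artifact, so a sketch of pulling the fork into an sbt build (sbt's DSL is
Scala) looks like the following. The akka-remote_2.10 artifact name and the
Scala version are my assumptions, not something quoted from the Spark pom:

    // build.sbt (sketch only)
    // Group and version come from the pom properties quoted above;
    // the artifact name / Scala suffix are assumed.
    name := "spark-akka-fork-probe"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.spark-project.akka" % "akka-remote_2.10" % "2.3.4-spark"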

-- 
jay vyas

PS

I've had some conversations with Will Benton about this as well, and it's
clear that some modifications to Akka are needed, or else a protobuf error
occurs (which amounts to serialization incompatibilities). Hence, if one
wants to build Spark from source, the patched Akka is required (or else
manual patching needs to be done)...

15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
ActorSystem [sparkWorker] java.lang.VerifyError: class
akka.remote.WireFormats$AkkaControlMessage overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;


Re: Standardized Spark dev environment

2015-01-20 Thread jay vyas
I can comment on both... hi Will and Nate :)

1) Will's Dockerfile solution is the most simple, direct solution to the
dev environment question: it's an efficient way to build and develop Spark
environments for dev/test. It would be cool to put that Dockerfile
(and/or maybe a shell script which uses it) in the top level of Spark as
the build entry point. For total platform portability, you could wrap it in
a Vagrantfile to launch a lightweight VM, so that Windows worked equally
well.

2) However, since Nate mentioned Vagrant and Bigtop, I have to chime in :)
The Vagrant recipes in Bigtop are a nice reference deployment of how to
deploy Spark in a heterogeneous Hadoop-style environment, and tighter
integration testing with Bigtop for Spark releases would be lovely! The
Vagrant stuff uses Puppet to deploy an n-node VM- or Docker-based cluster, in
which users can easily select components (including
Spark, YARN, HBase, Hadoop, etc.) by simply editing a YAML file:
https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
As Nate said, it would be a lot of fun to get more cross-collaboration
between the Spark and Bigtop communities. Input on how we can better
integrate Spark (whether it's Spork, HBase integration, smoke tests around
the MLlib stuff, or whatever) is always welcome.






On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 How many profiles (Hadoop / Hive / Scala) would this development environment
 support?

 As many as we want. We probably want to cover a good chunk of the build
 matrix https://issues.apache.org/jira/browse/SPARK-2004 that Spark
 officially supports.

 What does this provide, concretely?

 It provides a reliable way to create a “good” Spark development
 environment. Roughly speaking, this probably should mean an environment
 that matches Jenkins, since that’s where we run “official” testing and
 builds.

 For example, Spark has to run on Java 6 and Python 2.6. When devs build and
 run Spark locally, we can make sure they’re doing it on these versions of
 the languages with a simple vagrant up.

 Nate, could you comment on how something like this would relate to the
 Bigtop effort?

 http://chapeau.freevariable.com/2014/08/jvm-test-docker.html

 Will, that’s pretty sweet. I tried something similar a few months ago as an
 experiment to try building/testing Spark within a container. Here’s the
 shell script I used https://gist.github.com/nchammas/60b04141f3b9f053faaa
 
 against the base CentOS Docker image to set up an environment ready to build
 and test Spark.

 We want to run Spark unit tests within containers on Jenkins, so it might
 make sense to develop a single Docker image that can be used both as a “dev
 environment” and as an execution container on Jenkins.

 Perhaps that’s the approach to take instead of looking into Vagrant.

 Nick

 On Tue Jan 20 2015 at 8:22:41 PM Will Benton wi...@redhat.com wrote:

 Hey Nick,
 
  I did something similar with a Docker image last summer; I haven't updated
  the images to cache the dependencies for the current Spark master, but it
  would be trivial to do so:
 
  http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
 
 
  best,
  wb
 
 
  - Original Message -
   From: Nicholas Chammas nicholas.cham...@gmail.com
   To: Spark dev list dev@spark.apache.org
   Sent: Tuesday, January 20, 2015 6:13:31 PM
   Subject: Standardized Spark dev environment
  
   What do y'all think of creating a standardized Spark development
   environment, perhaps encoded as a Vagrantfile, and publishing it under
   `dev/`?
  
   The goal would be to make it easier for new developers to get started
   with all the right configs and tools pre-installed.
  
   If we use something like Vagrant, we may even be able to make it so that
   a single Vagrantfile creates equivalent development environments across
   OS X, Linux, and Windows, without having to do much (or any) OS-specific
   work.
  
   I imagine for committers and regular contributors, this exercise may seem
   pointless, since y'all are probably already very comfortable with your
   workflow.
  
   I wonder, though, if any of you think this would be worthwhile as an
   improvement to the new Spark developer experience.
  
   Nick
  
 
 ​




-- 
jay vyas


Re: EndpointWriter : Dropping message failure ReliableDeliverySupervisor errors...

2014-12-20 Thread jay vyas
Hi folks.

In the end, I found that the problem was that I was using IP addresses
instead of hostnames.

I guess, maybe, reverse DNS is a requirement for Spark slave-to-master
communications... ?
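If anyone else trips over this, the workaround amounts to advertising
resolvable hostnames instead of raw IPs. A sketch of the driver side
(the hostnames below are placeholders, spark.driver.host is the setting I'd
check first, and your deployment may need more than this):

    import org.apache.spark.{SparkConf, SparkContext}

    object HostnameBindingSketch {
      def main(args: Array[String]): Unit = {
        // "master.docker" and "driver.docker" are placeholder hostnames;
        // the point is to advertise names the workers can resolve, not raw IPs.
        val conf = new SparkConf()
          .setAppName("hostname-binding-sketch")
          .setMaster("spark://master.docker:7077")    // hostname, not 172.17.0.12
          .set("spark.driver.host", "driver.docker")  // must resolve from the workers

        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).sum())        // trivial job to exercise the link
        sc.stop()
      }
    }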



On Fri, Dec 19, 2014 at 7:21 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 Hi Spark. I'm trying to understand the Akka debug messages when networking
 doesn't work properly. Any hints would be great on this.

 SIMPLE TESTS I RAN

 - I tried a ping; it works.
 - I tried a telnet to port 7077 of the master, from the slave; that also
   works.

 LOGS

 1) On the master I see this WARN log buried:

 ReliableDeliverySupervisor: Association with remote system
 [akka.tcp://sparkWorker@s2.docker:45477] has failed, address is now gated
 for [500] ms  Reason is: [Disassociated].

 2) I also see a periodic, repeated ERROR message:

  ERROR EndpointWriter: dropping message [class
 akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://
 sparkMaster@172.17.0.12:7077

 Any idea what these mean? From what I can tell, I can telnet from s2.docker
 to my master server.

 Any thoughts on further debugging of this would be appreciated! I'm out of
 ideas for the time being...

 --
 jay vyas




-- 
jay vyas


EndpointWriter : Dropping message failure ReliableDeliverySupervisor errors...

2014-12-19 Thread jay vyas
Hi Spark. I'm trying to understand the Akka debug messages when networking
doesn't work properly. Any hints would be great on this.

SIMPLE TESTS I RAN

- I tried a ping; it works.
- I tried a telnet to port 7077 of the master, from the slave; that also
  works.

LOGS

1) On the master I see this WARN log buried:

ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkWorker@s2.docker:45477] has failed, address is now gated
for [500] ms  Reason is: [Disassociated].

2) I also see a periodic, repeated ERROR message:

 ERROR EndpointWriter: dropping message [class
akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://
sparkMaster@172.17.0.12:7077

Any idea what these mean? From what I can tell, I can telnet from s2.docker
to my master server.

Any thoughts on further debugging of this would be appreciated! I'm out of
ideas for the time being...

-- 
jay vyas


Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread jay vyas
This is more a curiosity than an immediate problem.

Here is my question: I ran into this easily solved issue
http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
recently. The solution was to replace my class with a Scala singleton (an
object), which I guess is readily serializable.

So it's clear that Spark needs to serialize the objects which carry the
driver methods for an app in order to run... but I'm wondering: maybe there
is a way to change or update the Spark API to catch unserializable Spark
apps at compile time?
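To make the curiosity concrete, here is a minimal sketch of the pattern (not
my actual app code): both versions type-check, but only one survives at
runtime.

    import org.apache.spark.{SparkConf, SparkContext}

    class Helper {                          // plain class, NOT Serializable
      def squash(s: String): String = s.toLowerCase
    }

    object HelperSingleton {                // the "Scala singleton" fix
      def squash(s: String): String = s.toLowerCase
    }

    object SerializationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("ser-demo").setMaster("local[2]"))
        val words = sc.parallelize(Seq("Foo", "Bar"))

        // Compiles fine, but fails at runtime with "Task not serializable"
        // because the closure captures the Helper instance:
        // val h = new Helper
        // words.map(h.squash).collect()

        // Compiles fine and runs fine; the compiler can't tell the difference:
        println(words.map(HelperSingleton.squash).collect().toSeq)

        sc.stop()
      }
    }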


-- 
jay vyas


Re: best IDE for scala + spark development?

2014-10-26 Thread Jay Vyas
I tried the Scala Eclipse IDE, but with Scala 2.10 I ran into some weird issues
http://stackoverflow.com/questions/24253084/scalaide-and-cryptic-classnotfound-errors
... so I switched to IntelliJ and was much more satisfied.

I've written a post on how I use Fedora, sbt, and IntelliJ for Spark apps:
http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html?m=1

The IntelliJ sbt plugin is, IMO, less buggy than the Eclipse Scala IDE stuff.
For example, I found I had to set some special preferences.

Finally... given sbt's automated recompile option, if you just use tmux and
vim with NERDTree alongside sbt, you could come pretty close to something
like an IDE without all the drama.

 On Oct 26, 2014, at 11:07 AM, ll duy.huynh@gmail.com wrote:
 
 I'm new to both Scala and Spark. What IDE / dev environment do you find most
 productive for writing code in Scala with Spark? Is it just vim + sbt, or
 does a full IDE like IntelliJ work out better? Thanks!
 
 
 
 



OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread jay vyas
Hi Spark.

I've been trying to build Spark, but I've been getting lots of OOME
exceptions:

https://gist.github.com/jayunit100/d424b6b825ce8517d68c

For the most part, they are of the form:

java.lang.OutOfMemoryError: unable to create new native thread

I've attempted to hard-code the get_mem_opts function, which is in the
sbt-launch-lib.bash file, to have various very high parameter sizes (i.e.
-Xms5g) with a high MaxPermSize, etc., to no avail.

Any thoughts on this would be appreciated.

I know of others having the same problem as well.

Thanks!

-- 
jay vyas


Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Jay Vyas
Thanks...! Some questions below.

1) You are suggesting that maybe this OOME is a symptom/red herring, and the
true cause of it is that a thread can't spawn because of ulimit... If so,
possibly this could be flagged early on in the build (a quick probe for the
limit itself is sketched below, after the snippet). And where are so many
threads coming from that I need to up my limit? Is this a new feature added
to Spark recently, and if so, will it affect deployment scenarios as well?

And

2) Possibly SBT_OPTS is where the memory settings should be? If so, then why
do we have the get_mem_opts wrapper function coded to send memory manually
as Xmx/Xms options?
  execRunner "$java_cmd" \
    ${SBT_OPTS:-$default_sbt_opts} \
    $(get_mem_opts $sbt_mem) \
    ${java_opts} \
    "${java_args[@]}" \
    -jar "$sbt_jar" \
    "${sbt_commands[@]}" \
    "${residual_args[@]}"
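And back on point 1: if it helps anyone reproduce this outside of sbt, a
throwaway sketch like the one below (my own illustration, nothing from the
Spark build) hits the same "unable to create new native thread" error once
the per-user process limit (ulimit -u) or native stack memory is exhausted.
Run it somewhere disposable.

    object ThreadLimitProbe {
      def main(args: Array[String]): Unit = {
        var started = 0
        try {
          while (true) {
            val t = new Thread(new Runnable {
              // Park forever so every started thread stays alive and counts
              // against the OS per-user limit.
              def run(): Unit = Thread.sleep(Long.MaxValue)
            })
            t.setDaemon(true)
            t.start()
            started += 1
          }
        } catch {
          case e: OutOfMemoryError =>
            // Typically "unable to create new native thread" once ulimit -u
            // (or native stack space) runs out.
            println(s"Thread creation failed after $started threads: ${e.getMessage}")
        }
      }
    }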



 On Aug 26, 2014, at 8:58 PM, Mubarak Seyed spark.devu...@gmail.com wrote:
 
 What is your ulimit value?
 
 
 On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com 
 wrote:
 Hi Spark.

 I've been trying to build Spark, but I've been getting lots of OOME
 exceptions:

 https://gist.github.com/jayunit100/d424b6b825ce8517d68c

 For the most part, they are of the form:

 java.lang.OutOfMemoryError: unable to create new native thread

 I've attempted to hard-code the get_mem_opts function, which is in the
 sbt-launch-lib.bash file, to have various very high parameter sizes (i.e.
 -Xms5g) with a high MaxPermSize, etc., to no avail.

 Any thoughts on this would be appreciated.

 I know of others having the same problem as well.

 Thanks!
 
 --
 jay vyas