SparkSQL Dataframe : partitionColumn, lowerBound, upperBound, numPartitions in context of reading from MySQL

2016-03-30 Thread Soumya Simanta
I'm trying to understand what the following configurations mean and their
 implication on reading data from a MySQL table. I'm looking for options
that will impact my read throughput when reading data from a large table.


Thanks.





partitionColumn, lowerBound, upperBound, numPartitions: these options must
all be specified if any of them is specified. They describe how to
partition the table when reading in parallel from multiple workers.
partitionColumn must be a numeric column from the table in question. Notice
that lowerBound and upperBound are just used to decide the partition stride,
not for filtering the rows in the table, so all rows in the table will be
partitioned and returned.
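
For reference, a minimal sketch of how these options are typically passed
when reading over JDBC with the Spark 1.x DataFrame API (the URL, table
name, column, and bounds below are placeholders, not values from this
thread; a MySQL JDBC driver is assumed to be on the classpath):

val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:mysql://dbhost:3306/mydb",  // placeholder connection URL
  "dbtable"         -> "big_table",                      // placeholder table name
  "partitionColumn" -> "id",        // must be a numeric column
  "lowerBound"      -> "1",         // with upperBound, only defines the partition stride
  "upperBound"      -> "10000000",
  "numPartitions"   -> "10"         // number of parallel JDBC reads / resulting partitions
)).load()

Read throughput is then mostly governed by numPartitions (how many
concurrent connections MySQL sees) and by how evenly the partitionColumn
values are spread between lowerBound and upperBound.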


Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Soumya Simanta
Can you give some examples of what variables  you are trying to set ?

On Thu, Feb 18, 2016 at 1:01 AM, Lin Zhao  wrote:

> I've been trying to set some environment variables for the Spark executors
> but haven't had much luck. I tried editing conf/spark-env.sh but it
> doesn't get through to the executors. I'm running 1.6.0 and YARN; any
> pointer is appreciated.
>
> Thanks,
> Lin
>
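
For context, one approach that usually works on YARN (a sketch, not
something confirmed in this thread) is to pass executor environment
variables through Spark configuration instead of conf/spark-env.sh, via the
spark.executorEnv.* properties; MY_VAR and its value below are placeholders:

// On the command line:
//   ./bin/spark-submit --conf spark.executorEnv.MY_VAR=some_value ...
// Or programmatically:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("env-example")
  .setExecutorEnv("MY_VAR", "some_value")   // forwarded to each executor process
val sc = new SparkContext(conf)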


Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
I can do a
wget http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom

and get the file successfully on a shell.



On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas vcsub...@gmail.com wrote:

 At least a part of it is due to connection refused, can you check if curl
 can reach the URL with proxies -
 [FATAL] Non-resolvable parent POM: Could not transfer artifact
 org.apache:apache:pom:14 from/to central (
 http://repo.maven.apache.org/maven2): Error transferring file: Connection
 refused from
 http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom

 On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta soumya.sima...@gmail.com
  wrote:



 On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Does the error change when you build with and without the build options?

 What do you mean by build options? I'm just doing ./sbt/sbt assembly from
 $SPARK_HOME


 Did you try using maven? and doing the proxy settings there.


  No, I've not tried Maven yet. However, I did set proxy settings inside my
 .m2/settings.xml, but it didn't make any difference.
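
For what it's worth, one thing that often matters here (an assumption on my
part, not something verified in this thread) is that the sbt launcher JVM
needs the proxy passed as Java system properties; exporting http_proxy in
the shell is not always enough. For example, with proxy.example.com:8080 as
a placeholder proxy:

export SBT_OPTS="$SBT_OPTS -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
  -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
./sbt/sbt assembly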






Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda 
ar...@sigmoidanalytics.com wrote:

 Does the error change when you build with and without the build options?

What do you mean by build options? I'm just doing ./sbt/sbt assembly from
$SPARK_HOME


 Did you try using maven? and doing the proxy settings there.


 No, I've not tried Maven yet. However, I did set proxy settings inside my
.m2/settings.xml, but it didn't make any difference.


Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using
./sbt/sbt assembly and I get the following error. I've set the http and
https proxy as well as JAVA_OPTS. Any idea what I am missing?


[warn] one warning found
org.apache.maven.model.building.ModelBuildingException: 1 problem was
encountered while building the effective model for
org.apache.spark:spark-parent:1.1.1
[FATAL] Non-resolvable parent POM: Could not transfer artifact
org.apache:apache:pom:14 from/to central (
http://repo.maven.apache.org/maven2): Error transferring file: Connection
refused from
http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and
'parent.relativePath' points at wrong local POM @ line 21, column 11

at
org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
at
org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
at
org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
at
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
at
org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
at
com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
at
com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
at
com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
at
com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
at SparkBuild$.projectDefinitions(SparkBuild.scala:165)
at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:458)
at sbt.Load$$anonfun$24.apply(Load.scala:415)
at sbt.Load$$anonfun$24.apply(Load.scala:415)
at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
at sbt.Load$.loadUnit(Load.scala:415)
at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)
at sbt.Load$$anonfun$15$$anonfun$apply$11.apply(Load.scala:256)
at
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:93)
at
sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:92)
at sbt.BuildLoader.apply(BuildLoader.scala:143)
at sbt.Load$.loadAll(Load.scala:312)
at sbt.Load$.loadURI(Load.scala:264)
at sbt.Load$.load(Load.scala:260)
at sbt.Load$.load(Load.scala:251)
at sbt.Load$.apply(Load.scala:134)
at sbt.Load$.defaultLoad(Load.scala:37)
at sbt.BuiltinCommands$.doLoadProject(Main.scala:473)
at
sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:467)
at
sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:467)
at
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at
sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:60)
at
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at
sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:62)
at sbt.Command$.process(Command.scala:95)
at
sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:100)
at
sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:100)
at sbt.State$$anon$1.process(State.scala:179)
at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:100)
at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:100)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.MainLoop$.next(MainLoop.scala:100)
at sbt.MainLoop$.run(MainLoop.scala:93)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:71)
at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:66)
at sbt.Using.apply(Using.scala:25)
at sbt.MainLoop$.runWithNewLog(MainLoop.scala:66)
at sbt.MainLoop$.runAndClearLast(MainLoop.scala:49)
at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:33)
at sbt.MainLoop$.runLogged(MainLoop.scala:25)
at sbt.StandardMain$.runManaged(Main.scala:57)
at sbt.xMain.run(Main.scala:29)
at xsbt.boot.Launch$$anonfun$run$1.apply(Launch.scala:109)
at xsbt.boot.Launch$.withContextLoader(Launch.scala:129)
at xsbt.boot.Launch$.run(Launch.scala:109)
at xsbt.boot.Launch$$anonfun$apply$1.apply(Launch.scala:36)
at xsbt.boot.Launch$.launch(Launch.scala:117)
at xsbt.boot.Launch$.apply(Launch.scala:19)
at xsbt.boot.Boot$.runImpl(Boot.scala:44)
at xsbt.boot.Boot$.main(Boot.scala:20)
at xsbt.boot.Boot.main(Boot.scala)
[error] 

Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Soumya Simanta
I'm deploying Spark using the "Click to Deploy Hadoop - Install Apache
Spark" option on Google Compute Engine.

I can run Spark jobs on the REPL and read data from Google storage.
However, I'm not sure how to access the Spark UI in this deployment. Can
anyone help?

Also, it deploys Spark 1.1. Is there an easy way to bump it to Spark 1.2?

Thanks
-Soumya


[image: Inline image 1]

[image: Inline image 2]


Re: Sharing sqlContext between Akka router and routee actors ...

2014-12-18 Thread Soumya Simanta
Why do you need a router? I mean, can't you do this with just one actor
that has the SQLContext inside it?

On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel manojsamelt...@gmail.com
wrote:

 Hi,

 Akka router creates a sqlContext and creates a bunch of routee actors
  with the sqlContext as a parameter. The actors then execute queries on that
 sqlContext.

 Would this pattern be an issue? Is there any other way that sparkContext etc.
 should be shared cleanly between Akka routers/routees?

 Thanks,
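
For illustration, a minimal sketch of the single-actor alternative suggested
above (the message type and reply format are assumptions, not code from this
thread):

import akka.actor.{Actor, Props}
import org.apache.spark.sql.SQLContext

case class RunQuery(sql: String)

// One actor owns the shared SQLContext and serializes access to it.
class QueryActor(sqlContext: SQLContext) extends Actor {
  def receive = {
    case RunQuery(sql) =>
      sender() ! sqlContext.sql(sql).collect()   // reply with the collected rows
  }
}

// Usage: val queryActor = system.actorOf(Props(new QueryActor(sqlContext)), "query-actor")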



Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
I'm trying to understand the conceptual difference between these two
configurations in terms of performance (using a Spark standalone cluster).

Case 1:

1 Node
60 cores
240G of memory
50G of data on local file system

Case 2:

6 Nodes
10 cores per node
40G of memory per node
50G of data on HDFS
nodes are connected using a 10G network

I just wanted to validate my understanding.

1.  Reads in case 1 will be slower compared to case 2 because, in case 2
all 6 nodes can read the data in parallel from HDFS. However, if I change
the file system to HDFS in Case 1, my read speeds will be conceptually the
same as case 2. Correct ?

2. Once the data is loaded, case 1 will execute operations faster because
there is no network overhead and all shuffle operations are local.

3. Obviously, case 1 is bad from a fault tolerance point of view because we
have a single point of failure.

Thanks
-Soumya


Re: Spark or MR, Scala or Java?

2014-11-22 Thread Soumya Simanta
Thanks Sean.

adding user@spark.apache.org again.

On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote:

 On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta
 soumya.sima...@gmail.com wrote:
   Is the MapReduce API simpler or the implementation? Almost every Spark
   presentation has a slide that shows 100+ lines of Hadoop MR code in Java and
   the same feature implemented in 3 lines of Scala code on Spark. So the Spark
   API is certainly simpler, at least based on what I know. What am I missing
   here?

 The implementation is simpler. The API is not. However I don't think
 anyone 'really' uses the M/R API directly now. They use Crunch or
 maybe Cascading. These are also much less than 100 lines for word
 count, on top of M/R.

   Can you please expand on what you mean by efficient? Better performance
   and/or reliability, fewer resources, or something else?

 All of the above. Map/Reduce is simple and easy to understand, and
 Spark is actually hard to reason about, and heavy-weight. Of course,
 as soon as your work spans more than one MapReduce, this reasoning
 changes a lot. But MapReduce is better for truly map-only, or
 map-with-a-reduce-only, workloads. It is optimized for this case. The
 shuffle is still better.



Re: MongoDB Bulk Inserts

2014-11-21 Thread Soumya Simanta
Does bulkLoad have the connection to MongoDB?

On Fri, Nov 21, 2014 at 4:34 PM, Benny Thompson ben.d.tho...@gmail.com
wrote:

 I tried using RDD#mapPartitions but my job completes prematurely and
 without error as if nothing gets done.  What I have is fairly simple

 sc
 .textFile(inputFile)
 .map(parser.parse)
 .mapPartitions(bulkLoad)

 But the Iterator[T] of mapPartitions is always empty, even though I know
 map is generating records.


 On Thu Nov 20 2014 at 9:25:54 PM Soumya Simanta soumya.sima...@gmail.com
 wrote:

 On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com
 wrote:

 I'm trying to use MongoDB as a destination for an ETL I'm writing in
 Spark.  It appears I'm gaining a lot of overhead in my system databases
 (and possibly in the primary documents themselves);  I can only assume it's
 because I'm left to using PairRDD.saveAsNewAPIHadoopFile.

 - Is there a way to batch some of the data together and use Casbah
 natively so I can use bulk inserts?


 Why can't you write to Mongo in an RDD#mapPartitions?



 - Is there maybe a less hacky way to load to MongoDB (instead of
 using saveAsNewAPIHadoopFile)?


 If the latency (time by which all data should be in Mongo) is not a
 concern you can try a separate process that uses Akka/Casbah to write from
 HDFS into Mongo.






Re: MongoDB Bulk Inserts

2014-11-20 Thread Soumya Simanta
On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com
wrote:

 I'm trying to use MongoDB as a destination for an ETL I'm writing in
 Spark.  It appears I'm gaining a lot of overhead in my system databases
 (and possibly in the primary documents themselves);  I can only assume it's
 because I'm left to using PairRDD.saveAsNewAPIHadoopFile.

 - Is there a way to batch some of the data together and use Casbah
 natively so I can use bulk inserts?


Why can't you write to Mongo in an RDD#mapPartitions?



 - Is there maybe a less hacky way to load to MongoDB (instead of
 using saveAsNewAPIHadoopFile)?


If the latency (time by which all data should be in Mongo) is not a concern
you can try a separate process that uses Akka/Casbah to write from HDFS
into Mongo.
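
To make the RDD#mapPartitions suggestion above concrete, here is a rough
sketch (the saveToMongo helper and the input path are placeholders; the real
write would go through Casbah or another Mongo driver):

// One connection and one bulk write per partition; an iterator must be returned.
def saveToMongo(docs: Seq[String]): Unit = {
  // open a Mongo connection, bulk-insert docs, close the connection
}

val lines = sc.textFile("hdfs:///path/to/input")   // placeholder input

val written = lines.mapPartitions { iter =>
  val docs = iter.toVector       // materialize this partition's records
  saveToMongo(docs)              // single bulk write for the partition
  Iterator(docs.size)            // return a non-empty iterator so downstream steps see output
}

// mapPartitions is lazy: without an action such as reduce/count, nothing is
// written, which can look like the job finishing without doing any work.
println("records written: " + written.reduce(_ + _))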


Parsing a large XML file using Spark

2014-11-18 Thread Soumya Simanta
If there is one big XML file (e.g., the Wikipedia dump, 44GB, or the larger
dump that includes all revision information) that is stored in HDFS, is it
possible to parse it in parallel/faster using Spark? Or do we have to use
something like a PullParser or Iteratee?

My current solution is to read the single XML file in a first pass, write it
out as smaller files to HDFS, and then read those small files in parallel on
the Spark workers.

Thanks
-Soumya


SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not
completing
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required. This essentially means in most cases SparkSQL should be as fast
as Spark is.

I would be very interested to hear what others in the group have to say
about this.

Thanks
-Soumya


Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.

As far as I understand, SparkSQL under the hood performs many of these
optimizations (order of Spark operations) and uses a more efficient storage
format. Is this assumption correct?

Has anyone done any comparison of SparkSQL with Impala ? The fact that many
of the queries don't even finish in the benchmark is quite surprising and
hard to believe.

A few months ago there were a few emails about Spark not being able to
handle large volumes (TBs) of data. That myth was busted recently when the
folks at Databricks published their sorting record results.


Thanks
-Soumya






On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote:

   We have seen all kinds of results published that often contradict each
 other. My take is that the authors often know more tricks about how to tune
 their own/familiar products than the others. So the product on focus is
 tuned for ideal performance while the competitors are not. The authors are
 not necessarily biased but as a consequence the results are.

  Ideally it’s critical for the user community to be informed of all the
 in-depth tuning tricks of all products. However, realistically, there is a
 big gap in terms of documentation. Hope the Spark folks will make a
 difference. :-)

  Du


   From: Soumya Simanta soumya.sima...@gmail.com
 Date: Friday, October 31, 2014 at 4:04 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: SparkSQL performance

   I was really surprised to see the results here, esp. SparkSQL not
 completing
 http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

  I was under the impression that SparkSQL performs really well because it
 can optimize the RDD operations and load only the columns that are
 required. This essentially means in most cases SparkSQL should be as fast
 as Spark is.

  I would be very interested to hear what others in the group have to say
 about this.

  Thanks
 -Soumya





Re: sbt/sbt compile error [FATAL]

2014-10-29 Thread Soumya Simanta
Are you trying to compile the master branch ? Can you try branch-1.1 ?

On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS hanspeter.sl...@gmail.com
wrote:

 Hi,

 I have cloned Spark as:
 git clone g...@github.com:apache/spark.git
 cd spark
 sbt/sbt compile

 Apparently http://repo.maven.apache.org/maven2 is no longer valid.
 See the error further below.
 Is this correct?
 And what should it be changed to?


 Everything seems to go smooth until :
 [info] downloading

 https://repo1.maven.org/maven2/org/ow2/asm/asm-tree/5.0.3/asm-tree-5.0.3.jar
 ...
 [info]  [SUCCESSFUL ] org.ow2.asm#asm-tree;5.0.3!asm-tree.jar (709ms)
 [info] Done updating.
 [info] Compiling 1 Scala source to
 /root/spark/project/spark-style/target/scala-2.10/classes...
 [info] Compiling 9 Scala sources to

 /root/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/target/scala-2.10/sbt-0.13/classes...
 [warn] there were 1 deprecation warning(s); re-run with -deprecation for
 details
 [warn] one warning found
 [info] Compiling 3 Scala sources to
 /root/spark/project/target/scala-2.10/sbt-0.13/classes...
 org.apache.maven.model.building.ModelBuildingException: 1 problem was
 encountered while building the effective model for
 org.apache.spark:spark-parent:1.2.0-SNAPSHOT
 [FATAL] Non-resolvable parent POM: Could not transfer artifact
 org.apache:apache:pom:14 from/to central (
 http://repo.maven.apache.org/maven2): Error transferring file:
 repo.maven.apache.org from
 http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and
 'parent.relativePath' points at wrong local POM @ line 22, column 11


 Regards  Hans







Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once
you run the jar file (sbt-launch.jar) it can download the required
dependencies.

I use an executable script called sbt that has the following contents.

  SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=1024M"

  echo $SBT_OPTS

  java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"

On Tue, Oct 28, 2014 at 12:13 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 If you're just calling sbt from within the spark/sbt folder, it should
 download and install automatically.

 Nick


 On Tuesday, October 28, 2014, Ted Yu yuzhih...@gmail.com wrote:

 Have you read this ?
 http://lancegatlin.org/tech/centos-6-install-sbt

 On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto 
 rpagli...@appcomsci.com wrote:

 Is there a repo or some kind of instructions on how to install sbt for
 CentOS?



 Thanks,







Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
Try reducing the resources (cores and memory) of each application. 



 On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote:
 
 Hi,
 
 How do I run multiple spark applications in parallel? I tried to run on yarn 
 cluster, though the second application submitted does not run.
 
 Thanks,
 Josh




Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
Maybe changing --master yarn-cluster to --master yarn-client will help.


On Tue, Oct 28, 2014 at 7:25 PM, Josh J joshjd...@gmail.com wrote:

 Sorry, I should've included some stats with my email

 I execute each job in the following manner

 ./bin/spark-submit --class CLASSNAME --master yarn-cluster --driver-memory
 1g --executor-memory 1g --executor-cores 1 UBER.JAR
 ${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1


 The box has

 24 CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz

 32 GB RAM


 Thanks,

 Josh

 On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com
 wrote:

 Try reducing the resources (cores and memory) of each application.



  On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote:
 
  Hi,
 
  How do I run multiple spark applications in parallel? I tried to run on
 yarn cluster, though the second application submitted does not run.
 
  Thanks,
  Josh





Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread Soumya Simanta
You need to change the Scala compiler from IntelliJ to "sbt incremental
compiler" (see the screenshot below).

You can access this by going to "Preferences" -> "Scala".

NOTE: This is supported only for certain versions of the IntelliJ Scala plugin.
See this link for details:

http://blog.jetbrains.com/scala/2014/01/30/try-faster-scala-compiler-in-intellij-idea-13-0-2/

On Mon, Oct 27, 2014 at 9:04 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 guoxu1231, I struggled with the Idea problem for a full week. Same thing
 -- clean builds under MVN/Sbt, but no luck with IDEA. What worked for me
 was the solution posted higher up in this thread -- it's a SO post that
 basically says to delete all iml files anywhere under the project
 directory.

 Let me know if you can't see this mail and I'll locate the exact SO post

 On Mon, Oct 27, 2014 at 5:15 AM, guoxu1231 guoxu1...@gmail.com wrote:

 Hi Stephen,
  I tried it again.
  To avoid the profile impact, I executed mvn -DskipTests clean package with
  Hadoop 1.0.4 by default, opened IDEA and imported it as a Maven project,
  and I didn't choose any profile in the import wizard.
  Then "Make project" or "Rebuild project" in IDEA; unfortunately the
  DataTypeConversions.scala compile failed again.


  Is there an updated guide for using IntelliJ IDEA? I'm following "Building
  Spark with Maven" on the website.








Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
@Peter - as Rick said - Spark's main usage is data analysis and not
storage.

Spark allows you to plug in different storage layers based on your use cases
and quality attribute requirements. So in essence, if your relational
database is meeting your storage requirements, you should think about how to
use that with Spark, because even if you decide not to use your
relational database you will have to select some storage layer, most likely
a distributed one.

Another option to think about: can you possibly restructure your data
schema so that you don't have to do such a large number of joins? If this
is an option then you can potentially think about using stores such as
Cassandra, HBase, HDFS etc.

Spark really excels at processing large volumes of data really fast (given
enough memory) on horizontally scalable commodity hardware. As Rick pointed
out, it will probably outperform a relational star schema if all of your
*working* data set can fit into RAM on your cluster. However, if your data
size is much larger than your cluster memory you don't have a choice but to
select a datastore.

HTH

-Soumya








On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com
 wrote:

 Spark's API definitely covers all of the things that a relational database
 can do. It will probably outperform a relational star schema if all of your
 *working* data set can fit into RAM on your cluster. It will still perform
 quite well if most of the data fits and some has to spill over to disk.

 What are your requirements exactly?
 What is massive amounts of data exactly?
 How big is your cluster?

 Note that Spark is not for data storage, only data analysis. It pulls data
 into working data sets called RDD's.

 As a migration path, you could probably pull the data out of a relational
 database to analyze. But in the long run, I would recommend using a more
 purpose built, huge storage database such as Cassandra. If your data is
 very static, you could also just store it in files.
  On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote:

 My understanding is the SparkSQL allows one to access Spark data as if it
 were stored in a relational database.  It compiles SQL queries into a
 series of calls to the Spark API.

 I need the performance of a SQL database, but I don't care about doing
 queries with SQL.

 I create the input to MLib by doing a massive JOIN query.  So, I am
 creating a single collection by combining many collections.  This sort of
 operation is very inefficient in Mongo, Cassandra or HDFS.

 I could store my data in a relational database, and copy the query
 results to Spark for processing.  However, I was hoping I could keep
 everything in Spark.

 On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta 
 soumya.sima...@gmail.com wrote:

 1. What data store do you want to store your data in ? HDFS, HBase,
 Cassandra, S3 or something else?
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?

 One option is to process the data in Spark and then store it in the
 relational database of your choice.




 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:

 Hello all,

 We are considering Spark for our organization.  It is obviously a
 superb platform for processing massive amounts of data... how about
 retrieving it?

 We are currently storing our data in a relational database in a star
 schema.  Retrieving our data requires doing many complicated joins across
 many tables.

 Can we use Spark as a relational database?  Or, if not, can we put
 Spark on top of a relational database?

 Note that we don't care about SQL.  Accessing our data via standard
 queries is nice, but we are equally happy (or more happy) to write Scala
 code.

 What is important to us is doing relational queries on huge amounts of
 data.  Is Spark good at this?

 Thank you very much in advance
 Peter






Re: Spark as Relational Database

2014-10-25 Thread Soumya Simanta
1. What data store do you want to store your data in ? HDFS, HBase,
Cassandra, S3 or something else?
2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?

One option is to process the data in Spark and then store it in the
relational database of your choice.




On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:

 Hello all,

 We are considering Spark for our organization.  It is obviously a superb
 platform for processing massive amounts of data... how about retrieving it?

 We are currently storing our data in a relational database in a star
 schema.  Retrieving our data requires doing many complicated joins across
 many tables.

 Can we use Spark as a relational database?  Or, if not, can we put Spark
 on top of a relational database?

 Note that we don't care about SQL.  Accessing our data via standard
 queries is nice, but we are equally happy (or more happy) to write Scala
 code.

 What is important to us is doing relational queries on huge amounts of
 data.  Is Spark good at this?

 Thank you very much in advance
 Peter



Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode

2014-10-20 Thread Soumya Simanta
I'm working a cluster where I need to start the workers separately and
connect them to a master.

I'm following the instructions here and using branch-1.1
http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually

and I can start the master using
./sbin/start-master.sh

When I try to start the slave/worker using
./sbin/start-slave.sh it doesn't work. The logs say that it needs the
master.
When I provide
./sbin/start-slave.sh spark://master-ip:7077 it still doesn't work.

I can start the worker using the following command (as described in the
documentation).

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

Was wondering why start-slave.sh is not working?

Thanks
-Soumya
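
One possible explanation (an assumption on my part, worth checking against
the script's usage message in branch-1.1): in this Spark version
sbin/start-slave.sh expected a worker number argument before the master URL,
roughly

./sbin/start-slave.sh 1 spark://master-ip:7077

whereas later releases accept the master URL alone.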


Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Soumya Simanta
I've a SchemaRDD that I want to convert to an RDD of Strings. How
do I convert the Row objects inside the SchemaRDD to Strings?
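
For reference, one minimal way to do this (a sketch; schemaRDD below stands
for your SchemaRDD, and in the 1.x API a Row behaves like a sequence of
column values):

// Join each row's columns into a delimited string; the separator is arbitrary.
val strings: org.apache.spark.rdd.RDD[String] =
  schemaRDD.map(row => row.mkString(","))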


Storing shuffle files on a Tachyon

2014-10-07 Thread Soumya Simanta
Is it possible to store spark shuffle files on Tachyon ?


Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the featurize method that
Aaron used in one of his demos. I believe this feature will just work for
detecting the character set (i.e., the language used).

Can someone help?


def featurize(s: String): Vector = {
  val n = 1000
  val result = new Array[Double](n)
  val bigrams = s.sliding(2).toArray   // every 2-character window of the string

  // Hash each bigram into one of n buckets and accumulate its relative frequency.
  for (h <- bigrams.map(_.hashCode % n)) {
    result(h) += 1.0 / bigrams.length
  }

  // Keep only the non-zero buckets as a sparse vector of (index, value) pairs.
  Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
}


Setting serializer to KryoSerializer from command line for spark-shell

2014-09-20 Thread Soumya Simanta
Hi,

I want to set the serializer for my spark-shell to Kryo, i.e., set
spark.serializer to org.apache.spark.serializer.KryoSerializer.

Can I do it without setting a new SparkConf?

Thanks
-Soumya
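
For reference, two ways this is commonly done without constructing a
SparkConf in code (a sketch; the --conf flag assumes a Spark version whose
spark-shell supports it):

# Pass the property when launching the shell:
./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

# Or set it once in conf/spark-defaults.conf:
# spark.serializer   org.apache.spark.serializer.KryoSerializer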


Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for
me.
Let the container handle the HTTP request and then talk to Spark using
another HTTP/REST interface. You can use the Spark Job Server for this.

Embedding Spark inside the container is not a great long term solution IMO
because you may see issues when you want to connect with a Spark cluster.



On Tue, Sep 16, 2014 at 11:16 AM, Ruebenacker, Oliver A 
oliver.ruebenac...@altisource.com wrote:



  Hello,



   Suppose I want to use Spark from an application that I already submit to
 run in another container (e.g. Tomcat). Is this at all possible? Or do I
 have to split the app into two components, and submit one to Spark and one
 to the other container? In that case, what is the preferred way for the two
 components to communicate with each other? Thanks!



  Best, Oliver



 Oliver Ruebenacker | Solutions Architect



 Altisource™

 290 Congress St, 7th Floor | Boston, Massachusetts 02210

 P: (617) 728-5582 | ext: 275585

 oliver.ruebenac...@altisource.com | www.Altisource.com







Re: About SpakSQL OR MLlib

2014-09-15 Thread Soumya Simanta
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int,
fuel:Int)

1. Create a paired RDD of (age, Car) tuples (pairedRDD)
2. Create a new function fc

//returns the interval lower and upper bound

def fc(x:Int, interval:Int) : (Int,Int) = {

 val floor = x - (x%interval)

 val ceil = floor + interval

 (floor,ceil)

 }
3. Do a groupBy on the RDD from step 1, passing a function based on fc:

val myrdd = pairedRDD.groupBy( x => fc(x._1, 5) )   // x._1 is the age key from step 1; 5 is the interval width


On Mon, Sep 15, 2014 at 11:38 PM, boyingk...@163.com boyingk...@163.com
wrote:

  Hi:
 I have a dataset with the structure [id, driverAge, TotalKiloMeter, Emissions,
 date, KiloMeter, fuel], and the data looks like this:
  [1-980,34,221926,9,2005-2-8,123,14]
 [1-981,49,271321,15,2005-2-8,181,82]
 [1-982,36,189149,18,2005-2-8,162,51]
 [1-983,51,232753,5,2005-2-8,106,92]
 [1-984,56,45338,8,2005-2-8,156,98]
 [1-985,45,132060,4,2005-2-8,179,98]
 [1-986,40,15751,5,2005-2-8,149,77]
 [1-987,36,167930,17,2005-2-8,121,87]
 [1-988,53,44949,4,2005-2-8,195,72]
 [1-989,34,252867,5,2005-2-8,181,86]
 [1-990,53,152858,11,2005-2-8,130,43]
 [1-991,40,126831,11,2005-2-8,126,47]
 ……

 Now, my requirement is to group by driverAge in steps of five, e.g. 20~25 is
 one group, 26~30 is another group.
 How should I do this? Can someone give some code?


 --
  boyingk...@163.com



Re: Spark and Scala

2014-09-12 Thread Soumya Simanta
An RDD is a fault-tolerant distributed structure. It is the primary
abstraction in Spark.

I would strongly suggest that you have a look at the following to get a
basic idea.

http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
http://spark.apache.org/docs/latest/quick-start.html#basics
https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Take this for example:
 I have declared one queue, *val queue = Queue.empty[Int]*, which is a pure
 Scala line in the program. I actually want the queue to be an RDD, but there
 are no direct methods to create an RDD which is a queue, right? What do you
 say about this?
 Does there exist something like: *Create an RDD which is a queue*?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

  There is one thing that I am confused about.
  Spark's code is implemented in Scala. Now, can we run
  any Scala code on the Spark framework? What will be the difference in the
  execution of the Scala code on normal systems and on Spark?
  The reason for my question is the following:
  I had a variable
  *val temp = some operations*
  This temp was being created inside a loop. To manually throw
  it out of the cache every time the loop ends, I was calling
  *temp.unpersist()*; this was returning an error saying that *value
  unpersist is not a method of Int*, which means that temp is an Int.
  Can someone explain to me why I was not able to call *unpersist* on
  *temp*?

 Thank You
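
For what it's worth, a minimal sketch of the explicit conversion described
in the replies above (nothing here is from the original thread; the RDD is
built from the queue's elements, it does not itself behave like a queue):

import scala.collection.immutable.Queue

val queue = Queue(1, 2, 3, 4, 5)
val rdd = sc.parallelize(queue.toSeq)   // an RDD of Int built from the queue's contents
rdd.cache()
// unpersist() is available here because rdd is an RDD, unlike the plain Int above
rdd.unpersist()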








Re: Running Spark shell on YARN

2014-08-16 Thread Soumya Simanta
)

at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)

at org.apache.spark.repl.Main$.main(Main.scala:31)

at org.apache.spark.repl.Main.main(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.net.URISyntaxException: Expected scheme-specific part at
index 5: conf:

at java.net.URI$Parser.fail(URI.java:2829)

at java.net.URI$Parser.failExpecting(URI.java:2835)

at java.net.URI$Parser.parse(URI.java:3038)

at java.net.URI.init(URI.java:753)

at org.apache.hadoop.fs.Path.initialize(Path.java:203)

... 62 more


Spark context available as sc.





On Fri, Aug 15, 2014 at 3:49 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 After changing the allocation I'm getting the following in my logs. No
 idea what this means.

 14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:34 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:35 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:36 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:37 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:38 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:39 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:40 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:41 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:42 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:43 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:44 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:45 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:46 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:47 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:48 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:49 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:50 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372

  yarnAppState: ACCEPTED


 14/08/15 15:44:51 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:

  appMasterRpcPort: -1

  appStartTime: 1408131861372
 yarnAppState: ACCEPTED


 On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 We generally recommend setting yarn.scheduler.maximum-allocation-mbto
 the maximum node capacity.

 -Sandy


 On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta 
 soumya.sima...@gmail.com wrote:

 I just checked the YARN config and looks like I need to change this
 value. Should be upgraded

Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I've been using the standalone cluster all this time and it worked fine.
Recently I'm using another Spark cluster that is based on YARN, and I have no
experience with YARN.

The YARN cluster has 10 nodes and a total memory of 480G.

I'm having trouble starting the spark-shell with enough memory.
I'm doing a very simple operation - reading a 100GB file from HDFS and
running a count on it. This fails due to out of memory on the executors.

Can someone point to the command line parameters that I should use for
spark-shell so that it works?


Thanks
-Soumya


Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I just checked the YARN config and it looks like I need to change this value.
Should it be upgraded to 48G (the max memory allocated to YARN) per node?

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
  <source>java.io.BufferedInputStream@2e7e1ee</source>
</property>


On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 Andrew,

 Thanks for your response.

 When I try to do the following.

  ./spark-shell --executor-memory 46g --master yarn

 I get the following error.

 Exception in thread main java.lang.Exception: When running with master
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
 environment.

 at
 org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)

 at
 org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61)

 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)

  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 After this I set the following env variable.

 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

 The program launches but then halts with the following error.


 *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB),
 is above the max threshold (6144 MB) of this cluster.*

 I guess this is some YARN setting that is not set correctly.


 Thanks

 -Soumya


 On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote:

 Hi Soumya,

 The driver's console output prints out how much memory is actually
 granted to each executor, so from there you can verify how much memory the
 executors are actually getting. You should use the '--executor-memory'
 argument in spark-shell. For instance, assuming each node has 48G of memory,

 bin/spark-shell --executor-memory 46g --master yarn

 We leave a small cushion for the OS so we don't take up all of the entire
 system's memory. This option also applies to the standalone mode you've
 been using, but if you have been using the ec2 scripts, we set
 spark.executor.memory in conf/spark-defaults.conf for you automatically
 so you don't have to specify it each time on the command line. Of course,
 you can also do the same in YARN.

 -Andrew



 2014-08-15 10:45 GMT-07:00 Soumya Simanta soumya.sima...@gmail.com:

 I've been using the standalone cluster all this time and it worked fine.
 Recently I'm using another Spark cluster that is based on YARN and I've
 not experience with YARN.

 The YARN cluster has 10 nodes and a total memory of 480G.

 I'm having trouble starting the spark-shell with enough memory.
 I'm doing a very simple operation - reading a file 100GB from HDFS and
 running a count on it. This fails due to out of memory on the executors.

  Can someone point to the command line parameters that I should use for
  spark-shell so that it works?


 Thanks
 -Soumya






Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
After changing the allocation I'm getting the following in my logs. No idea
what this means.

14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:34 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:35 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:36 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:37 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:38 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:39 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:40 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:41 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:42 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:43 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:44 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:45 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:46 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:47 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:48 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:49 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:50 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372

 yarnAppState: ACCEPTED


14/08/15 15:44:51 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:

 appMasterRpcPort: -1

 appStartTime: 1408131861372
yarnAppState: ACCEPTED


On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 We generally recommend setting yarn.scheduler.maximum-allocation-mbto the
 maximum node capacity.

 -Sandy


 On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com
  wrote:

 I just checked the YARN config and looks like I need to change this
 value. Should be upgraded to 48G (the max memory allocated to YARN) per
 node ?

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>
    <source>java.io.BufferedInputStream@2e7e1ee</source>
  </property>


 On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
  wrote:

 Andrew,

 Thanks for your response.

 When I try to do the following.

  ./spark-shell --executor-memory 46g --master yarn

 I get the following error.

 Exception in thread main java.lang.Exception: When running with master
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the
 environment.

 at
 org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)

 at
 org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:61)

 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)

  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 After this I set the following env variable.

 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

 The program launches but then halts with the following error.


 *14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104
 MB), is above the max threshold (6144 MB) of this cluster.*

 I guess this is some YARN setting that is not set correctly.


 Thanks

 -Soumya


 On Fri, Aug 15, 2014 at 2:19 PM, Andrew

Script to deploy spark to Google compute engine

2014-08-13 Thread Soumya Simanta
Before I start doing something on my own I wanted to check if someone has
created a script to deploy the latest version of Spark to Google Compute
Engine.

Thanks
-Soumya


Re: Transform RDD[List]

2014-08-11 Thread Soumya Simanta
Try something like this.


scala> val a = sc.parallelize(List(1,2,3,4,5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val b = sc.parallelize(List(6,7,8,9,10))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> val x = a zip b
x: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedRDD[3] at zip at <console>:16

scala> val f = x.map( x => List(x._1, x._2) )
f: org.apache.spark.rdd.RDD[List[Int]] = MappedRDD[5] at map at <console>:18

scala> f.foreach(println)
List(2, 7)
List(1, 6)
List(5, 10)
List(3, 8)
List(4, 9)


On Tue, Aug 12, 2014 at 12:42 AM, Kevin Jung itsjb.j...@samsung.com wrote:

 Hi
 It may be a simple question, but I cannot figure out the most efficient way.
 There is an RDD containing lists.

 RDD
 (
  List(1,2,3,4,5)
  List(6,7,8,9,10)
 )

 I want to transform this to

 RDD
 (
 List(1,6)
 List(2,7)
 List(3,8)
 List(4,9)
 List(5,10)
 )

 And I want to achieve this without using the collect method, because a
 real-world RDD can have a lot of elements and it may cause out of memory.
 Any ideas will be welcome.

 Best regards
 Kevin







Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Check your executor logs for the output, or, if your data is not big, collect
it in the driver and print it.



 On Jul 16, 2014, at 9:21 AM, Sarath Chandra 
 sarathchandra.jos...@algofusiontech.com wrote:
 
 Hi All,
 
 I'm trying to do a simple record matching between 2 files and wrote following 
 code -
 
  import org.apache.spark.sql.SQLContext;
  import org.apache.spark.rdd.RDD
  object SqlTest {
    case class Test(fld1:String, fld2:String, fld3:String, fld4:String, fld4:String, fld5:Double, fld6:String);
    sc.addJar("test1-0.1.jar");
    val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");
    val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");
    val sq = new SQLContext(sc);
    val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
    val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
    val file1_schema = sq.createSchemaRDD(file1_recs);
    val file2_schema = sq.createSchemaRDD(file2_recs);
    file1_schema.registerAsTable("file1_tab");
    file2_schema.registerAsTable("file2_tab");
    val matched = sq.sql("select * from file1_tab l join file2_tab s on l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and l.fld2=s.fld2");
    val count = matched.count();
    System.out.println("Found " + matched.count() + " matching records");
  }
 
 When I run this program on a standalone spark cluster, it keeps running for 
 long with no output or error. After waiting for few mins I'm forcibly killing 
 it.
 But the same program is working well when executed from a spark shell.
 
 What is going wrong? What am I missing?
 
 ~Sarath


Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
When you submit your job, it should appear on the Spark UI, same as with the
REPL. Make sure your job is submitted to the cluster properly.


On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra 
sarathchandra.jos...@algofusiontech.com wrote:

 Hi Soumya,

 Data is very small, 500+ lines in each file.

 Removed last 2 lines and placed this at the end
 matched.collect().foreach(println);. Still no luck. It's been more than
 5min, the execution is still running.

 Checked logs, nothing in stdout. In stderr I don't see anything going
 wrong, all are info messages.

 What else do I need check?

 ~Sarath

 On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta soumya.sima...@gmail.com
 wrote:

 Check your executor logs for the output or if your data is not big
 collect it in the driver and print it.



 On Jul 16, 2014, at 9:21 AM, Sarath Chandra 
 sarathchandra.jos...@algofusiontech.com wrote:

 Hi All,

 I'm trying to do a simple record matching between 2 files and wrote
 following code -

  import org.apache.spark.sql.SQLContext;
  import org.apache.spark.rdd.RDD
  object SqlTest {
    case class Test(fld1:String, fld2:String, fld3:String, fld4:String, fld4:String, fld5:Double, fld6:String);
    sc.addJar("test1-0.1.jar");
    val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");
    val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");
    val sq = new SQLContext(sc);
    val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
    val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
    val file1_schema = sq.createSchemaRDD(file1_recs);
    val file2_schema = sq.createSchemaRDD(file2_recs);
    file1_schema.registerAsTable("file1_tab");
    file2_schema.registerAsTable("file2_tab");
    val matched = sq.sql("select * from file1_tab l join file2_tab s on l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and l.fld2=s.fld2");
    val count = matched.count();
    System.out.println("Found " + matched.count() + " matching records");
  }

 When I run this program on a standalone spark cluster, it keeps running
 for long with no output or error. After waiting for few mins I'm forcibly
 killing it.
 But the same program is working well when executed from a spark shell.

 What is going wrong? What am I missing?

 ~Sarath





Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta


Can you try submitting a very simple job to the cluster?

 On Jul 16, 2014, at 10:25 AM, Sarath Chandra 
 sarathchandra.jos...@algofusiontech.com wrote:
 
 Yes it is appearing on the Spark UI, and remains there with state as 
 RUNNING till I press Ctrl+C in the terminal to kill the execution.
 
 Barring the statements to create the spark context, if I copy paste the lines 
 of my code in spark shell, runs perfectly giving the desired output.
 
 ~Sarath
 
 On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 When you submit your job, it should appear on the Spark UI. Same with the 
 REPL. Make sure you job is submitted to the cluster properly. 
 
 
 On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra 
 sarathchandra.jos...@algofusiontech.com wrote:
 Hi Soumya,
 
 Data is very small, 500+ lines in each file.
 
 Removed last 2 lines and placed this at the end 
 matched.collect().foreach(println);. Still no luck. It's been more than 
 5min, the execution is still running.
 
 Checked logs, nothing in stdout. In stderr I don't see anything going 
 wrong, all are info messages.
 
 What else do I need check?
 
 ~Sarath
 
 On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 Check your executor logs for the output or if your data is not big collect 
 it in the driver and print it. 
 
 
 
 On Jul 16, 2014, at 9:21 AM, Sarath Chandra 
 sarathchandra.jos...@algofusiontech.com wrote:
 
 Hi All,
 
 I'm trying to do a simple record matching between 2 files and wrote 
 following code -
 
 import org.apache.spark.sql.SQLContext;
 import org.apache.spark.rdd.RDD
 object SqlTest {
   case class Test(fld1: String, fld2: String, fld3: String, fld4: String, fld4a: String, fld5: Double, fld6: String);
   sc.addJar("test1-0.1.jar");
   val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");
   val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");
   val sq = new SQLContext(sc);
   val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
   val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
   val file1_schema = sq.createSchemaRDD(file1_recs);
   val file2_schema = sq.createSchemaRDD(file2_recs);
   file1_schema.registerAsTable("file1_tab");
   file2_schema.registerAsTable("file2_tab");
   val matched = sq.sql("select * from file1_tab l join file2_tab s on l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and l.fld2=s.fld2");
   val count = matched.count();
   System.out.println("Found " + matched.count() + " matching records");
 }
 
 When I run this program on a standalone spark cluster, it keeps running 
 for long with no output or error. After waiting for few mins I'm forcibly 
 killing it.
 But the same program is working well when executed from a spark shell.
 
 What is going wrong? What am I missing?
 
 ~Sarath
 


Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Soumya Simanta
Please look at the following.

https://github.com/ooyala/spark-jobserver
http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
https://github.com/EsotericSoftware/kryo

You can train your model, convert it to PMML, and return that to your client,
OR

You can train your model and write that model (serialized object) to the
file system (local, HDFS, S3 etc) or a datastore and return a location back
to the client on a successful write.
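A rough, untested sketch of that second option (the model path is a placeholder and trainingData: RDD[LabeledPoint] is assumed to already exist):

import java.io.{FileOutputStream, ObjectOutputStream}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Train on the cluster (trainingData is an RDD[LabeledPoint] prepared elsewhere).
val model = LinearRegressionWithSGD.train(trainingData, 100)

// Write the model object out with plain Java serialization.
val modelPath = "/tmp/models/linear-regression.model"   // placeholder location
val out = new ObjectOutputStream(new FileOutputStream(modelPath))
out.writeObject(model)
out.close()

// Hand modelPath back to the client; the client deserializes the object
// and calls model.predict(features) locally as often as it likes.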





On Mon, Jul 14, 2014 at 4:27 PM, Aris Vlasakakis a...@vlasakakis.com
wrote:

 Hello Spark community,

 I would like to write an application in Scala that i a model server. It
 should have an MLlib Linear Regression model that is already trained on
 some big set of data, and then is able to repeatedly call
 myLinearRegressionModel.predict() many times and return the result.

 Now, I want this client application to submit a job to Spark and tell the
 Spark cluster job to

 1) train its particular MLlib model, which produces a LinearRegression
 model, and then

 2) take the produced Scala
 org.apache.spark.mllib.regression.LinearRegressionModel *object*, serialize
 that object, and return this serialized object over the wire to my calling
 application.

 3) My client application receives the serialized Scala (model) object, and
 can call .predict() on it over and over.

 I am separating the heavy lifting of training the model and doing model
 predictions; the client application will only do predictions using the
 MLlib model it received from the Spark application.

 The confusion I have is that I only know how to submit jobs to Spark by
 using the bin/spark-submit script, and then the only output I receive is
 stdout (as in, text). I want my scala appliction to hopefully submit the
 spark model-training programmatically, and for the Spark application to
 return a SERIALIZED MLLIB OBJECT, not just some stdout text!

 How can I do this? I think my use case of separating long-running jobs to
 Spark and using it's libraries in another application should be a pretty
 common design pattern.

 Thanks!

 --
 Άρης Βλασακάκης
 Aris Vlasakakis



Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
If you are on the 1.0.0 release you can also try converting your RDD to a
SchemaRDD and running the groupBy there. The Spark SQL optimizer may yield
better results; it's worth a try at least.
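A sketch of what that looks like against the 1.0.0 API (the Event case class and the events RDD are assumptions standing in for your own data):

import org.apache.spark.sql.SQLContext

case class Event(day: Int, payload: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

// events: RDD[Event], parsed from your source data
events.registerAsTable("events")

// Let the Spark SQL optimizer handle the per-day grouping.
val perDay = sqlContext.sql("SELECT day, COUNT(1) FROM events GROUP BY day")
perDay.collect().foreach(println)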


On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:






 Solution 2 is to map the objects into a pair RDD where the
 key is the number of the day in the interval, then group by
 key, collect, and parallelize the resulting grouped data.
 However, I worry collecting large data sets is going to be
 a serious performance bottleneck.


 Why do you have to do a collect ?  You can do a groupBy and then write
 the grouped data to disk again



Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
 I think my best option is to partition my data in directories by day
 before running my Spark application, and then direct
 my Spark application to load RDD's from each directory when
 I want to load a date range. How does this sound?

 If your upstream system can write data by day then it makes perfect sense
to do that and load (into Spark) only the data that is required for
processing. This also saves you the filter step and, hopefully, time and
memory. If you want to get back the bigger dataset you can always combine
(union) multiple days of data (RDDs) together.
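For example, if the data lands in one directory per day, loading a date range is just a union of those directories (the paths and dates below are placeholders):

// One directory of data per day, written by the upstream system.
val days = Seq("2014-07-01", "2014-07-02", "2014-07-03")
val perDay = days.map(d => sc.textFile("hdfs:///data/events/" + d))

// Combine the selected days back into one RDD when the bigger dataset is needed.
val range = sc.union(perDay)
println("records in range: " + range.count())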


Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Do you have a proxy server ? 

If yes you need to set the proxy for twitter4j 

 On Jul 11, 2014, at 7:06 PM, SK skrishna...@gmail.com wrote:
 
 I dont get any exceptions or error messages.
 
 I tried it both with and without VPN and had the same outcome. But  I can
 try again without VPN later today and report back.
 
 thanks.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-training-Spark-Summit-2014-tp9465p9477.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Try running a simple standalone program if you are using Scala and see if
you are getting any data. I use this to debug any connection/twitter4j
issues.


import twitter4j._


//put your keys and creds here
object Util {
  val config = new twitter4j.conf.ConfigurationBuilder()
.setOAuthConsumerKey()
.setOAuthConsumerSecret()
.setOAuthAccessToken()
.setOAuthAccessTokenSecret()
.build
}


/**
 *   Add this to your build.sbt
 *   "org.twitter4j" % "twitter4j-stream" % "3.0.3",

 */
object SimpleStreamer extends App {


  def simpleStatusListener = new StatusListener() {
def onStatus(status: Status) {
  println(status.getUserMentionEntities.length)
}

def onDeletionNotice(statusDeletionNotice: StatusDeletionNotice) {}

def onTrackLimitationNotice(numberOfLimitedStatuses: Int) {}

def onException(ex: Exception) {
  ex.printStackTrace
}

def onScrubGeo(arg0: Long, arg1: Long) {}

def onStallWarning(warning: StallWarning) {}
  }

  val keywords = List("dog", "cat")
  val twitterStream = new TwitterStreamFactory(Util.config).getInstance
  twitterStream.addListener(simpleStatusListener)
  twitterStream.filter(new FilterQuery().track(keywords.toArray))

}



On Fri, Jul 11, 2014 at 7:19 PM, SK skrishna...@gmail.com wrote:

 I dont have a proxy server.

 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-training-Spark-Summit-2014-tp9465p9481.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Comparative study

2014-07-07 Thread Soumya Simanta


Daniel, 

Do you mind sharing the size of your cluster and the production data volumes ? 

Thanks
Soumya 

 On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
 
 From a development perspective, I vastly prefer Spark to MapReduce. The 
 MapReduce API is very constrained; Spark's API feels much more natural to me. 
 Testing and local development is also very easy - creating a local Spark 
 context is trivial and it reads local files. For your unit tests you can just 
 have them create a local context and execute your flow with some test data. 
 Even better, you can do ad-hoc work in the Spark shell and if you want that 
 in your production code it will look exactly the same.
 
 Unfortunately, the picture isn't so rosy when it gets to production. In my 
 experience, Spark simply doesn't scale to the volumes that MapReduce will 
 handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be 
 better, but I haven't had the opportunity to try them. I find jobs tend to 
 just hang forever for no apparent reason on large data sets (but smaller than 
 what I push through MapReduce).
 
 I am hopeful the situation will improve - Spark is developing quickly - but 
 if you have large amounts of data you should proceed with caution.
 
 Keep in mind there are some frameworks for Hadoop which can hide the ugly 
 MapReduce with something very similar in form to Spark's API; e.g. Apache 
 Crunch. So you might consider those as well.
 
 (Note: the above is with Spark 1.0.0.)
 
 
 
 On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote:
 Hello Experts,
 
  
 
 I am doing some comparative study on the below:
 
  
 
 Spark vs Impala
 
 Spark vs MapREduce . Is it worth migrating from existing MR implementation 
 to Spark?
 
  
 
  
 
 Please share your thoughts and expertise.
 
  
 
  
 
 Thanks,
 Santosh
 
 
 
 This message is for the designated recipient only and may contain 
 privileged, proprietary, or otherwise confidential information. If you have 
 received it in error, please notify the sender immediately and delete the 
 original. Any other use of the e-mail by you is prohibited. Where allowed by 
 local law, electronic communications with Accenture and its affiliates, 
 including e-mail and instant messaging (including content), may be scanned 
 by our systems for the purposes of information security and assessment of 
 internal compliance with Accenture policy. 
 __
 
 www.accenture.com
 
 
 
 -- 
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning
 
 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Are these sessions recorded ?


On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:







 General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
 Track A: http://www.ustream.tv/channel/track-a1
 Track B: http://www.ustream.tv/channel/track-b1
 Track C: http://www.ustream.tv/channel/track-c1


 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com
 wrote:

 I attended yesterday on ustream.tv, but can't find the links to today's
 streams anywhere. help!

 --
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)





Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Awesome. 

Just want to catch up on some sessions from other tracks. 

Learned a ton over the last two days. 

Thanks 
Soumya 






 On Jul 1, 2014, at 8:50 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Yup, we’re going to try to get the videos up as soon as possible.
 
 Matei
 
 On Jul 1, 2014, at 7:47 PM, Marco Shaw marco.s...@gmail.com wrote:
 
 They are recorded...  For example, 2013: http://spark-summit.org/2013
 
 I'm assuming the 2014 videos will be up in 1-2 weeks.
 
 Marco
 
 
 On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 Are these sessions recorded ? 
 
 
 On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:
 General Session / Keynotes :  
 http://www.ustream.tv/channel/spark-summit-2014
 
 Track A : http://www.ustream.tv/channel/track-a1
 
 Track B: http://www.ustream.tv/channel/track-b1
 
 Track C: http://www.ustream.tv/channel/track-c1
 
 
 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com 
 wrote:
 I attended yesterday on ustream.tv, but can't find the links to today's 
 streams anywhere. help!
 
 -- 
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
 


Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta

You can add a back-pressure-enabled component in front that feeds data into
Spark. This component can control the input rate to Spark.

 On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote:
 
 Hi to all,
 in my use case I'd like to receive events and call an external service as 
 they pass through. Is it possible to limit the number of contemporaneous call 
 to that service (to avoid DoS) using Spark streaming? if so, limiting the 
 rate implies a possible buffer growth...how can I control the buffer of 
 incoming events waiting to be processed?
 
 Best,
 Flavio 


Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
Flavio - I'm new to Spark as well, but I've done stream processing using
other frameworks. My comments below are not Spark Streaming specific. Maybe
someone who knows more can provide better insights.

I read your post on my phone and I believe my answer doesn't completely
address the issue you have raised.

Do you need to call the external service for every event, i.e., do you
need to process all events? Does the order of processing events matter?
Is there a time bound within which each event should be processed?

Calling an external service means network I/O, so you have to buffer events
if the service is rate limited or slower than the rate at which you are
processing events.

Here are some ways of dealing with this situation:

1. Drop events based on a policy (such as buffer/queue size),
2. Tell the event producer to slow down if that's in your control
3. Use a proxy or a set of proxies to distribute the calls to the remote
service, if the rate limit is by user or network node only.

I'm not sure how many of these are implemented directly in Spark Streaming,
but you can have an external component that controls the rate of events and
only sends events to the Spark stream when it is ready to process more
messages.
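As a very rough sketch of that throttling idea (untested; callExternalService is a hypothetical stand-in for your client code, eventStream is an assumed DStream[String], and the RateLimiter is Guava's, nothing Spark-specific):

import com.google.common.util.concurrent.RateLimiter

def callExternalService(event: String): Unit = {
  // hypothetical network call to the rate-limited service
}

eventStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    // One limiter per task; size the permits so the aggregate rate across
    // all partitions stays under the service's limit.
    val limiter = RateLimiter.create(10.0)
    events.foreach { e =>
      limiter.acquire()   // blocks, so the partition naturally buffers behind it
      callExternalService(e)
    }
  }
}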

Hope this helps.

-Soumya




On Wed, Jun 18, 2014 at 6:50 PM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 Thanks for the quick reply soumya. Unfortunately I'm a newbie with
 Spark..what do you mean? is there any reference to how to do that?


 On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta soumya.sima...@gmail.com
  wrote:


  You can add a back-pressure-enabled component in front that feeds data
  into Spark. This component can control the input rate to Spark.

  On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it
 wrote:
 
  Hi to all,
  in my use case I'd like to receive events and call an external service
 as they pass through. Is it possible to limit the number of contemporaneous
 call to that service (to avoid DoS) using Spark streaming? if so, limiting
 the rate implies a possible buffer growth...how can I control the buffer of
 incoming events waiting to be processed?
 
  Best,
  Flavio




Re: Unable to run a Standalone job

2014-05-22 Thread Soumya Simanta
Try cleaning your maven (.m2) and ivy cache. 



 On May 23, 2014, at 12:03 AM, Shrikar archak shrika...@gmail.com wrote:
 
 Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.
 
 Thanks,
 Shrikar
 
 
 On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com 
 wrote:
 How are you getting Spark with 1.0.0-SNAPSHOT through maven? Did you publish 
 Spark locally which allowed you to use it as a dependency?
 
  This is weird indeed. SBT should take care of all the dependencies of
  Spark.
 
 In any case, you can try the last released Spark 0.9.1 and see if the 
 problem persists.
 
 
 On Thu, May 22, 2014 at 3:59 PM, Shrikar archak shrika...@gmail.com wrote:
 I am running as sbt run. I am running it locally .
 
 Thanks,
 Shrikar
 
 
 On Thu, May 22, 2014 at 3:53 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:
 How are you launching the application? sbt run ? spark-submit? local
 mode or Spark standalone cluster? Are you packaging all your code into
 a jar?
 Looks to me that you seem to have spark classes in your execution
 environment but missing some of Spark's dependencies.
 
 TD
 
 
 
 On Thu, May 22, 2014 at 2:27 PM, Shrikar archak shrika...@gmail.com 
 wrote:
  Hi All,
 
  I am trying to run the network count example as a seperate standalone job
  and running into some issues.
 
  Environment:
  1) Mac Mavericks
  2) Latest spark repo from Github.
 
 
  I have a structure like this
 
  Shrikars-MacBook-Pro:SimpleJob shrikar$ find .
  .
  ./simple.sbt
  ./src
  ./src/main
  ./src/main/scala
  ./src/main/scala/NetworkWordCount.scala
  ./src/main/scala/SimpleApp.scala.bk
 
 
  simple.sbt
  name := "Simple Project"
 
  version := "1.0"
 
  scalaVersion := "2.10.3"
 
  libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-streaming" % "1.0.0-SNAPSHOT")
 
  resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
 
 
  I am able to run the SimpleApp which is mentioned in the doc, but when I
  try to run the NetworkWordCount app I get an error like the one below; am I
  missing something?
 
  [info] Running com.shrikar.sparkapps.NetworkWordCount
  14/05/22 14:26:47 INFO spark.SecurityManager: Changing view acls to: 
  shrikar
  14/05/22 14:26:47 INFO spark.SecurityManager: SecurityManager:
  authentication disabled; ui acls disabled; users with view permissions:
  Set(shrikar)
  14/05/22 14:26:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
  14/05/22 14:26:48 INFO Remoting: Starting remoting
  14/05/22 14:26:48 INFO Remoting: Remoting started; listening on addresses
  :[akka.tcp://spark@192.168.10.88:49963]
  14/05/22 14:26:48 INFO Remoting: Remoting now listens on addresses:
  [akka.tcp://spark@192.168.10.88:49963]
  14/05/22 14:26:48 INFO spark.SparkEnv: Registering MapOutputTracker
  14/05/22 14:26:48 INFO spark.SparkEnv: Registering BlockManagerMaster
  14/05/22 14:26:48 INFO storage.DiskBlockManager: Created local directory 
  at
  /var/folders/r2/mbj08pb55n5d_9p8588xk5b0gn/T/spark-local-20140522142648-0a14
  14/05/22 14:26:48 INFO storage.MemoryStore: MemoryStore started with
  capacity 911.6 MB.
  14/05/22 14:26:48 INFO network.ConnectionManager: Bound socket to port 
  49964
  with id = ConnectionManagerId(192.168.10.88,49964)
  14/05/22 14:26:48 INFO storage.BlockManagerMaster: Trying to register
  BlockManager
  14/05/22 14:26:48 INFO storage.BlockManagerInfo: Registering block 
  manager
  192.168.10.88:49964 with 911.6 MB RAM
  14/05/22 14:26:48 INFO storage.BlockManagerMaster: Registered 
  BlockManager
  14/05/22 14:26:48 INFO spark.HttpServer: Starting HTTP Server
  [error] (run-main) java.lang.NoClassDefFoundError:
  javax/servlet/http/HttpServletResponse
  java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
  at org.apache.spark.HttpServer.start(HttpServer.scala:54)
  at
  org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
  at
  org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
  at
  org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
  at
  org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
  at
  org.apache.spark.broadcast.BroadcastManager.init(BroadcastManager.scala:35)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
  at org.apache.spark.SparkContext.init(SparkContext.scala:202)
  at
  org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:549)
  at
  org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:561)
  at
  org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:91)
  at 
  com.shrikar.sparkapps.NetworkWordCount$.main(NetworkWordCount.scala:39)
  at com.shrikar.sparkapps.NetworkWordCount.main(NetworkWordCount.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Soumya Simanta
Suggestion - try to get an idea of your hardware requirements by running a
sample on Amazon's EC2 or Google Compute Engine. It's relatively easy (and
cheap) to get started on the cloud before you invest in your own hardware,
IMO.




On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar upent...@gmail.comwrote:

 Hi,
 I would like to setup apache platform on a mini cluster. Is there any
 recommendation for the hardware that I can buy to set it up. I am thinking
 about processing significant amount of data like in the range of few
 terabytes.

 Thanks
 Upender



Re: Historical Data as Stream

2014-05-17 Thread Soumya Simanta
@Laeeq - please see this example.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49
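The gist of that example: point a StreamingContext at a directory, and every new file atomically moved into it is treated as part of the stream. A minimal sketch (the directory path and batch interval are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Files atomically moved into this directory become the "historical" stream.
val lines = ssc.textFileStream("hdfs:///user/hduser/eeg-incoming")
lines.count().print()

ssc.start()
ssc.awaitTermination()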



On Sat, May 17, 2014 at 2:06 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:

 @Soumya Simanta

 Right now it's just a proof of concept. Later I will have a real stream:
 EEG files of the brain. Later it can be used for real-time analysis of EEG
 streams.

 @Mayur

 The size is huge, yes. So it's better to do it in a distributed manner, and as I
 said above I want to read it as a stream because later I will have stream data.
 This is a proof of concept.

 Regards,
 Laeeq

   On Saturday, May 17, 2014 7:03 PM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:
  The real question is why are looking to consume file as a Stream
 1. Too big to load as RDD
 2. Operate in sequential manner.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 17, 2014 at 5:12 AM, Soumya Simanta 
 soumya.sima...@gmail.comwrote:

 A file is just a stream with a fixed length. Usually streams don't end, but in
 this case it would.

 On the other hand, if you read your file as a stream you may not be able to use
 the entire data in the file for your analysis. Spark (given enough memory)
 can process large amounts of data quickly.

 On May 15, 2014, at 9:52 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:

 Hi,

 I have data in a file. Can I read it as a stream in Spark? I know it seems
 odd to read a file as a stream, but it has practical applications in real life
 if I can read it as a stream. Is there any other tool which can give this
 file as a stream to Spark, or do I have to make batches manually, which is not
 what I want? It's a column of a million values.

 Regards,
 Laeeq








Re: Proper way to create standalone app with custom Spark version

2014-05-16 Thread Soumya Simanta
Install your custom Spark jar into your local Maven or Ivy repo, then depend on
that custom jar in your pom/sbt file.
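For example, if the custom build was published locally (sbt publish-local or mvn install) under its own version string, the application's build.sbt only needs to reference that version (the version string below is made up):

// build.sbt -- depend on the locally published custom Spark build
name := "my-spark-app"

scalaVersion := "2.10.4"

// Must match the version you used when publishing the custom build locally.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-custom-hadoop2"

// sbt already checks the local Ivy repo; add the local Maven repo too if needed.
resolvers += Resolver.mavenLocal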



 On May 15, 2014, at 3:28 AM, Andrei faithlessfri...@gmail.com wrote:
 
 (Sorry if you have already seen this message - it seems like there were some 
 issues delivering messages to the list yesterday)
 
 We can create standalone Spark application by simply adding spark-core_2.x 
 to build.sbt/pom.xml and connecting it to Spark master. 
 
 We can also build custom version of Spark (e.g. compiled against Hadoop 2.x) 
 from source and deploy it to cluster manually. 
 
 But what is a proper way to use _custom version_ of Spark in _standalone 
 application_? 
 
 
 I'm currently trying to deploy custom version to local Maven repository and 
 add it to SBT project. Another option is to add Spark as local jar to every 
 project. But both of these ways look overcomplicated and in general wrong. 
 
 So what is the implied way to do it? 
 
 Thanks, 
 Andrei


Re: Historical Data as Stream

2014-05-16 Thread Soumya Simanta
A file is just a stream with a fixed length. Usually streams don't end, but in this
case it would.

On the other hand, if you read your file as a stream you may not be able to use the
entire data in the file for your analysis. Spark (given enough memory) can
process large amounts of data quickly.

 On May 15, 2014, at 9:52 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
 
 Hi,
 
 I have data in a file. Can I read it as a stream in Spark? I know it seems odd
 to read a file as a stream, but it has practical applications in real life if I
 can read it as a stream. Is there any other tool which can give this file as a
 stream to Spark, or do I have to make batches manually, which is not what I want?
 It's a column of a million values.
 
 Regards,
 Laeeq
  


Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
I've a Spark cluster with 3 worker nodes.


   - *Workers:* 3
   - *Cores:* 48 Total, 48 Used
   - *Memory:* 469.8 GB Total, 72.0 GB Used

I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB
compressed and 11GB uncompressed.
When I try to read the compressed file from HDFS it takes a while (4-5
minutes) to load it into an RDD. If I use the .cache operation it takes even
longer. Is there a way to make loading the RDD from HDFS faster?

Thanks
-Soumya


Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
Yep. I figured that out. I uncompressed the file and it looks much faster
now. Thanks.
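For reference, once the file is uncompressed (and therefore splittable), the read can also ask for a minimum number of partitions so all the cores get work; the path and partition count below are just examples:

// Splittable input can be divided across many tasks.
val data = sc.textFile("hdfs:///user/hduser/big-file.txt", 96)   // ~2 partitions per core
data.cache()
println("lines: " + data.count())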



On Sun, May 11, 2014 at 8:14 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 .gz files are not splittable hence harder to process. Easiest is to move
 to a splittable compression like lzo and break file into multiple blocks to
 be read and for subsequent processing.
 On 11 May 2014 09:01, Soumya Simanta soumya.sima...@gmail.com wrote:



 I've a Spark cluster with 3 worker nodes.


- *Workers:* 3
- *Cores:* 48 Total, 48 Used
- *Memory:* 469.8 GB Total, 72.0 GB Used

 I want to process a single compressed file (*.gz) on HDFS. The file is
 1.5GB compressed and 11GB uncompressed.
 When I try to read the compressed file from HDFS it takes a while (4-5
 minutes) to load it into an RDD. If I use the .cache operation it takes even
 longer. Is there a way to make loading the RDD from HDFS faster?

 Thanks
  -Soumya





Re: How to use spark-submit

2014-05-11 Thread Soumya Simanta

Will sbt-pack and the maven solution work for the Scala REPL? 

I need the REPL because it saves a lot of time when I'm playing with large data
sets: I load them once, cache them, and then try out things interactively
before putting the code in a standalone driver.
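Concretely, the interactive pattern I mean is just something like this (the path is a placeholder):

val events = sc.textFile("hdfs:///data/large-dataset").cache()
events.count()                                  // first action materializes the cache
events.filter(_.contains("ERROR")).count()      // later queries reuse the cached data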

I've got sbt working for my own driver program on Spark 0.9.



 On May 11, 2014, at 3:49 PM, Stephen Boesch java...@gmail.com wrote:
 
 Just discovered sbt-pack: that addresses (quite well) the last item for 
 identifying and packaging the external jars.
 
 
 2014-05-11 12:34 GMT-07:00 Stephen Boesch java...@gmail.com:
 HI Sonal,
 Yes I am working towards that same idea.  How did you go about creating 
 the non-spark-jar dependencies ?  The way I am doing it is a separate 
 straw-man project that does not include spark but has the external third 
 party jars included. Then running sbt compile:managedClasspath and reverse 
 engineering the lib jars from it.  That is obviously not ideal.
 
 The maven run will be useful for other projects built by maven: i will 
 keep in my notes.
 
 AFA sbt run-example, it requires additional libraries to be added for my 
 external dependencies.  I tried several items including  ADD_JARS,  
 --driver-class-path  and combinations of extraClassPath. I have deferred 
 that ad-hoc approach to finding a systematic one.
 
 
 
 
 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com:
 
 I am creating a jar with only my dependencies and run spark-submit through 
 my project mvn build. I have configured the mvn exec goal to the location 
 of the script. Here is how I have set it up for my app. The mainClass is my 
 driver program, and I am able to send my custom args too. Hope this helps.
 
  <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>exec-maven-plugin</artifactId>
    <executions>
      <execution>
        <goals>
          <goal>exec</goal>
        </goals>
      </execution>
    </executions>
    <configuration>
      <executable>/home/sgoyal/spark/bin/spark-submit</executable>
      <arguments>
        <argument>${jars}</argument>
        <argument>--class</argument>
        <argument>${mainClass}</argument>
        <argument>--arg</argument>
        <argument>${spark.master}</argument>
        <argument>--arg</argument>
        <argument>${my app arg 1}</argument>
        <argument>--arg</argument>
        <argument>${my arg 2}</argument>
      </arguments>
    </configuration>
  </plugin>
 
 
 Best Regards,
 Sonal
 Nube Technologies 
 
 
 
 
 
 
 On Wed, May 7, 2014 at 6:57 AM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:
 Doesnt the run-example script work for you? Also, are you on the latest 
 commit of branch-1.0 ?
 
 TD
 
 
 On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 
 
  Yes, I'm struggling with a similar problem where my classes are not found
  on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it
  if someone could provide some documentation on the usage of spark-submit.
 
 Thanks
 
  On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:
 
 
  I have a spark streaming application that uses the external streaming 
  modules (e.g. kafka, mqtt, ..) as well.  It is not clear how to 
  properly invoke the spark-submit script: what are the 
  ---driver-class-path and/or -Dspark.executor.extraClassPath parameters 
  required?
 
   For reference, the following error is proving difficult to resolve:
 
  java.lang.ClassNotFoundException: 
  org.apache.spark.streaming.examples.StreamingExamples
 
 


Caused by: java.lang.OutOfMemoryError: unable to create new native thread

2014-05-05 Thread Soumya Simanta
I just upgraded my Spark version to 1.0.0_SNAPSHOT.


commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3

Author: witgo wi...@qq.com

Date:   Fri May 2 12:40:27 2014 -0700


I'm running a standalone cluster with 3 workers.

   - *Workers:* 3
   - *Cores:* 48 Total, 0 Used
   - *Memory:* 469.8 GB Total, 0.0 B Used

However, when I try to run bin/spark-shell I get the following error after
some time, even if I don't perform any operations in the Spark shell.




14/05/05 10:20:52 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
shut down.

Exception in thread main java.lang.reflect.InvocationTargetException

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:622)

at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:256)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:54)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

*Caused by: java.lang.OutOfMemoryError: unable to create new native thread*

at java.lang.Thread.start0(Native Method)

at java.lang.Thread.start(Thread.java:679)

at java.lang.UNIXProcess$1.run(UNIXProcess.java:157)

at java.security.AccessController.doPrivileged(Native Method)

at java.lang.UNIXProcess.init(UNIXProcess.java:119)

at java.lang.ProcessImpl.start(ProcessImpl.java:81)

at java.lang.ProcessBuilder.start(ProcessBuilder.java:470)

at java.lang.Runtime.exec(Runtime.java:612)

at java.lang.Runtime.exec(Runtime.java:485)

at
scala.tools.jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:178)

at
scala.tools.jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:168)

at
scala.tools.jline.internal.TerminalLineSettings.stty(TerminalLineSettings.java:163)

at
scala.tools.jline.internal.TerminalLineSettings.get(TerminalLineSettings.java:67)

at
scala.tools.jline.internal.TerminalLineSettings.getProperty(TerminalLineSettings.java:87)

at scala.tools.jline.UnixTerminal.readVirtualKey(UnixTerminal.java:127)

at
scala.tools.jline.console.ConsoleReader.readVirtualKey(ConsoleReader.java:933)

at
org.apache.spark.repl.SparkJLineReader$JLineConsoleReader.readOneKey(SparkJLineReader.scala:54)

at
org.apache.spark.repl.SparkJLineReader.readOneKey(SparkJLineReader.scala:81)

at
scala.tools.nsc.interpreter.InteractiveReader$class.readYesOrNo(InteractiveReader.scala:29)

at
org.apache.spark.repl.SparkJLineReader.readYesOrNo(SparkJLineReader.scala:25)

at org.apache.spark.repl.SparkILoop$$anonfun$1.org
$apache$spark$repl$SparkILoop$$anonfun$$fn$1(SparkILoop.scala:576)

at
org.apache.spark.repl.SparkILoop$$anonfun$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$fn$1$1.apply$mcZ$sp(SparkILoop.scala:576)

at
scala.tools.nsc.interpreter.InteractiveReader$class.readYesOrNo(InteractiveReader.scala:32)

at
org.apache.spark.repl.SparkJLineReader.readYesOrNo(SparkJLineReader.scala:25)

at org.apache.spark.repl.SparkILoop$$anonfun$1.org
$apache$spark$repl$SparkILoop$$anonfun$$fn$1(SparkILoop.scala:576)

at
org.apache.spark.repl.SparkILoop$$anonfun$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$fn$1$1.apply$mcZ$sp(SparkILoop.scala:576)

at
scala.tools.nsc.interpreter.InteractiveReader$class.readYesOrNo(InteractiveReader.scala:32)

at
org.apache.spark.repl.SparkJLineReader.readYesOrNo(SparkJLineReader.scala:25)

at org.apache.spark.repl.SparkILoop$$anonfun$1.org
$apache$spark$repl$SparkILoop$$anonfun$$fn$1(SparkILoop.scala:576)

at
org.apache.spark.repl.SparkILoop$$anonfun$1.applyOrElse(SparkILoop.scala:579)

at
org.apache.spark.repl.SparkILoop$$anonfun$1.applyOrElse(SparkILoop.scala:566)

at
scala.runtime.AbstractPartialFunction$mcZL$sp.apply$mcZL$sp(AbstractPartialFunction.scala:33)

at
scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:33)

at
scala.runtime.AbstractPartialFunction$mcZL$sp.apply(AbstractPartialFunction.scala:25)

at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)

at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)

at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)

at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:983)

at org.apache.spark.repl.Main$.main(Main.scala:31)

at org.apache.spark.repl.Main.main(Main.scala)

... 7 more


Problem with sharing class across worker nodes using spark-shell on Spark 1.0.0

2014-05-05 Thread Soumya Simanta
Hi,

I'm trying to run a simple Spark job that uses a 3rd party class (in this
case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT

I'm starting my bin/spark-shell with the following command.

./spark-shell \
  --driver-class-path $LIBPATH/jodatime2.3/joda-convert-1.2.jar:$LIBPATH/jodatime2.3/joda-time-2.3/joda-time-2.3.jar:$LIBPATH/twitter4j-core-3.0.5.jar \
  --jars $LIBPATH/jodatime2.3/joda-convert-1.2.jar,$LIBPATH/jodatime2.3/joda-time-2.3/joda-time-2.3.jar,$LIBPATH/twitter4j-core-3.0.5.jar


My code was working fine in 0.9.1 when I used the following options that
were pointing to the same jar above.

export SPARK_CLASSPATH

export ADD_JAR


Now I'm getting a NoClassDefFoundError on each of my worker nodes

14/05/05 14:03:30 INFO TaskSetManager: Loss was due to
java.lang.NoClassDefFoundError: twitter4j/Status [duplicate 40]

14/05/05 14:03:30 INFO TaskSetManager: Starting task 0.0:26 as TID 73 on
executor 2: worker1.xxx.. (NODE_LOCAL)


What am I missing here?


Thanks

-Soumya


Re: How to use spark-submit

2014-05-05 Thread Soumya Simanta


Yes, I'm struggling with a similar problem where my classes are not found on the
worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone
could provide some documentation on the usage of spark-submit.

Thanks 

 On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com wrote:
 
 
 I have a spark streaming application that uses the external streaming modules 
 (e.g. kafka, mqtt, ..) as well.  It is not clear how to properly invoke the 
 spark-submit script: what are the ---driver-class-path and/or 
 -Dspark.executor.extraClassPath parameters required?  
 
  For reference, the following error is proving difficult to resolve:
 
 java.lang.ClassNotFoundException: 
 org.apache.spark.streaming.examples.StreamingExamples
 


Re: Announcing Spark SQL

2014-03-26 Thread Soumya Simanta
Very nice.
Any plans to make the SQL type-safe using something like Slick
(http://slick.typesafe.com/)?

Thanks !



On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.comwrote:

 Hey Everyone,

 This already went out to the dev list, but I wanted to put a pointer here
 as well to a new feature we are pretty excited about for Spark 1.0.


 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html

 Michael



Help with building and running examples with GraphX from the REPL

2014-02-25 Thread Soumya Simanta
I'm not able to run the GraphX examples from the Scala REPL. Can anyone
point me to the correct documentation that talks about the configuration
and/or how to build GraphX for the REPL?

Thanks