Re: akka error : play framework (2.3.3) and spark (1.0.2)

2014-08-16 Thread Manu Suryavansh
Hi,

I tried the Spark(1.0.0)+Play(2.3.3) example from the Knoldus blog -
http://blog.knoldus.com/2014/06/18/play-with-spark-building-apache-spark-with-play-framework/
and
it worked for me. The project is here -
https://github.com/knoldus/Play-Spark-Scala
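
For reference, the dependency setup in that project is roughly along these
lines (a sketch from memory; treat the exact version and any exclusions as
assumptions and check the repo's build.sbt):

// build.sbt (sketch) -- the Play sbt plugin brings in Play itself,
// and Spark is added on top of it
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

One thing worth checking in your own build is which Akka ends up on the
runtime classpath: Spark 1.0.x depends on its own shaded Akka 2.2.x (under
the org.spark-project.akka groupId), while Play 2.3 pulls in Akka 2.3.x, and
mixing the two is exactly the kind of thing that surfaces as an
AbstractMethodError.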

Regards,
Manu


On Sat, Aug 16, 2014 at 11:04 PM, Sujee Maniyam  wrote:

> Hi
>
> I am trying to connect to Spark from Play framework. Getting the following
> Akka error...
>
> [ERROR] [08/16/2014 17:12:05.249] [spark-akka.actor.default-dispatcher-3] 
> [ActorSystem(spark)] Uncaught fatal error from thread 
> [spark-akka.actor.default-dispatcher-3] shutting down ActorSystem [spark]
>
> java.lang.AbstractMethodError
>   at 
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>
>   at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>   at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> full stack trace : https://gist.github.com/sujee/ff14fd602b76314e693d
>
> source code here : https://github.com/sujee/play-spark-test
>
> I have also found this thread mentioning the Akka incompatibility: "How to
> run Play 2.2.x with Akka 2.3.x?"
> 
>
> Stack overflow thread :
> http://stackoverflow.com/questions/25346657/akka-error-play-framework-2-3-3-and-spark-1-0-2
>
> any suggestions?
>
> thanks!
>
> Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam
> )
>



-- 
Manu Suryavansh


akka error : play framework (2.3.3) and spark (1.0.2)

2014-08-16 Thread Sujee Maniyam
Hi

I am trying to connect to Spark from Play framework. Getting the following
Akka error...

[ERROR] [08/16/2014 17:12:05.249]
[spark-akka.actor.default-dispatcher-3] [ActorSystem(spark)] Uncaught
fatal error from thread [spark-akka.actor.default-dispatcher-3]
shutting down ActorSystem [spark]
java.lang.AbstractMethodError
  at 
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
  at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
  at akka.actor.ActorCell.terminate(ActorCell.scala:369)
  at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
  at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
  at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


full stack trace : https://gist.github.com/sujee/ff14fd602b76314e693d

source code here : https://github.com/sujee/play-spark-test

I have also found this thread mentioning the Akka incompatibility: "How to
run Play 2.2.x with Akka 2.3.x?"


Stack overflow thread :
http://stackoverflow.com/questions/25346657/akka-error-play-framework-2-3-3-and-spark-1-0-2

any suggestions?

thanks!

Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam )


Re: Program without doing assembly

2014-08-16 Thread Josh Rosen
If you want to speed up your local development / testing workflow, check
out the "Reducing Build Times" section in the Spark Wiki:

https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
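
For what it's worth, the two tips from that page that helped me most (I'm
paraphrasing from memory, so please double-check the wiki for the exact
wording) are running sbt in continuous mode so it only recompiles what
changed, and skipping the assembly jar during development:

# keep sbt running and recompile incrementally on every change
sbt/sbt ~compile

# let the bin/ scripts pick up the freshly compiled classes instead of the
# assembly jar (rebuild the assembly when you need a "real" deployment)
export SPARK_PREPEND_CLASSES=true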


On Sat, Aug 16, 2014 at 10:56 PM, Deep Pradhan 
wrote:

> Hi,
> I am just playing around with the code in Spark.
> I am printing out some statements in the code that comes with Spark so as
> to see how it behaves.
> Every time I change/add something to the code I have to run the command
>
> *SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly*
>
> which is tiresome at times.
> Is there any way to try out the code, or even add a new file to the
> examples directory, without having to run sbt/sbt assembly every time?
> Please tell me for a single-node as well as a multi-node cluster.
>
> Thank You
>


Program without doing assembly

2014-08-16 Thread Deep Pradhan
Hi,
I am just playing around with the code in Spark.
I am printing out some statements in the code that comes with Spark so as
to see how it behaves.
Every time I change/add something to the code I have to run the command

*SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly*

which is tiresome at times.
Is there any way to try out the code, or even add a new file to the
examples directory, without having to run sbt/sbt assembly every time?
Please tell me for a single-node as well as a multi-node cluster.

Thank You


Re: Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Tushar Khairnar
I am also trying to run on Windows and will post once I am able to launch it.

My guess is that "by hand" means manually forming the java command, i.e.
the classpath and Java options, and then appending the right class name for
the worker or master.

The Spark scripts follow a hierarchy: start-master or start-workers calls
start-daemon, which calls start-class, each one building up the command
line and environment variables. I saw equivalent Windows scripts in the
latest build while I was running it on Linux.
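
If it helps, the classes those scripts ultimately launch are
org.apache.spark.deploy.master.Master and
org.apache.spark.deploy.worker.Worker, so something along these lines should
work (a sketch only -- I haven't verified the exact .cmd wrapper name on
1.0.x, so treat the script path as an assumption):

rem start the master; it prints the spark://host:7077 URL it listens on
bin\spark-class.cmd org.apache.spark.deploy.master.Master

rem start a worker and point it at that master URL
bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://<master-host>:7077

You can then pass the same spark://<master-host>:7077 URL as the master when
running the examples.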

Regards,
Tushar
On Aug 17, 2014 8:03 AM, "Manu Suryavansh" 
wrote:

> Hi,
>
> I have built spark-1.0.0 on Windows using Java 7/8 and I have been able to
> run several examples - here are my notes -
> http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html
> on how to build from source and run examples in spark shell.
>
>
> Regards,
> Manu
>
>
> On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis 
> wrote:
>
>> I want to look at porting a Hadoop problem to Spark - eventually I want
> >> to run on a Hadoop 2.0 cluster, but while I am learning and porting I want
> >> to run small problems on my Windows box.
>> I installed scala and sbt.
> >> I downloaded Spark, and in the Spark directory I can run
>> mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
>> which succeeds
>> I tried
>> sbt/sbt assembly
>> which fails with errors
>>
>> In the documentation
>> it says
>>
>> *Note:* The launch scripts do not currently support Windows. To run a
>> Spark cluster on Windows, start the master and workers by hand.
>> with no indication of how to do this.
>>
>> I can build and run samples (say JavaWordCount)  to the point where they
>> fail because a master cannot be found (none is running)
>>
>> I want to know how to get a spark master and a slave or two running on my
>> windows box so I can look at the samples and start playing with Spark
>>
> >> Does anyone have a Windows instance running??
> >> Please DON'T SAY I SHOULD RUN LINUX! If it is supposed to work on
> >> Windows, someone should have tested it and be willing to state how.
>>
>>
>>
>>
>
>
> --
> Manu Suryavansh
>


Re: How to implement multinomial logistic regression (softmax regression) in Spark?

2014-08-16 Thread Cui xp
Hi DB,
Thanks for your reply. I saw the slides on SlideShare, and I am studying
them. But one link on the page, which is
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16579/consoleFull
reports ERROR 404 NOT FOUND.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12244.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Manu Suryavansh
Hi,

I have built spark-1.0.0 on Windows using Java 7/8 and I have been able to
run several examples - here are my notes -
http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html
on how to build from source and run examples in spark shell.


Regards,
Manu


On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis  wrote:

> I want to look at porting a Hadoop problem to Spark - eventually I want to
> run on a Hadoop 2.0 cluster, but while I am learning and porting I want to
> run small problems on my Windows box.
> I installed scala and sbt.
> I downloaded Spark, and in the Spark directory I can run
> mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
> which succeeds
> I tried
> sbt/sbt assembly
> which fails with errors
>
> In the documentation
> it says
>
> *Note:* The launch scripts do not currently support Windows. To run a
> Spark cluster on Windows, start the master and workers by hand.
> with no indication of how to do this.
>
> I can build and run samples (say JavaWordCount)  to the point where they
> fail because a master cannot be found (none is running)
>
> I want to know how to get a spark master and a slave or two running on my
> windows box so I can look at the samples and start playing with Spark
>
> Does anyone have a Windows instance running??
> Please DON'T SAY I SHOULD RUN LINUX! If it is supposed to work on Windows,
> someone should have tested it and be willing to state how.
>
>
>
>


-- 
Manu Suryavansh


s3:// sequence file startup time

2014-08-16 Thread kmatzen
I have some RDDs stored as s3://-backed sequence files sharded into 1000
parts. The startup time is pretty long (on the order of tens of minutes). It's
communicating with S3, but I don't know what it's doing.  Is it just
fetching the metadata from S3 for each part?  Is there a way to pipeline
this with the computation?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/s3-sequence-file-startup-time-tp12242.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: iterating with index in pyspark

2014-08-16 Thread Chengi Liu
nevermind folks!!!


On Sat, Aug 16, 2014 at 2:22 PM, Chengi Liu  wrote:

> Hi,
>   I have data like following:
>
> 1,2,3,4
> 1,2,3,4
> 5,6,2,1
>
> and so on..
> I would like to create a new rdd as follows:
> (0,0,1)
> (0,1,2)
> (0,2,3)
> (0,3,4)
> (1,0,1)
> .. and so on..
> How do i do this?
> Thanks
>


Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
Hi Brandon,

Looks very cool...will try it out for ad-hoc analysis of our datasets and
provide more feedback...

Could you please give a bit more detail about how the Spindle architecture
compares to the Hue + Spark integration (Python stack) and the Ooyala
jobserver?

Does Spindle allow sharing of a Spark context across multiple Spark jobs,
like the jobserver does?

Thanks.
Deb


On Sat, Aug 16, 2014 at 12:19 AM, Matei Zaharia 
wrote:

> Thanks for sharing this, Brandon! Looks like a great architecture for
> people to build on.
>
> Matei
>
> On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote:
>
> Hi Spark community,
>
> At Adobe Research, we're happy to open source a prototype
> technology called Spindle we've been developing over
> the past few months for processing analytics queries with Spark.
> Please take a look at the repository on GitHub at
> https://github.com/adobe-research/spindle,
> and we welcome any feedback. Thanks!
>
> Regards,
> Brandon.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


iterating with index in pyspark

2014-08-16 Thread Chengi Liu
Hi,
  I have data like following:

1,2,3,4
1,2,3,4
5,6,2,1

and so on..
I would like to create a new rdd as follows:
(0,0,1)
(0,1,2)
(0,2,3)
(0,3,4)
(1,0,1)
.. and so on..
How do i do this?
Thanks


Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Steve Lewis
I want to look at porting a Hadoop problem to Spark - eventually I want to
run on a Hadoop 2.0 cluster, but while I am learning and porting I want to
run small problems on my Windows box.
I installed scala and sbt.
I downloaded Spark, and in the Spark directory I can run
mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
which succeeds
I tried
sbt/sbt assembly
which fails with errors

In the documentation
it says

*Note:* The launch scripts do not currently support Windows. To run a Spark
cluster on Windows, start the master and workers by hand.
with no indication of how to do this.

I can build and run samples (say JavaWordCount)  to the point where they
fail because a master cannot be found (none is running)

I want to know how to get a spark master and a slave or two running on my
windows box so I can look at the samples and start playing with Spark

Does anyone have a Windows instance running??
Please DON'T SAY I SHOULD RUN LINUX! If it is supposed to work on Windows,
someone should have tested it and be willing to state how.


kryo out of buffer exception

2014-08-16 Thread Mohit Jaggi
Hi All,
I was doing a groupBy, and apparently some keys were very frequent, making
the serializer fail with a buffer overflow exception. I did not need a
groupBy, so I switched to combineByKey in this case, but I would like to
know how to increase the Kryo buffer sizes to avoid this error. I hope there
is a way to grow the buffers dynamically based on the size of the data.
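
For reference, this is roughly what I am setting at the moment (the buffer
value is just a guess on my part, and I'm not sure this is the recommended
knob):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // per-object serialization buffer, in MB; bumped up from the small default
  .set("spark.kryoserializer.buffer.mb", "64")
val sc = new SparkContext(conf)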

Mohit.


Re: Does HiveContext support Parquet?

2014-08-16 Thread Michael Armbrust
>
> Hi to all, sorry for not being fully on topic but I have 2 quick questions
> about Parquet tables registered in Hive/sparq:
>
Using HiveQL to CREATE TABLE will add a table to the metastore / warehouse
exactly as it would in hive.  Registering is a purely temporary operation
that lives with the HiveContext.  In 1.1 we have renamed this function to
registerTempTable to make this more clear.
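
To make the distinction concrete, a rough sketch (API names as of 1.0.x; the
table name, schema and path below are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Goes through the metastore, so it is persistent and visible to other
// HiveContexts and to Hive itself:
hiveCtx.hql("CREATE TABLE IF NOT EXISTS logs (id BIGINT, msg STRING)")

// Purely temporary: the name lives only inside this HiveContext
// (registerTempTable in 1.1):
val parquetData = hiveCtx.parquetFile("/path/to/existing.parquet")
parquetData.registerAsTable("logs_tmp")
hiveCtx.hql("SELECT COUNT(*) FROM logs_tmp").collect()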

> 2) If I have multiple hiveContexts (one per application) using the same
> parquet table, is there any problem if inserting concurrently from all
> applications?
>
This is not supported.


RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
If you're using HiveContext then all metadata is in the Hive metastore as 
defined in hive-site.xml.

Concurrent writes should be fine as long as you're using a concurrent metastore 
db.

From: Flavio Pompermaier
Sent: ‎8/‎16/‎2014 1:26 PM
To: u...@spark.incubator.apache.org
Subject: RE: Does HiveContext support Parquet?


Hi to all, sorry for not being fully on topic but I have 2 quick questions 
about Parquet tables registered in Hive/sparq:

1) where are the created tables stored?
2) If I have multiple hiveContexts (one per application) using the same parquet 
table, is there any problem if inserting concurrently from all applications?

Best,
FP

On Aug 16, 2014 5:29 PM, "lyc" <yanchen@huawei.com> wrote:
Thanks for your help.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12231.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming - lambda architecture

2014-08-16 Thread Jörn Franke
Hi,
Maybe this helps you. For the "speed" layer, I think something like complex
event processing, as it is (to some extent) supported by Spark Streaming,
can make sense. You process the events as they come in and store them
afterwards. The Spark Streaming web page gives a nice example: trend
analysis. You detect the current trend from a lot of incoming events with
Spark Streaming and compare it to historical trends (stored in Spark, which
I would compare here to a serving layer). You can use this information to
see whether customers are happier (or angrier) with your new products than
usual.
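
A very rough sketch of what that speed-layer piece could look like (every
name, source and path here is made up for illustration, so don't read it as
a recommendation for your exact pipeline):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(10))

// stand-in for the real Kafka stream of product events
val events = ssc.socketTextStream("localhost", 9999)

// current "trend": event counts over a sliding 5-minute window
val trends = events
  .map(event => (event, 1L))
  .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(10))

// hand each windowed view off to the batch/serving side for comparison
trends.foreachRDD((rdd, time) =>
  rdd.saveAsTextFile("hdfs:///tmp/trends/" + time.milliseconds))

ssc.start()
ssc.awaitTermination()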


I hope it helps. Your use case sounds very technical; it would be
interesting if you could share the business use case.

Best regards,

Jörn


On Fri, Aug 15, 2014 at 5:25 AM, salemi  wrote:

> Below is what I understand by lambda architecture: the batch layer
> provides the historical data and the speed layer provides the real-time
> view!
>
> All data entering the system is dispatched to both the batch layer and the
> speed layer for processing.
> The batch layer has two functions:
> (i) managing the master dataset (an immutable, append-only set of raw
> data),
> and
> (ii) to pre-compute the batch views.
>
> The speed layer compensates for the high latency of updates to the serving
> layer and deals with recent data only.
>
> The serving layer indexes the batch views so that they can be queried in
> low-latency, ad-hoc way.
>
> Any incoming query can be answered by merging results from batch views and
> real-time views.
>
> In my system I have events coming in from Kafka sources and currently we
> need to process 10,000 messages per second and write them out to hdfs and
> make them available to be queried by a serving layer.
>
> What would be your suggestion for architecting a solution to this, and
> roughly what would be needed for the proposed architecture?
>
> Thanks,
> Ali
>
>
> Tathagata Das wrote
> > Can you be a bit more specific about what you mean by lambda
> architecture?
> >
> >
> > On Thu, Aug 14, 2014 at 2:27 PM, salemi <
>
> > alireza.salemi@
>
> > > wrote:
> >
> >> Hi,
> >>
> >> How would you implement the batch layer of the lambda architecture with
> >> spark/spark streaming?
> >>
> >> Thanks,
> >> Ali
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail:
>
> > user-unsubscribe@.apache
>
> >> For additional commands, e-mail:
>
> > user-help@.apache
>
> >>
> >>
>
>
> Tathagata Das wrote
> > Can you be a bit more specific about what you mean by lambda
> architecture?
> >
> >
> > On Thu, Aug 14, 2014 at 2:27 PM, salemi <
>
> > alireza.salemi@
>
> > > wrote:
> >
> >> Hi,
> >>
> >> How would you implement the batch layer of the lambda architecture with
> >> spark/spark streaming?
> >>
> >> Thanks,
> >> Ali
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail:
>
> > user-unsubscribe@.apache
>
> >> For additional commands, e-mail:
>
> > user-help@.apache
>
> >>
> >>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142p12163.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: Does HiveContext support Parquet?

2014-08-16 Thread Flavio Pompermaier
Hi to all, sorry for not being fully on topic but I have 2 quick questions
about Parquet tables registered in Hive/sparq:

1) where are the created tables stored?
2) If I have multiple hiveContexts (one per application) using the same
parquet table, is there any problem if inserting concurrently from all
applications?

Best,
FP
On Aug 16, 2014 5:29 PM, "lyc"  wrote:

> Thanks for your help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12231.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Issue with Spark on EC2 using spark-ec2 script

2014-08-16 Thread rkishore999
I'm also running into the same issue and am blocked here. Were any of you
able to get past this issue? I tried using both ephemeral-hdfs and
persistent-hdfs, and I'm getting the same issue.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-Spark-on-EC2-using-spark-ec2-script-tp11088p12232.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Does HiveContext support Parquet?

2014-08-16 Thread lyc
Thanks for your help.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12231.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
There's really nothing special besides including that jar on your classpath. 
You just do selects, inserts, etc as you normally would.

The same instructions here apply 
https://cwiki.apache.org/confluence/display/Hive/Parquet
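
For example, something along these lines (illustrative only -- this assumes
a HiveContext named hiveContext and an existing text-backed table
events_text, and the SerDe/input/output format class names are the
pre-Hive-0.13 ones from that wiki page, so please verify them against the
Hive and parquet versions your build actually bundles):

hiveContext.hql("""
  CREATE TABLE events_parquet (id BIGINT, payload STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
""")

hiveContext.hql("INSERT OVERWRITE TABLE events_parquet SELECT id, payload FROM events_text")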


From: lyc
Sent: ‎8/‎16/‎2014 12:56 AM
To: u...@spark.incubator.apache.org
Subject: Re: Does HiveContext support Parquet?

Thank you for your reply.

Do you know where I can find some detailed information about how to use
Parquet in HiveContext?

Any information is appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running Spark shell on YARN

2014-08-16 Thread Soumya Simanta
I followed this thread

http://apache-spark-user-list.1001560.n3.nabble.com/YARN-issues-with-resourcemanager-scheduler-address-td5201.html#a5258

to set SPARK_YARN_USER_ENV  to HADOOP_CONF_DIR
export SPARK_YARN_USER_ENV="CLASSPATH=$HADOOP_CONF_DIR"

and used the following command to share conf directories on all machines.

export SPARK_YARN_DIST_FILES=$(ls $HADOOP_CONF_DIR* | sed 's#^#file://#g'
|tr '\n' ',' )

and then I used the following command to start spark-shell

./spark-shell --master yarn-client --executor-memory 32g

This time I didn't get the "14/08/15 15:44:51 INFO
cluster.YarnClientSchedulerBackend: Application report from ASM:" errors,
but a new exception (java.net.URISyntaxException, see below). Any idea why
this is happening? Also, although I see the REPL prompt, sc is not
available in the REPL.

14/08/16 02:27:52 INFO yarn.Client: Uploading
file:/usr/lib/spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563/lib/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar
to
hdfs://n001-10ge1:8020/user/ssimanta/.sparkStaging/application_1408130563059_0011/spark-assembly-1.0.1.2.1.3.0-563-hadoop2.4.0.2.1.3.0-563.jar

*java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected
scheme-specific part at index 5: conf:*

at org.apache.hadoop.fs.Path.initialize(Path.java:206)

at org.apache.hadoop.fs.Path.<init>(Path.java:172)

at org.apache.hadoop.fs.Path.<init>(Path.java:94)

at org.apache.spark.deploy.yarn.ClientBase$class.org
$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:161)

at
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:238)

at
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:233)

at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

at
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:233)

at
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:231)

at scala.collection.immutable.List.foreach(List.scala:318)

at
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:231)

at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)

at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)

at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)

at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136)

at org.apache.spark.SparkContext.<init>(SparkContext.scala:318)

at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)

at $iwC$$iwC.<init>(<console>:8)

at $iwC.<init>(<console>:14)

at <init>(<console>:16)

at .<init>(<console>:20)

at .<clinit>(<console>)

at .<init>(<console>:7)

at .<clinit>(<console>)

at $print(<console>)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)

at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)

at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)

at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)

at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)

at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)

at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)

at
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)

at
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)

at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)

at
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)

at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)

at
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)

at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)

at
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)

at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)

at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)

at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

at org.apache.spark.repl.Spa

Re: Running Spark shell on YARN

2014-08-16 Thread Eric Friedman
+1 for such a document. 


Eric Friedman

> On Aug 15, 2014, at 1:10 PM, Kevin Markey  wrote:
> 
> Sandy and others:
> 
> Is there a single source of Yarn/Hadoop properties that should be set or 
> reset for running Spark on Yarn?
> We've sort of stumbled through one property after another, and (unless 
> there's an update I've not yet seen) CDH5 Spark-related properties are for 
> running the Spark Master instead of Yarn.
> 
> Thanks
> Kevin
> 
>> On 08/15/2014 12:47 PM, Sandy Ryza wrote:
>> We generally recommend setting yarn.scheduler.maximum-allocation-mb to the 
>> maximum node capacity.
>> 
>> -Sandy
>> 
>> 
>>> On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta  
>>> wrote:
>>> I just checked the YARN config and it looks like I need to change this value. 
>>> Should it be upgraded to 48G (the max memory allocated to YARN) per node? 
>>> 
>>> 
>>> <name>yarn.scheduler.maximum-allocation-mb</name>
>>> <value>6144</value>
>>> <source>java.io.BufferedInputStream@2e7e1ee</source>
>>> 
>>> 
>>> 
 On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta  
 wrote:
 Andrew, 
 
 Thanks for your response. 
 
 When I try to do the following. 
  ./spark-shell --executor-memory 46g --master yarn
 
 I get the following error. 
 
 Exception in thread "main" java.lang.Exception: When running with master 
 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the 
 environment.
 
 at 
 org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
 
 at 
 org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
 
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
 
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 After this I set the following env variable. 
 
 export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/
 
 The program launches but then halts with the following error. 
 
 
 14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), 
 is above the max threshold (6144 MB) of this cluster.
 
 I guess this is some YARN setting that is not set correctly. 
 
 Thanks
 
 -Soumya
 
 
 
> On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or  wrote:
> Hi Soumya,
> 
> The driver's console output prints out how much memory is actually 
> granted to each executor, so from there you can verify how much memory 
> the executors are actually getting. You should use the 
> '--executor-memory' argument in spark-shell. For instance, assuming each 
> node has 48G of memory,
> 
> bin/spark-shell --executor-memory 46g --master yarn
> 
> We leave a small cushion for the OS so we don't take up all of the entire 
> system's memory. This option also applies to the standalone mode you've 
> been using, but if you have been using the ec2 scripts, we set 
> "spark.executor.memory" in conf/spark-defaults.conf for you automatically 
> so you don't have to specify it each time on the command line. Of course, 
> you can also do the same in YARN.
> 
> -Andrew
> 
> 
> 
> 2014-08-15 10:45 GMT-07:00 Soumya Simanta :
> 
>> I've been using the standalone cluster all this time and it worked fine. 
>> Recently I'm using another Spark cluster that is based on YARN and I've 
>> not experience with YARN. 
>> 
>> The YARN cluster has 10 nodes and a total memory of 480G. 
>> 
>> I'm having trouble starting the spark-shell with enough memory. 
>> I'm doing a very simple operation - reading a file 100GB from HDFS and 
>> running a count on it. This fails due to out of memory on the executors. 
>> 
>> Can someone point me to the command-line parameters that I should use for 
>> spark-shell so that this works?
>> 
>> 
>> Thanks
>> -Soumya
> 
> - To 
> unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org  


Re: Open source project: Deploy Spark to a cluster with Puppet and Fabric.

2014-08-16 Thread Nicholas Chammas
Hey Brandon,

Thank you for sharing this.

What is the relationship of this project to the spark-ec2 tool that comes
with Spark? Does it provide a superset of the functionality of spark-ec2?

Nick


On Wednesday, August 13, 2014, bdamos wrote:

> Hi Spark community,
>
> We're excited about Spark at Adobe Research and have
> just open sourced a project we use to automatically provision
> a Spark cluster and submit applications.
> The project is on GitHub, and we're happy for any feedback
> from the community:
> https://github.com/adobe-research/spark-cluster-deployment
>
> Regards,
> Brandon.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Open-source-project-Deploy-Spark-to-a-cluster-with-Puppet-and-Fabric-tp12057.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-16 Thread Mayur Rustagi
Quite a good question. I assume you know the size of the cluster going in;
then you can essentially try to partition the data into some multiple of
that and use a RangePartitioner to split the data roughly equally. Default
partitions are created based on the number of blocks on the filesystem, and
hence the overhead of scheduling so many tasks mostly kills the performance.

import org.apache.spark.RangePartitioner

val file = sc.textFile("hdfs:///path/to/input")   // substitute your input path
val partitionedFile = file.map(x => (x, 1))       // RangePartitioner needs key/value pairs
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
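
For the "how many partitions" part of the question below, one rough
heuristic (my own rule of thumb, not an official one) is to derive it from
the input size, aiming for partitions of roughly 128 MB and never fewer than
a few per core. The path and core count here are placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = new Path("hdfs:///data/input")   // placeholder
val totalCores = 40                              // executors * cores per executor

val inputBytes = FileSystem.get(sc.hadoopConfiguration)
  .getContentSummary(inputPath).getLength
val targetPartitionBytes = 128L * 1024 * 1024
val numPartitions = math.max((inputBytes / targetPartitionBytes).toInt + 1, 3 * totalCores)

val data = sc.textFile(inputPath.toString).repartition(numPartitions)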




Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Sat, Aug 16, 2014 at 7:04 AM, abhiguruvayya 
wrote:

> My use case as mentioned below.
>
> 1. Read input data from local file system using sparkContext.textFile(input
> path).
> 2. Partition the input data (80 million records) using
> RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer
> function. Without using coalesce() or repartition() on the input data, Spark
> executes really slowly and fails with an out-of-memory exception.
>
> The issue I am facing here is in deciding the number of partitions to be
> applied to the input data. *The input data size varies every time, and
> hard-coding a particular value is not an option. Spark performs really well
> only when a near-optimal number of partitions is applied to the input data,
> which I currently find through lots of iteration (trial and error). That is
> not an option in a production environment.*
>
> My question: Is there a thumb rule to decide the number of partitions
> required depending on the input data size and cluster resources
> available(executors,cores, etc...)? If yes please point me in that
> direction. Any help  is much appreciated.
>
> I am using spark 1.0 on yarn.
>
> Thanks,
> AG
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Updating existing JSON files

2014-08-16 Thread Sean Owen
If you mean you want to overwrite the file in-place while you're
reading it, no you can't do that with HDFS. That would be dicey on any
file system. If you just want to append to the file, yes HDFS supports
appends. I am pretty certain Spark does not have a concept that maps
to appending, though I suppose you can put just about anything you
like in a function, including manually reading, computing and
appending to an HDFS file.

I think it will be far easier to write different output files and then
after overwrite the originals with them.
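
Something along these lines, for example (the paths and the "derivation" are
placeholders, not a recommendation for your actual schema):

import org.apache.hadoop.fs.{FileSystem, Path}

// read, compute the derived data, and write to a temporary location
val raw = sqlContext.jsonFile("hdfs:///data/events")
val enriched = raw.map(_.mkString(","))   // placeholder: really you'd add derived fields and re-serialize to JSON
enriched.saveAsTextFile("hdfs:///data/events_tmp")

// then swap the new output in place of the originals
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("hdfs:///data/events"), true)
fs.rename(new Path("hdfs:///data/events_tmp"), new Path("hdfs:///data/events"))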

On Sat, Aug 16, 2014 at 12:53 AM, ejb11235  wrote:
> I have a bunch of JSON files stored in HDFS that I want to read in, modify,
> and write back out. I'm new to all this and am not sure if this is even the
> right thing to do.
>
> Basically, my JSON files contain my raw data, and I want to calculate some
> derived data and add it to the existing data.
>
> So first, is my basic approach to the problem flawed? Should I be placing
> derived data somewhere else?
>
> If not, how to I modify the existing JSON files?
>
> Note: I have been able to read the JSON files into an RDD using
> sqlContext.jsonFile, and save them back using RDD.saveAsTextFile(). But this
> creates new files. Is there a way to overwrite the original files?
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Updating-exising-JSON-files-tp12211.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark on yarn cluster can't launch

2014-08-16 Thread centerqi hu
2014-08-16 15:53 GMT+08:00 Sandy Ryza :

> an occur if the queue do


Thank you for your feedback. My version of Spark is 1.0.2, with Hadoop 2.4.1.
I found the relevant code:

val queueInfo: QueueInfo = super.getQueueInfo(args.amQueue)
logInfo("""Queue info ... queueName: %s, queueCurrentCapacity: %s, queueMaxCapacity: %s,
  queueApplicationCount = %s, queueChildQueueCount = %s""".format(
    queueInfo.getQueueName,
    queueInfo.getCurrentCapacity,
    queueInfo.getMaximumCapacity,
    queueInfo.getApplications.size,
    queueInfo.getChildQueues.size))



If I specify the queue name, the error disappears.


for example

--queue sls_queue_1




../bin/spark-submit --class org.apache.spark.examples.JavaWordCount \
--master yarn \
--deploy-mode cluster \
--queue sls_queue_1 \
--verbose \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 3 \
../lib/spark-examples*.jar \
/user/www/tmp/audi/*


However, setting the queue in spark-env.sh does not take effect:

export HADOOP_CONF_DIR=/usr/local/webserver/hadoop-2.4.1/etc/hadoop/
export HADOOP_HOME=/usr/local/webserver/hadoop-2.4.1/
export JAVA_HOME=/usr/local/webserver/jdk1.7.0_67/
export SPARK_YARN_QUEUE=sls_queue_1
export YARN_CONF_DIR=$HADOOP_CONF_DIR


-- 
cente...@gmail.com|齐忠


Re: spark on yarn cluster can't launch

2014-08-16 Thread Sandy Ryza
On closer look, it seems like this can occur if the queue doesn't exist.
 Filed https://issues.apache.org/jira/browse/SPARK-3082.

-Sandy


On Sat, Aug 16, 2014 at 12:49 AM, Sandy Ryza 
wrote:

> Hi,
>
> Do you know what YARN scheduler you're using and what version of YARN?  It
> seems like this would be caused by YarnClient.getQueueInfo returning null,
> though, from browsing the YARN code, I'm not sure how this could happen.
>
> -Sandy
>
>
> On Fri, Aug 15, 2014 at 11:23 AM, Andrew Or  wrote:
>
>> Hi 齐忠,
>>
>> Thanks for reporting this. You're correct that the default deploy mode is
>> "client". However, this seems to be a bug in the YARN integration code; we
>> should not throw null pointer exception in any case. What version of Spark
>> are you using?
>>
>> Andrew
>>
>>
>> 2014-08-15 0:23 GMT-07:00 centerqi hu :
>>
>> The code does not run as follows
>>>
>>> ../bin/spark-submit --class org.apache.spark.examples.SparkPi \
>>>
>>> --master yarn \
>>>
>>> --deploy-mode cluster \
>>>
>>> --verbose \
>>>
>>> --num-executors 3 \
>>>
>>> --driver-memory 4g \
>>>
>>> --executor-memory 2g \
>>>
>>> --executor-cores 1 \
>>>
>>> ../lib/spark-examples*.jar \
>>>
>>> 100
>>>
>>> Exception in thread "main" java.lang.NullPointerException
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:109)
>>>
>>> at
>>> org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:108)
>>>
>>> at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
>>>
>>>
>>> However, when I removed "--deploy-mode cluster \"
>>>
>>> Exception disappear.
>>>
>>> I think with the "deploy-mode cluster" is running in yarn cluster mode,
>>> if not, the default will be run in yarn client mode.
>>>
>>> But why did yarn cluster get Exception?
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>> --
>>> cente...@gmail.com|齐忠
>>>
>>
>>
>


Re: spark on yarn cluster can't launch

2014-08-16 Thread Sandy Ryza
Hi,

Do you know what YARN scheduler you're using and what version of YARN?  It
seems like this would be caused by YarnClient.getQueueInfo returning null,
though, from browsing the YARN code, I'm not sure how this could happen.

-Sandy


On Fri, Aug 15, 2014 at 11:23 AM, Andrew Or  wrote:

> Hi 齐忠,
>
> Thanks for reporting this. You're correct that the default deploy mode is
> "client". However, this seems to be a bug in the YARN integration code; we
> should not throw null pointer exception in any case. What version of Spark
> are you using?
>
> Andrew
>
>
> 2014-08-15 0:23 GMT-07:00 centerqi hu :
>
> The code does not run as follows
>>
>> ../bin/spark-submit --class org.apache.spark.examples.SparkPi \
>>
>> --master yarn \
>>
>> --deploy-mode cluster \
>>
>> --verbose \
>>
>> --num-executors 3 \
>>
>> --driver-memory 4g \
>>
>> --executor-memory 2g \
>>
>> --executor-cores 1 \
>>
>> ../lib/spark-examples*.jar \
>>
>> 100
>>
>> Exception in thread "main" java.lang.NullPointerException
>>
>> at
>> org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:109)
>>
>> at
>> org.apache.spark.deploy.yarn.Client$anonfun$logClusterResourceDetails$2.apply(Client.scala:108)
>>
>> at org.apache.spark.Logging$class.logInfo(Logging.scala:58)
>>
>>
>> However, when I removed "--deploy-mode cluster \"
>>
>> Exception disappear.
>>
>> I think with the "deploy-mode cluster" is running in yarn cluster mode,
>> if not, the default will be run in yarn client mode.
>>
>> But why did yarn cluster get Exception?
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> --
>> cente...@gmail.com|齐忠
>>
>
>


Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Matei Zaharia
Thanks for sharing this, Brandon! Looks like a great architecture for people to 
build on.

Matei

On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote:

Hi Spark community, 

At Adobe Research, we're happy to open source a prototype 
technology called Spindle we've been developing over 
the past few months for processing analytics queries with Spark. 
Please take a look at the repository on GitHub at 
https://github.com/adobe-research/spindle, 
and we welcome any feedback. Thanks! 

Regards, 
Brandon. 



-- 
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html
 
Sent from the Apache Spark User List mailing list archive at Nabble.com. 

- 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
For additional commands, e-mail: user-h...@spark.apache.org