Re: spark parquet too many small files ?

2016-07-01 Thread nsalian
Hi Sri,

Thanks for the question.
You can simply start by doing this in the initial stage:

val sqlContext = new SQLContext(sc)
val customerList = sqlContext.read.json(args(0)).coalesce(20) // using a JSON file as an example

where the argument is the path to the file(s). This reduces the number of
partitions up front.
You can repartition the data further along the way. The goal is to end up
with fewer partitions so that the final save to Parquet produces fewer files.
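
To complete the picture, a minimal sketch of the final write (the output path
below is just a placeholder); since customerList was coalesced to 20
partitions, the Parquet output will land in roughly 20 files:

customerList.write.parquet("hdfs:///user/example/customers_parquet")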

Hope that helps.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-tp27264p27265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark-SQL with Oozie

2016-06-14 Thread nsalian
Hi,

Thanks for the question.
This would be a good starting point for your Oozie workflow application with
a Spark action.




-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-with-Oozie-tp27167p27168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread nsalian
Thanks for the question.
What kind of data rate are you expecting to receive?




-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-minimum-value-allowed-for-StreamingContext-s-Seconds-parameter-tp27007p27008.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: yarn-cluster

2016-05-04 Thread nsalian
Hi,

This is a good place to start for Spark on YARN:
https://spark.apache.org/docs/1.5.0/running-on-yarn.html

Switch to the documentation page that matches the specific Spark version you
are on.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-cluster-tp26846p26882.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error while running jar using spark-submit on another machine

2016-05-03 Thread nsalian
Thank you for the question.
What is different on this machine as compared to the ones where the job
succeeded?





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-jar-using-spark-submit-on-another-machine-tp26869p26875.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: a question about --executor-cores

2016-05-03 Thread nsalian
Hello,

Thank you for posting the question.
To begin, I have a few questions:
1) What is the size of the YARN installation? How many NodeManagers?

2) Notes to remember:

Container Virtual CPU Cores (yarn.nodemanager.resource.cpu-vcores):
the number of virtual CPU cores that can be allocated for containers.

Container Virtual CPU Cores Maximum (yarn.scheduler.maximum-allocation-vcores):
the largest number of virtual CPU cores that can be requested for a single
container.


For executor-cores:
Every Spark executor in an application has the same fixed number of cores
and same fixed heap size. The number of cores can be specified with the
--executor-cores flag when invoking spark-submit, spark-shell, and pyspark
from the command line, or by setting the spark.executor.cores property in
the spark-defaults.conf file or on a SparkConf object. 

Similarly, the heap size can be controlled with the --executor-memory flag
or the spark.executor.memory property. The cores property controls the
number of concurrent tasks an executor can run. --executor-cores 5 means
that each executor can run a maximum of five tasks at the same time. The
memory property impacts the amount of data Spark can cache, as well as the
maximum sizes of the shuffle data structures used for grouping,
aggregations, and joins.
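
As an illustration, the same settings expressed on a SparkConf object (the
values below are placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// a minimal sketch; pick values that match your cluster sizing
val conf = new SparkConf()
  .setAppName("example-app")            // hypothetical application name
  .set("spark.executor.cores", "5")     // equivalent to --executor-cores 5
  .set("spark.executor.memory", "19g")  // equivalent to --executor-memory 19G
val sc = new SparkContext(conf)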


Imagine a cluster with six nodes running NodeManagers, each equipped with 16
cores and 64GB of memory. The NodeManager capacities,
yarn.nodemanager.resource.memory-mb and
yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 =
64512 (megabytes) and 15 respectively. We avoid allocating 100% of the
resources to YARN containers because the node needs some resources to run
the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for
these system processes. Cloudera Manager helps by accounting for these and
configuring these YARN properties automatically.

The likely first impulse would be to use --num-executors 6 --executor-cores
15 --executor-memory 63G. However, this is the wrong approach because:

- 63GB + the executor memory overhead won't fit within the 63GB capacity of
  the NodeManagers.
- The application master will take up a core on one of the nodes, meaning that
  there won't be room for a 15-core executor on that node.
- 15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5
--executor-memory 19G. Why?

This config results in three executors on all nodes except for the one with
the AM, which will have two executors.
--executor-memory was derived as 63GB per node / 3 executors per node = 21GB;
the estimated memory overhead is 21 * 0.07 = 1.47GB, and 21 - 1.47 ~ 19.5GB,
which rounds down to 19GB.
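
Restating that arithmetic as a quick back-of-the-envelope calculation (this
assumes the roughly 7% memory-overhead estimate used in the post linked
below):

// 63GB of memory available to YARN per node, 3 executors per node
val executorsPerNode = 3
val memPerExecutorGB = 63.0 / executorsPerNode        // 21 GB
val overheadGB       = memPerExecutorGB * 0.07        // ~1.47 GB
val executorMemoryGB = memPerExecutorGB - overheadGB  // ~19.5 GB, round down to 19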


This is covered here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/a-question-about-executor-cores-tp26868p26874.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Creating new Spark context when running in Secure YARN fails

2016-05-03 Thread nsalian
Feel free to correct me if I am wrong.
But I believe this isn't a feature yet:
 "create a new Spark context within a single JVM process (driver)"

A few questions for you:

1) Is Kerberos set up correctly for you (the user)?
2) Could you please add the command/code you are executing?
I am checking to see whether you provide a keytab and principal in your
invocation.
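
For reference, a kerberized submission on YARN usually looks something like
this (the class name, principal, and keytab path below are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --principal someuser@EXAMPLE.COM \
  --keytab /home/someuser/someuser.keytab \
  --class com.example.MyApp myapp.jar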



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-new-Spark-context-when-running-in-Secure-YARN-fails-tp25361p26873.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: yarn-cluster

2016-05-03 Thread nsalian
Hello,

Thank you for the question.
The status UNDEFINED means the application has not completed and has not yet
been assigned resources.
Once it is assigned resources, it will progress to RUNNING and then to
SUCCEEDED upon completion.

It isn't a problem you should worry about.
Just make sure your YARN settings are tuned appropriately so the application
can get the number of containers it needs.





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-cluster-tp26846p26871.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL StructType error

2016-04-16 Thread nsalian
Hello,

I am parsing a text file and inserting the parsed values into a Hive table.

Code:

files = sc.wholeTextFiles("hdfs://nameservice1:8020/user/root/email.txt",
                          minPartitions=16, use_unicode=True)
# Putting unicode to False didn't help either


sqlContext.sql("DROP TABLE emails")
sqlContext.sql("CREATE TABLE IF NOT EXISTS emails (subject STRING, email_to
STRING, email_from STRING, date DATE, from_name STRING, to_name STRING)")
#df = sqlContext.sql("SELECT * FROM emails limit 0")

fields = [StructField("subject", StringType(), True),
  StructField("email_to", StringType(), True),
  StructField("email_to", IntegerType(), True),
  StructField("date", DateType(), True),
  StructField("from_name", StringType(), True),
  StructField("to_name", StringType(), True)]

schema = StructType(fields)


for x, v in files.collect():
    j = Parser()
    headers = j.parsestr(str(v))
    subject = str(headers['subject'])
    email_to = str(headers['to'])
    email_from = str(headers['from'])
    date = str(headers['date'])
    from_name = str(headers['x-from'])
    to_name = str(headers['x-to'])
    emaildf = sqlContext.createDataFrame(
        Row(subject, email_to, email_from, date, from_name, to_name), schema
    ).registerTempTable("emailTemp")

sc.stop()


Error:
Traceback (most recent call last):
  File "project.py", line 49, in 
emaildf =
sqlContext.createDataFrame(Row(subject,email_to,email_from,date,from_name,to_name),schema)\
  File
"/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py",
line 425, in createDataFrame
  File
"/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py",
line 350, in _createFromLocal
  File
"/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py",
line 1134, in _verify_type

TypeError: StructType can not accept object 'Hello Subject' in type 

Not sure if this is a bug or something I am doing wrong.
Any thoughts?



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-StructType-error-tp26792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HP customer support @ www.globalpccure.com/Support/Support-for-HP.aspx

2016-03-19 Thread nsalian
Please refrain from posting such messages on this email thread.
This is specific to the Spark ecosystem and not an avenue to advertise an
entity/company.

Thank you.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HP-customer-support-www-globalpccure-com-Support-Support-for-HP-aspx-tp26521p26522.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark UI documentaton needed

2016-02-22 Thread nsalian
Hi Ajay,

Feel free to open a JIRA with the fields that you think are missing and what
kind of documentation you wish to see.

It would be best to have it in a JIRA to actually track and triage your
suggestions.

Thank you.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-documentaton-needed-tp26300p26301.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Write spark eventLog to both HDFS and local FileSystem

2016-02-13 Thread nsalian
Hi,

Thanks for the question.

1) The core-site.xml holds the parameter for the defaultFS:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<namenode-host>:8020</value>
</property>

If the value of spark.eventLog.dir has no explicit filesystem scheme, it is
resolved against this default. So depending on which location you intend to
write to, you can point spark.eventLog.dir explicitly at either HDFS or the
local filesystem.
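
For example, spark-defaults.conf can point the log directory explicitly at
either filesystem (the host and paths below are placeholders):

spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://<namenode-host>:8020/user/spark/applicationHistory
# or, to write to the local filesystem instead:
# spark.eventLog.dir    file:///var/log/spark/applicationHistory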

As far as I know (feel free to correct me if I am wrong), the event log can
be written to only one location, so you pick one filesystem. A script could
then copy the logs to the local filesystem if needed.

A caveat: make sure the permissions are sound, since the user who submits a
job may not be in the correct group or may not have permission to write to
the local filesystem.

Hope that helps.




-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Write-spark-eventLog-to-both-HDFS-and-local-FileSystem-tp26203p26217.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark master takes more time with local[8] than local[1]

2016-01-25 Thread nsalian
Hi,

Thanks for the question.
Is it possible for you to elaborate on your application?
The flow of the application will help in understanding what could potentially
be causing the slowdown.

Do the logs give you any idea of what is going on? Have you had a chance to
look at them?

Thank you.



-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-master-takes-more-time-with-local-8-than-local-1-tp26052p26061.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Job History Logs for spark jobs submitted on YARN

2016-01-21 Thread nsalian
Hello,

Thanks for the question.
1) Typically the Resource Manager in YARN would print out the Aggregate
Resource Allocation for the application after you have found the specific
application using the application id.

2) As with MapReduce, there is a parameter that is set either in
spark-defaults.conf or in the application-specific configuration:
spark.eventLog.dir=hdfs://<namenode-host>:8020/user/spark/applicationHistory
This is where the Spark History Server gets its information after the
application has completed.

3) In the Spark History Server, the following tabs let you look at the
information you need:
Jobs
Stages
Storage
Environment
Executors

The Executors tab in particular gives more detailed information, such as:
Storage Memory
Disk Used

Hopefully that helps.
Thank you.




-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-History-Logs-for-spark-jobs-submitted-on-YARN-tp25946p26043.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: process of executing a program in a distributed environment without hadoop

2016-01-21 Thread nsalian
Thanks for the question.

The documentation here:
https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
lists a variety of submission techniques.
You can vary the master URL to suit your needs, whether that is local, YARN,
or Mesos.





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/process-of-executing-a-program-in-a-distributed-environment-without-hadoop-tp26015p26039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-defaults.conf optimal configuration

2015-12-08 Thread nsalian
Hi Chris,

Thank you for posting the question.
Tuning Spark configurations is a tricky task since there are a lot of factors
to consider.
The configurations that you listed cover most of them.

To understand the situation and help guide a tuning decision:
1) What kind of spark applications are you intending to run?
2) What cluster manager have you decided to go with? 
3) How frequent are these applications going to run? (For the sake of
scheduling)
4) Is this used by multiple users? 
5) What else do you have in the cluster that will interact with Spark? (For
the sake of resolving dependencies)
Personally, I would suggest answering these questions before jumping into
tuning.
A cluster manager like YARN will also help inform the settings for cores and
memory, since the applications have to be considered for scheduling.

Hope that helps to start off in the right direction.





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641p25642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ERROR Executor java.lang.NoClassDefFoundError

2015-08-13 Thread nsalian
If --jars doesn't work,

try --conf spark.executor.extraClassPath=path-to-jar 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-Executor-java-lang-NoClassDefFoundError-tp24244p24256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How does one decide no of executors/cores/memory allocation?

2015-06-17 Thread nsalian
Hello shreesh,

That would be quite a challenge to estimate.
A few things that I think should help in estimating those numbers:
1) Understanding the cost of the individual transformations in the
application.
E.g. a flatMap can be more expensive in memory than a map.

2) The communication patterns can be helpful to understand the cost. The
four types:

None:
 Map, Filter 
All-to-one:
 reduce
One-to-all:
 broadcast
All-to-all:
 reduceByKey, groupByKey, Join

3) Understanding the cost is just the beginning. Depending on how much data
you have, the partitions need to be created accordingly. More, smaller
partitions improve parallelism, but you will need a lot more executors. On
the other hand, fewer, larger partitions keep the executor count lower, but
each executor will need more memory.

To begin with, I would settle on an approach for partitioning, try a starting
number of partitions, and work from there (see the small sketch below).
Without looking at or understanding your use case, it is hard for me to give
you specific numbers. It would be better to start with a basic strategy and
optimize from there.
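
A minimal sketch of such a starting point (the path and the partition count of
200 are arbitrary placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-tuning-example"))
val data = sc.textFile("hdfs:///user/example/input")
// pick an initial partition count, then adjust based on observed task sizes
// and the number of executors available
val repartitioned = data.repartition(200)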

Hope that helps.

Thank you.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-one-decide-no-of-executors-cores-memory-allocation-tp23326p23369.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Suggestions for Posting on the User Mailing List

2015-06-16 Thread nsalian
As discussed during the meetup, the following information should help while
creating a topic on the User mailing list.

1) The versions of Spark and Hadoop should be included, to help reproduce the
issue or to understand whether the issue is a version limitation.

2) An explanation of the scenario in as much detail as possible, specifically
the purpose of the application and, if applicable, a description of the
pipeline.

3) Specific logs or stack traces for the issue that you are observing. A
simple message with the error is good, but a full stack trace adds a lot of
context and helps in abundance.

4) Any miscellaneous/additional information about the environment. This is a
broad suggestion and can cover anything from hardware and environment setup
to other factors that could possibly be responsible, etc.

Thank you.

Regards,
Neelesh.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Suggestions-for-Posting-on-the-User-Mailing-List-tp23347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread nsalian
Hello,

Is the JSON file in HDFS or on the local filesystem?
Is /home/esten/ami/usaf.json an HDFS path?

Suggestions:
1) Specify file:/home/esten/ami/usaf.json
2) Or move the usaf.json file into HDFS since the application is looking for
the file in HDFS.

Please let me know if that helps.

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-1-4-0-read-df-function-fails-tp2p23346.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark application in production without HDFS

2015-06-15 Thread nsalian
Hi,

Spark on YARN should help in the memory management for Spark jobs.
Here is a good starting point:
https://spark.apache.org/docs/latest/running-on-yarn.html
YARN integrates well with HDFS and should be a good solution for a large
cluster.
What specific features are you looking for that HDFS does not satisfy?

Thank you.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-application-in-production-without-HDFS-tp23260p23320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Issue running Spark 1.4 on Yarn

2015-06-11 Thread nsalian
Hello,

Since the other queues are fine, I reckon there may be a limit on the maximum
number of apps or on memory for this queue in particular.
I don't suspect FairScheduler limits either, but on this queue we may be
hitting a maximum.

Could you try to get the configs for the queue? That should provide more
context.

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23285.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Issue running Spark 1.4 on Yarn

2015-06-10 Thread nsalian
Hi,

Thanks for the added information. Helps add more context.

Is that specific queue different from the others?

FairScheduler.xml should have the information needed, or a separate
allocations.xml if you have one.

Something of this format:
<allocations>
  <queue name="sample_queue">
    <minResources>1 mb,0vcores</minResources>
    <maxResources>9 mb,0vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <maxAMShare>0.1</maxAMShare>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <queue name="sample_sub_queue">
      <aclSubmitApps>charlie</aclSubmitApps>
      <minResources>5000 mb,0vcores</minResources>
    </queue>
  </queue>
</allocations>

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread nsalian
1) Could you share your command?

2) Are the kafka brokers on the same host?

3) Could you run a --describe on the topic to see if the topic is set up
correctly (just to be sure)?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-ERROR-EndpointWriter-dropping-message-tp23228p23235.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread nsalian
By writing PDF files, do you mean something equivalent to a hadoop fs -put
/path?

I'm not sure how PDFBox works, though. Have you tried writing the files
individually, without Spark?

If you have established that as a starting point, we can then look at how
Spark can be interfaced to write to HDFS.

Moreover, is there a specific need to use Spark in this case?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-a-Spark-App-run-with-spark-submit-write-pdf-files-to-HDFS-tp23233p23237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread nsalian
I see the other jobs SUCCEEDED without issues.

Could you snapshot the FairScheduler activity as well? 
My guess is that, with the single core, it is reaching a NodeManager that is
still busy with other jobs, and the job ends up in a waiting state.

Does the job eventually complete?

Could you potentially add another node to the cluster to see if my guess is
right? I just see one Active NM.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23236.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to setting executor-memory to 5g.
Do check the overhead space for both the driver and the executor as per
Wilfred's suggestion.

Typically, 384 MB should suffice.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark-submit not working when application jar is in hdfs

2015-03-30 Thread nsalian
Client mode does not support fetching the application jar from HDFS.

I tried this:
sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi
--deploy-mode cluster --master yarn
hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10

And it worked.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840p22302.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread nsalian
Try running it like this:

sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi
--deploy-mode cluster --master yarn
hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10


Caveats:
1) Make sure the permissions of /user/nick are 775 or 777.
2) No need for the hostname; try hdfs:///path-to-jar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-FileNotFoundException-when-using-HDFS-in-cluster-mode-tp22287p22303.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org