Hi Sri,
Thanks for the question.
You can simply start by doing this in the initial stage:
val sqlContext = new SQLContext(sc)
val customerList = sqlContext.read.json(args(0)).coalesce(20) // using a JSON file as an example
where the argument is the path to the file(s). This will reduce the
partitions.
Hi,
Thanks for the question.
This would be a good starting point for your Oozie workflow application with
a Spark action.
-
Neelesh S. Salian
Cloudera
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-with-Oozie-tp27167p27168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Thanks for the question.
What kind of data rate are you expecting to receive?
-
Neelesh S. Salian
Cloudera
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-minimum-value-allowed-for-StreamingContext-s-Seconds-parameter-tp27007p27008.html
Hi,
This is a good spot to start for Spark on YARN:
https://spark.apache.org/docs/1.5.0/running-on-yarn.html
You can toggle between pages to match the specific version you are on.
-
Neelesh S. Salian
Cloudera
Thank you for the question.
What is different on this machine as compared to the ones where the job
succeeded?
-
Neelesh S. Salian
Cloudera
Hello,
Thank you for posting the question.
To begin I do have a few questions.
1) What is the size of the YARN installation? How many NodeManagers?
2) Notes to remember:
Container Virtual CPU Cores (yarn.nodemanager.resource.cpu-vcores)
>> Number of virtual CPU cores that can be allocated for
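A minimal sketch of where that vcores setting lives (assuming yarn-site.xml; the value 8 is only a placeholder, sized to the host's cores):

```xml
<!-- yarn-site.xml: vcores a NodeManager can hand out to containers -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value> <!-- placeholder value -->
</property>
```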
Feel free to correct me if I am wrong.
But I believe this isn't a feature yet:
"create a new Spark context within a single JVM process (driver)"
A few questions for you:
1) Is Kerberos set up correctly for you (the user)?
2) Could you please add the command/code you are executing?
Checking to
Hello,
Thank you for the question.
The status UNDEFINED means the application has not completed and has not yet
been assigned resources.
Upon getting an assignment it will progress to RUNNING and then to SUCCEEDED
upon completion.
It isn't a problem that you should worry about.
You should make sure to tune your
Hello,
I am parsing a text file and inserting the parsed values into a Hive table.
Code:
files = sc.wholeTextFiles("hdfs://nameservice1:8020/user/root/email.txt",
                          minPartitions=16, use_unicode=True)
# Setting use_unicode to False didn't help either
sqlContext.sql("DROP TABLE emails")
Please refrain from posting such messages on this email thread.
This is specific to the Spark ecosystem and not an avenue to advertise an
entity/company.
Thank you.
-
Neelesh S. Salian
Cloudera
Hi Ajay,
Feel free to open a JIRA with the fields that you think are missing and what
kind of documentation you wish to see.
It would be best to have it in a JIRA to actually track and triage your
suggestions.
Thank you.
-
Neelesh S. Salian
Cloudera
Hi,
Thanks for the question.
1) The core-site.xml holds the parameter for the defaultFS:
fs.defaultFS = hdfs://:8020
This will be prefixed to your value of spark.eventLog.dir. So depending on
which location you intend to write to, you can point it to either HDFS or
local disk.
As far
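The defaultFS resolution described above can be sketched in plain Python (illustrative only, not Spark's actual code; the namenode host is a placeholder):

```python
def resolve_event_log_dir(event_log_dir, default_fs="hdfs://namenode:8020"):
    """Scheme-less paths resolve against fs.defaultFS; an explicit
    file:/ or hdfs:// prefix pins the location."""
    if event_log_dir.startswith(("file:", "hdfs://")):
        return event_log_dir
    return default_fs + event_log_dir

print(resolve_event_log_dir("/user/spark/applicationHistory"))
# -> hdfs://namenode:8020/user/spark/applicationHistory
print(resolve_event_log_dir("file:/tmp/spark-events"))
# -> file:/tmp/spark-events (stays local)
```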
Hi,
Thanks for the question.
Is it possible for you to elaborate on your application?
The flow of the application will help in understanding what could potentially
cause things to slow down.
Do the logs give you any idea of what goes on? Have you had a chance to look?
Thank you.
-
Neelesh S. Salian
Cloudera
Hello,
Thanks for the question.
1) Typically the Resource Manager in YARN would print out the Aggregate
Resource Allocation for the application after you have found the specific
application using the application id.
2) As in MapReduce, there is a parameter that is part of either the
Thanks for the question.
The documentation here:
https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
lists a variety of submission techniques.
You can vary the Master URLs to suit your needs, whether it be local, yarn, or
mesos.
-
Hi Chris,
Thank you for posting the question.
Tuning Spark configurations is a tricky task since there are a lot of factors
to consider.
The configurations that you listed cover most of them.
To understand the situation so it can guide you in making a decision about
tuning:
1) What kind of spark
If --jars doesn't work,
try --conf spark.executor.extraClassPath=path-to-jar
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-Executor-java-lang-NoClassDefFoundError-tp24244p24256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hello shreesh,
That would be quite a challenge to understand.
A few things that I think should help estimate those numbers:
1) Understanding the cost of the individual transformations in the
application
E.g. a flatMap can be more expensive in memory as opposed to a map
2) The communication
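Point 1 can be seen even in plain Python, outside Spark (illustrative only, not the thread's code):

```python
# A map keeps one output record per input line; a flatMap can multiply
# the record count, so its intermediate results occupy more memory.
lines = ["a b c", "d e"]

mapped = [line.split() for line in lines]                   # 2 records (lists)
flat_mapped = [w for line in lines for w in line.split()]   # 5 records (words)

print(len(mapped), len(flat_mapped))
# -> 2 5
```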
As discussed during the meetup, the following information should help while
creating a topic on the User mailing list.
1) Version of Spark and Hadoop should be included to help reproduce the
issue or understand if the issue is a version limitation
2) Explanation about the scenario in as much
Hello,
Is the JSON file in HDFS or local?
Is /home/esten/ami/usaf.json an HDFS path?
Suggestions:
1) Specify file:/home/esten/ami/usaf.json
2) Or move the usaf.json file into HDFS since the application is looking for
the file in HDFS.
Please let me know if that helps.
Thank you.
--
Hi,
Spark on YARN should help in the memory management for Spark jobs.
Here is a good starting point:
https://spark.apache.org/docs/latest/running-on-yarn.html
YARN integrates well with HDFS and should be a good solution for a large
cluster.
What specific features are you looking for that HDFS
Hello,
Since the other queues are fine, I reckon there may be a limit on the max
apps or memory on this queue in particular.
I don't suspect FairScheduler limits either, but on this queue we may be
hitting a maximum.
Could you try to get the configs for the queue? That should provide
Hi,
Thanks for the added information; it helps add more context.
Is that specific queue different from the others?
FairScheduler.xml should have the information needed, or a separate
allocations.xml if you have one.
Something of this format:
<allocations>
  <queue name="sample_queue">
    <minResources>1
1) Could you share your command?
2) Are the kafka brokers on the same host?
3) Could you run a --describe on the topic to see if the topic is set up
correctly (just to be sure)?
By writing PDF files, do you mean something equivalent to a hadoop fs -put
/path?
I'm not sure how PDFBox works, though; have you tried writing individually
without Spark?
Once you have established that as a starting point, we can look at how Spark
can be interfaced to write to HDFS.
I see the other jobs SUCCEEDED without issues.
Could you snapshot the FairScheduler activity as well?
My guess is that, with the single core, it is reaching a NodeManager that is
still busy with other jobs, and the job ends up in a waiting state.
Does the job eventually complete?
Could you
+1 to executor-memory to 5g.
Do check the overhead space for both the driver and the executor as per
Wilfred's suggestion.
Typically, 384 MB should suffice.
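As a rough sketch of how that overhead is sized (the 10% factor and 384 MB floor follow the common spark.yarn.*.memoryOverhead default; treat the exact factor as version-dependent):

```python
def memory_overhead_mb(container_memory_mb, factor=0.10, floor_mb=384):
    # Overhead defaults to a fraction of the container memory,
    # but never drops below the floor.
    return max(int(container_memory_mb * factor), floor_mb)

print(memory_overhead_mb(5120))  # 5g executor -> 512 MB overhead
print(memory_overhead_mb(1024))  # small executor -> floor of 384 MB
```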
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Client mode would not support HDFS jar extraction.
I tried this:
sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi
--deploy-mode cluster --master yarn
hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10
And it worked.
Try running it like this:
sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi
--deploy-mode cluster --master yarn
hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10
Caveats:
1) Make sure the permissions of /user/nick are 775 or 777.
2) No need for