Hi,
I'm new to Spark. For my application I need to override Hadoop
configurations (I can't change the configurations in Hadoop itself as it might
affect my regular HDFS), so that the Namenode IPs get automatically resolved.
What are the ways to do so? I tried giving "spark.hadoop.dfs.ha.namenodes.nn",
We have a streaming application running on YARN and we would like to ensure
that it is up and running 24/7.
Is there a way to tell yarn to automatically restart a specific application
on failure?
There is a property, yarn.resourcemanager.am.max-attempts, which defaults
to 2; setting it to a bigger value
Hi,
How do I connect to Hadoop HA from Spark? I tried overriding the Hadoop
configurations from SparkConf, but I'm still getting an UnknownHostException
with the following trace:
java.lang.IllegalArgumentException: java.net.UnknownHostException: ABC at
Do you have all of the required HDFS HA config options in your override?
I think these are the minimum required for HA (a minimal SparkConf sketch
follows the list):
dfs.nameservices
dfs.ha.namenodes.{nameservice ID}
dfs.namenode.rpc-address.{nameservice ID}.{name node ID}
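For the override itself, a minimal SparkConf sketch (the nameservice/namenode
IDs and hosts below are placeholders); the spark.hadoop.* prefix makes Spark
copy the values into the Hadoop Configuration it uses, without touching the
cluster-wide files:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.dfs.nameservices", "myha")
  .set("spark.hadoop.dfs.ha.namenodes.myha", "nn1,nn2")
  .set("spark.hadoop.dfs.namenode.rpc-address.myha.nn1", "namenode1.example.com:8020")
  .set("spark.hadoop.dfs.namenode.rpc-address.myha.nn2", "namenode2.example.com:8020")
  // in practice the client-side failover proxy provider is usually needed as well
  .set("spark.hadoop.dfs.client.failover.proxy.provider.myha",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

With that in place, paths like hdfs://myha/... should resolve without an
UnknownHostException.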
On Thu, Oct 1, 2015 at 7:22 AM, Vinoth Sankar
I am using standalone deployment with Spark 1.4.1.
When I submit the job, I get no error at the submission terminal. Then I
check the web UI; I can find the driver section which has my driver
submission, with this error: java.io.FileNotFoundException ... which points
to the full path of my jar as
You can point to your custom HADOOP_CONF_DIR in your spark-env.sh
Regards
Sab
On 01-Oct-2015 5:22 pm, "Vinoth Sankar" wrote:
> Hi,
>
> I'm new to Spark. For my application I need to overwrite Hadoop
> configurations (Can't change Configurations in Hadoop as it might affect
On top of that you could make the topic part of the key (e.g. keyBy in
.transform or manually emitting a tuple) and use one of the .xxxByKey operators
for the processing.
If you have a stable, domain-specific list of topics (e.g. 3-5 named topics)
and the processing is really different, I
You can get the topic for a given partition from the offset range. You can
either filter using that, or just have a single RDD and match on the topic when
doing mapPartitions or foreachPartition (which I think is a better idea).
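A minimal sketch of the single-stream variant (assuming the Spark 1.x direct
stream API and a stream variable named directStream; topic names are
placeholders). Partitions of the underlying KafkaRDD line up 1:1 with the
offset ranges, so the topic can be looked up by partition id:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    // which topic did this partition come from?
    val topic = offsetRanges(TaskContext.get.partitionId).topic
    topic match {
      case "topicA" => iter.foreach(record => println(record._2)) // placeholder processing
      case _        => iter.foreach(record => println(record._2))
    }
  }
}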
Did you check your Kafka broker logs to see what was going on during that
time?
The direct stream will handle normal leader loss / rebalance by retrying
tasks.
But the exception you got indicates that something was wrong with Kafka,
such that offsets were being re-used,
i.e. your job already
Yes. Also, I think I need to enable checkpointing and, rather than building
up the lineage DAG, store the RDD data in HDFS.
On 23 September 2015 at 01:04, Adrian Tanase wrote:
> btw I re-read the docs and I want to clarify that reliable receiver + WAL
> gives you at
This also happened to me in extreme recovery scenarios – e.g. killing 4 out of
a 7-machine cluster.
I'd put my money on recovering from an out-of-sync replica, although I haven't
done extensive testing around it.
-adrian
From: Cody Koeninger
Date: Thursday, October 1, 2015 at 5:18 PM
To:
Have you set up HADOOP_CONF_DIR in spark-env.sh correctly?
Cheers
On Thu, Oct 1, 2015 at 5:22 AM, Vinoth Sankar wrote:
> Hi,
>
> How do i connect HadoopHA from SPARK. I tried overwriting hadoop
> configurations from sparkCong. But Still I'm getting UnknownHostException
>
Hi,
I am using Spark 1.4.0, Python, and decision trees to perform machine
learning classification.
I test it by creating the predictions and zipping them with the test data, as
follows:
predictions = tree_model.predict(test_data.map(lambda a: a.features))
labels = test_data.map(lambda a:
This happens automatically as long as you submit in cluster mode instead of
client mode (e.g. ./spark-submit --master yarn-cluster ...).
The property you mention would help right after that, although you will need to
set it to a large value (e.g. 1000?), as there is no "infinite" support.
How are you running the actual application?
I find it slightly odd that you're setting PYSPARK_SUBMIT_ARGS
directly; that's supposed to be an internal env variable used by
Spark. You'd normally pass those parameters in the spark-submit (or
pyspark) command line.
On Thu, Oct 1, 2015 at 8:56 AM,
In your second command, have you tried changing the comma to a colon?
Cheers
On Thu, Oct 1, 2015 at 8:56 AM, YaoPau wrote:
> I'm trying to add multiple SerDe jars to my pyspark session.
>
> I got the first one working by changing my PYSPARK_SUBMIT_ARGS to:
>
> "--master
I'm trying to add multiple SerDe jars to my pyspark session.
I got the first one working by changing my PYSPARK_SUBMIT_ARGS to:
"--master yarn-client --executor-cores 5 --num-executors 5 --driver-memory
3g --executor-memory 5g --jars /home/me/jars/csv-serde-1.1.2.jar"
But when I tried to add a
Hi,
I have a use case where we need to automate start/stop of a Spark Streaming
application.
To stop the Spark job, we need the driver/application ID of the job.
For example:
/app/spark-master/bin/spark-class org.apache.spark.deploy.Client kill
spark://10.65.169.242:7077 $driver_id
I am thinking to
My workers are going OOM over time. I am running a streaming job in Spark
1.4.0.
Here is the heap dump of the workers:
16,802 instances of "org.apache.spark.deploy.worker.ExecutorRunner",
loaded by "sun.misc.Launcher$AppClassLoader @ 0xdff94088" occupy
488,249,688 (95.80%) bytes. These
Hi, there
When running a Spark job on YARN, 2 executors somehow got lost during the
execution. The message on the history server GUI is “CANNOT find address”. Two
extra executors were launched by YARN and eventually finished the job. Usually
I go to the “Executors” tab on the UI to check the
Hi all,
I need to repeat a couple of rows from a DataFrame n times each. To do so, I
planned to create a new DataFrame, but I am unable to find a way to
accumulate "Rows" somewhere; as this might get huge, I can't accumulate them
into a mutable Array, I think.
Thanks,
Saif
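One option that avoids accumulating anything on the driver, as a minimal sketch
(Scala; df and n are placeholders): flatMap each Row into n copies and rebuild a
DataFrame with the original schema.

val n = 3 // hypothetical repeat count
val repeated = sqlContext.createDataFrame(
  df.rdd.flatMap(row => Seq.fill(n)(row)), // emit each row n times, stays distributed
  df.schema)

If the count differs per row, the same flatMap can read it from a column of the row.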
Ted,
Thanks for your reply.
First of all, after sending the email to the mailing list, I used 'yarn logs
-applicationId' to retrieve the aggregated log
successfully. I found the exceptions I was looking for.
Now as to your suggestion, when I go to the YARN RM UI, I can only see the
"Tracking URL" in
Can you go to the YARN RM UI to find all the attempts for this Spark job?
The two lost executors should be listed there.
On Thu, Oct 1, 2015 at 10:30 AM, Lan Jiang wrote:
> Hi, there
>
> When running a Spark job on YARN, 2 executors somehow got lost during the
> execution. The
Thanks Nicolae,
So in my case all executors are sending results back to the driver, and
"shuffle is just sending out the textFile to distribute the partitions".
Could you please elaborate on this? What exactly is in this file?
On Wed, Sep 30, 2015 at 9:57 PM, Nicolae Marasoiu <
Hi, there
Here is the problem I ran into when executing a Spark job (Spark 1.3). The
Spark job is loading a bunch of Avro files using the Spark SQL spark-avro 1.0.0
library. Then it does some filter/map transformations, repartitions to 1
partition, and then writes to HDFS. It creates 2 stages. The total
Hi,
If you just need processing per topic, why not generate N different Kafka
direct streams? When creating a Kafka direct stream you have a list of topics -
just give it one.
Then the reusable part of your computations should be extractable as
transformations/functions and reused between the
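For example, a minimal sketch (assuming the Spark 1.x Kafka direct API; brokers,
topic names and processTopic are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Seq("topicA", "topicB", "topicC")

// one direct stream per topic
val streams = topics.map { topic =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
}

// apply the shared, reusable logic to each per-topic stream
streams.zip(topics).foreach { case (stream, topic) =>
  stream.map(_._2).foreachRDD(rdd => processTopic(topic, rdd)) // processTopic is hypothetical
}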
Hi,
I am writing a Spark Streaming job using the direct stream method for Kafka
and want to handle the case of checkpoint failure, when we'll have to
reprocess the entire data from the start. By default, for every new
checkpoint it tries to load everything from each partition, and that takes a
lot
That depends on your job, your cluster resources, the number of seconds per
batch...
You'll need to do some empirical work to figure out how many messages per
batch a given executor can handle. Divide that by the number of seconds
per batch.
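If you then need to cap it rather than just measure it, I believe the direct
stream honours spark.streaming.kafka.maxRatePerPartition (messages per second
per Kafka partition), e.g.:

import org.apache.spark.SparkConf

// placeholder value; messages per batch = value * #partitions * batch seconds
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")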
On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak
Hi,
We have Python 2.6 (the default) on the cluster and we have also installed
Python 2.7.
I was looking for a way to set the Python version for spark-submit.
Does anyone know how to do this?
Thanks
PYSPARK_PYTHON determines what the workers use.
PYSPARK_DRIVER_PYTHON is for the driver.
See the comment at the beginning of bin/pyspark
FYI
On Thu, Oct 1, 2015 at 1:56 PM, roy wrote:
> Hi,
>
> We have python2.6 (default) on cluster and also we have installed
> python2.7.
>
> I
Hi Pala,
Can you add the full stacktrace of the exception? For now, can you use
create temporary function to work around the issue?
Thanks,
Yin
On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia <
mchett...@rocketfuelinc.com.invalid> wrote:
> +user list
>
> On Tue, Sep 29, 2015 at 3:43 PM, Pala
Hi
I am trying to better understand shuffle in Spark.
Based on my understanding thus far:
*Shuffle write*: writes the output of an intermediate stage to local disk
if memory is not sufficient.
For example, if each worker has 200 MB of memory for intermediate results and
the results are 300 MB, then
Hello,
Is it possible to implement a custom receiver [1] which will receive messages
from REST calls?
As REST classes in Java (JAX-RS) are defined declaratively and instantiated by
the application server, I'm not sure if it is possible.
I have tried to implement a custom receiver which is injected into the REST
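For what it's worth, here is a minimal sketch of a custom receiver that embeds
its own HTTP endpoint (the JDK HttpServer) instead of relying on a JAX-RS
container; the port and path are made up for illustration:

import java.net.InetSocketAddress
import scala.io.Source
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class RestReceiver(port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @volatile private var server: HttpServer = _

  override def onStart(): Unit = {
    server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/events", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        val body = Source.fromInputStream(exchange.getRequestBody, "UTF-8").mkString
        store(body)                            // hand the message to Spark Streaming
        exchange.sendResponseHeaders(200, -1)  // no response body
        exchange.close()
      }
    })
    server.start()
  }

  override def onStop(): Unit = {
    if (server != null) server.stop(0)
  }
}

Usage would be ssc.receiverStream(new RestReceiver(8090)). The caveat is that
the endpoint comes up on whichever executor the receiver lands on, so clients
need to discover that host.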
Hi Nicolae,
Thanks for the reply. To further clarify things -
sc.textFile is reading from HDFS; now shouldn't the file be read in a way
such that EACH executor works only on the local copy of the file part
available? In this case it's a ~4.64 GB file and the block size is 256 MB,
so approx 19 partitions
Thanks for getting back, Yin. I have copied the stack below. The associated
query is just this: "hc.sql("select murmurhash3('abc') from dual")". The
UDF murmurhash3 is already available in our Hive metastore.
Regarding the temporary function, can I create a temp function with existing
Hive UDF code,
Hi,
I am trying to find a simple example to read a data file on HDFS. The
file has the following format
a, b, c, yyyy, mm
a1,b1,c1,2015,09
a2,b2,c2,2014,08
I would like to read this file and store it in HDFS partitioned by year and
month. Something like this
/path/to/hdfs/yyyy/mm
I want to
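A minimal sketch of one way to do it (assuming Spark 1.4+ for
DataFrameWriter.partitionBy; paths and column names are placeholders):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val raw = sc.textFile("hdfs:///path/to/input.csv")
val header = raw.first()

val df = raw.filter(_ != header)          // drop the header row
  .map(_.split(","))
  .map(p => (p(0).trim, p(1).trim, p(2).trim, p(3).trim, p(4).trim))
  .toDF("a", "b", "c", "year", "month")

// partitionBy lays the output out as .../year=2015/month=09/...
df.write.partitionBy("year", "month").parquet("hdfs:///path/to/output")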
Hi,
So you say " sc.textFile -> flatMap -> Map".
My understanding is like this:
First step is a number of partitions are determined, p of them. You can give
hint on this.
Then the nodes which will load partitions p, that is n nodes (where n<=p).
Relatively at the same time or not, the n nodes
Thanks Robin.
Regards
SM
> On 01-Oct-2015, at 3:15 pm, Robin East wrote:
>
> From the comments in the code:
>
> When called inside a class in the spark package, returns the name of the user
> code class (outside the spark package) that called into Spark, as well as
>
I'm running a standalone Spark cluster of 1 master and 2 slaves.
My slaves file under /conf lists the fully qualified domain names of the 2
slave machines.
When I look at the Spark web page (on :8080), I see my 2 workers, but the
worker ID uses the IP address, like
Additional Info: I am running Spark on YARN.
2015-10-01 15:42 GMT-07:00 Renxia Wang :
> Hi guys,
>
> I know there is a way to set the number of retry of failed tasks, using
> spark.task.maxFailures. what is the default policy for the failed tasks
> retry? Is it exponential
Not sure what you mean by that; I shared the data which I see in the Spark UI.
Can you point me to a location where I can precisely get the data you need?
When I run the job in fine-grained mode, I see tons of tasks created and
destroyed under a Mesos "framework". I have about 80k Spark tasks which
When you say "receive messages" you mean acting as a REST endpoint, right? If
so, it might be better to use the JMS (or Kafka) option for a few reasons:
The receiver will be deployed to any of the available executors, so your REST
clients will need to be made aware of the IP where the receiver is
Bumping it up; it's not really a blocking issue.
But fine-grained mode eats up an uncertain number of resources in Mesos and
launches tons of tasks, so I would prefer using the coarse-grained mode if
only it didn't run out of memory.
Thanks,
-Utkarsh
On Mon, Sep 28, 2015 at 2:24 PM, Utkarsh Sengar
I've eyeballed the sbt file and it looks OK to me.
Try:
sbt clean package
That should sort it out. If not, please supply the full code you are running.
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
Hi Utkarsh,
I replied earlier asking what your task assignment looks like with fine vs.
coarse-grained mode.
Tim
On Thu, Oct 1, 2015 at 4:05 PM, Utkarsh Sengar
wrote:
> Bumping it up, its not really a blocking issue.
> But fine grain mode eats up uncertain number of
Hi,
I wanted to understand: what is the purpose of the call site in SparkContext?
Regards
SM
Hi, has anybody tried to save a DataFrame to HBase? I have processed data in a
DataFrame which I need to store in HBase so that my web UI can access it
from HBase. Please guide. Thanks in advance.
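One common low-tech route, as a minimal sketch (assuming the HBase 1.x client
API is on the executor classpath; table, column family and column names are
placeholders): write from foreachPartition with one connection per partition.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

df.rdd.foreachPartition { rows =>
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("my_table"))
  rows.foreach { row =>
    val put = new Put(Bytes.toBytes(row.getString(0)))        // row key = first column
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
      Bytes.toBytes(row.getString(1)))
    table.put(put)
  }
  table.close()
  conn.close()
}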
Yes. You can use create temporary function to create a function based on a
Hive UDF (
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction
).
Regarding the error, I think the problem is that starting from Spark 1.4,
we have two separate
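For reference, the workaround looks roughly like this (the UDF class name is a
placeholder for whatever class backs murmurhash3 in your metastore):

hc.sql("CREATE TEMPORARY FUNCTION murmurhash3 AS 'com.example.hive.udf.MurmurHash3UDF'")
hc.sql("SELECT murmurhash3('abc') FROM dual").show()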
Hi guys,
I know there is a way to set the number of retries of failed tasks, using
spark.task.maxFailures. What is the default policy for retrying failed tasks?
Is it exponential backoff? My tasks sometimes fail because of
socket connection timeout/reset; even with retries, some of the tasks will
Here is the log file from the worker node
15/09/30 23:49:37 INFO Worker: Executor app-20150930233113-/8 finished
with state EXITED message Command exited with code 1 exitStatus 1
15/09/30 23:49:37 INFO Worker: Asked to launch executor
app-20150930233113-/9 for PythonPi
15/09/30 23:49:37
Right, you can use SparkContext and SQLContext in multiple threads. They
are thread safe.
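For example, a minimal sketch using Scala futures to kick off one job per column
(the per-column computation is a placeholder):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val futures = df.columns.toSeq.map { col =>
  Future {
    // each thread can safely reuse the same SQLContext / DataFrame
    df.groupBy(col).count().collect()
  }
}
val results = futures.map(f => Await.result(f, 10.minutes))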
Best Regards,
Shixiong Zhu
2015-10-01 4:57 GMT+08:00 :
> Hi all,
>
> I have a process where I do some calculations on each one of the columns
> of a dataframe.
>
Do you have the log? It looks like some exceptions in your code caused the
SparkContext to stop.
Best Regards,
Shixiong Zhu
2015-09-30 17:30 GMT+08:00 tranan :
> Hello All,
>
> I have several Spark Streaming applications running on Standalone mode in
> Spark 1.5. Spark is currently
As I understand it, you don't need a merge of your historical data RDD with
your RDD_inc; what you need is a merge of the computation results of your
historical RDD with RDD_inc, and so on.
IMO, you should consider having an external row store to hold your
computations. I say this because you need
Do you have the log file? It may be because of wrong settings.
Best Regards,
Shixiong Zhu
2015-10-01 7:32 GMT+08:00 markluk :
> I setup a new Spark cluster. My worker node is dying with the following
> exception.
>
> Caused by: java.util.concurrent.TimeoutException: Futures
I am not sure about the original explanation of shuffle write.
In the word count example, the shuffle is needed, as Spark has to group by the
word (reduceByKey is more accurate here). Imagine that you have 2 mappers to
read the data; then each mapper will generate (word, count) tuple output in
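The word count being referred to, as a minimal sketch for reference:

val counts = sc.textFile("hdfs:///path/to/input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)  // this is the step that shuffles: all counts for one word meet on one reducer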
We have limited disk space. So, can we use spark.cleaner.ttl to clean up
the files? Or is there any setting that can clean up old temp files?
On Mon, Sep 28, 2015 at 7:02 PM, Shixiong Zhu wrote:
> These files are created by shuffle and just some temp files. They are not
>
Hi All,
Spark 1.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-1.5 maintenance branch of Spark. We
*strongly recommend* all 1.5.0 users to upgrade to this release.
The full list of bug fixes is here: http://s.apache.org/spark-1.5.1
Looks like the Spark history server should take the lost executors into
account by analyzing the output from the 'yarn logs -applicationId' command.
Cheers
On Thu, Oct 1, 2015 at 11:46 AM, Lan Jiang wrote:
> Ted,
>
> Thanks for your reply.
>
> First of all, after sending email to
Hi spark users and developers,
I'm trying to use spark-submit --packages against a private S3 repository.
With sbt, I'm using fm-sbt-s3-resolver with proper AWS S3 credentials. I
wonder how I can add this resolver to spark-submit so that --packages
can resolve dependencies from the private repo?
Hi
I am trying to call .persist() on a DataFrame, but once I execute the next line
I get:
java.util.NoSuchElementException: key not found: ….
I tried persisting on disk as well; same thing.
I am using:
pyspark with Python 3
Spark 1.5
Thanks!
EYAD SIBAI
Risk Engineer
iZettle ®