You can also use --files, which doesn't require the file scheme.
On Wed, Mar 9, 2016 at 11:20 AM Ashic Mahtab wrote:
> Found it.
>
> You can pass in the jvm parameter log4j.configuration. The following works:
>
> -Dlog4j.configuration=file:path/to/log4j.properties
>
> It doesn't
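The two approaches mentioned in this thread can be sketched as follows. This is a dry run that only prints the spark-submit commands. The application JAR and main class are hypothetical placeholders, and the sketch assumes a YARN deployment, where files passed with --files are uploaded alongside the application.

```shell
#!/bin/sh
# Dry-run sketch: each function prints a spark-submit command instead of running it.
# APP_JAR and MAIN_CLASS are hypothetical placeholders, not names from the thread.
LOG4J_FILE="path/to/log4j.properties"
APP_JAR="my-app.jar"
MAIN_CLASS="com.example.MyJob"

# Approach 1: pass the JVM parameter to the driver. The value is a URL,
# so a local path needs the file: scheme.
jvm_param_cmd() {
  echo spark-submit --class "$MAIN_CLASS" \
    --driver-java-options "-Dlog4j.configuration=file:$LOG4J_FILE" \
    "$APP_JAR"
}

# Approach 2: upload the file with --files, which takes a plain path.
files_cmd() {
  echo spark-submit --class "$MAIN_CLASS" \
    --files "$LOG4J_FILE" \
    "$APP_JAR"
}

jvm_param_cmd
files_cmd
```

Removing the echo from either function turns the dry run into a real submission.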
wrote:
>
> Is there any specific reason for caching the RDD? How many passes do you make
> over the dataset?
>
> Mohammed
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Saturday, October 3, 2015 9:50 PM
> To: Mohammed Guller
> Cc
e calling just a map operation and then a
> save operation, I don't see how caching would help.
>
> Mohammed
>
>
> -----Original Message-----
> From: Matt Narrell [mailto:matt.narr...@gmail.com]
> Sent: Tuesday, October 6, 2015 3:32 PM
> To: Mohammed Guller
> Cc: davidkl; user@
Is there any more information or best practices here? I have the exact same
issues when reading large data sets from HDFS (larger than available RAM) and I
cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER,
and using nearly all the cluster resources.
Should I
Hey,
Our ML ETL pipeline has several complex steps that I’d like to address with
custom Transformers in an ML Pipeline. Looking at the Tokenizer and HashingTF
transformers I see these handy traits (HasInputCol, HasLabelCol, HasOutputCol,
etc.) but they have strict access modifiers. How can I
I’m looking at various HA scenarios with Spark streaming. We’re currently
running a Spark streaming job that is intended to be long-lived, 24/7. We see
that if we kill node managers that are hosting Spark workers, new node managers
assume execution of the jobs that were running on the
Or setting the HADOOP_CONF_DIR property. Either way, you must have the YARN
configuration available to the submitting application to allow for the use of
“yarn-client” or “yarn-cluster”
The stack trace below doesn’t provide any information as to why the
job failed.
mn
On Nov 27,
I think this IS possible?
You must set the HADOOP_CONF_DIR variable on the machine you’re running the
Java process that creates the SparkContext. The Hadoop configuration specifies
the YARN ResourceManager IPs, and Spark will use that configuration.
mn
On Nov 21, 2014, at 8:10 AM, Prannoy
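A minimal sketch of the HADOOP_CONF_DIR setup described above, with a sanity check that the directory actually holds YARN settings. The default path /etc/hadoop/conf is an assumption and the has_yarn_conf helper is hypothetical; adjust both to your environment.

```shell
#!/bin/sh
# Sketch: verify that a Hadoop configuration directory contains YARN settings
# before starting the JVM that creates the SparkContext.
# /etc/hadoop/conf is an assumed default; use your cluster's actual path.
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
export HADOOP_CONF_DIR

# Succeeds if the given directory holds yarn-site.xml, the file that
# carries the ResourceManager settings.
has_yarn_conf() {
  [ -f "$1/yarn-site.xml" ]
}

if has_yarn_conf "$HADOOP_CONF_DIR"; then
  echo "YARN configuration found in $HADOOP_CONF_DIR"
else
  echo "no yarn-site.xml in $HADOOP_CONF_DIR" >&2
fi
```

The variable has to be exported in the environment of the process that creates the SparkContext, not only in an interactive shell.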
How do I configure the files to be uploaded to YARN containers? So far, I’ve
only seen “--conf spark.yarn.jar=hdfs://…”, which allows me to specify the HDFS
location of the Spark JAR, but I’m not sure how to prescribe other files for
uploading (e.g., spark-env.sh).
mn
On Nov 20, 2014, at 4:08
So, I’m using IntelliJ 13.x with Scala Spark jobs.
Make sure you have singletons (objects, not classes), then simply debug the
main function. You’ll need to set your master to some variant of “local”,
but that’s it. Spark Streaming is kinda wonky when debugging, but data-at-rest
behaves
I’ve been puzzled by this lately. I too would like to use the thrift server to
provide JDBC style access to datasets via SparkSQL. Is this possible? The
examples show temp tables created during the lifetime of a SparkContext. I
assume I can use SparkSQL to query those tables while the
Can this be done? Can I just spin up a SparkContext programmatically, point
it at my YARN cluster, and have it work like spark-submit? Doesn’t (at least)
the application JAR need to be distributed to the workers via HDFS or the like
for the jobs to run?
mn
On Oct 28, 2014, at 2:29 AM,
http://spark.apache.org/docs/latest/streaming-programming-guide.html
foreachRDD is executed on the driver…
mn
On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote:
Pinging TD -- I'm sure you know :-)
to a
unified place instead of to local disk on a random box on the cluster.
On Thu, Sep 25, 2014 at 1:38 PM, Matt Narrell matt.narr...@gmail.com wrote:
How does this work with a cluster manager like YARN?
mn
On Sep 25, 2014, at 2:23 PM, Andrew Or and...@databricks.com wrote:
Hi Harsha,
You
and JavaPairReceiverInputDStream?
On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell matt.narr...@gmail.com wrote:
The part that works is the commented-out, single-receiver stream below the
loop. It seems that when I call KafkaUtils.createStream more than once, I
don’t receive any messages.
I’ll
, there are none left to
do the processing. If I drop the number of partitions/receivers down while
still having multiple unioned receivers, I see messages.
mn
On Sep 25, 2014, at 10:18 AM, Matt Narrell matt.narr...@gmail.com wrote:
I suppose I have other problems as I can’t get the Scala example
Additionally,
If I dial up/down the number of executor cores, this does what I want. Thanks
for the extra eyes!
mn
On Sep 25, 2014, at 12:34 PM, Matt Narrell matt.narr...@gmail.com wrote:
Tim,
I think I understand this now. I had a five node Spark cluster and a five
partition topic
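The arithmetic behind this can be sketched in a few lines. It assumes that each Kafka receiver occupies one core for as long as the stream runs, and that the five nodes each ran one single-core executor; the thread does not state the actual core counts, and cores_left is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: cores left for processing once every receiver has claimed one.
# Usage: cores_left NUM_EXECUTORS CORES_PER_EXECUTOR NUM_RECEIVERS
cores_left() {
  echo $(( $1 * $2 - $3 ))
}

# Assumed layout: five single-core executors, five receivers.
cores_left 5 1 5
# Same cluster with two cores per executor.
cores_left 5 2 5
```

On YARN, the first two inputs correspond to the --num-executors and --executor-cores options of spark-submit.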
on who's doing RM), you should be able to see if the
receiver has any errors when trying to talk to kafka.
On Tue, Sep 23, 2014 at 3:21 PM, Matt Narrell matt.narr...@gmail.com wrote:
To my eyes, these are functionally equivalent. I’ll try a Scala approach,
but this may cause waves
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines.
mn
On Sep 24, 2014, at 5:35 AM, Petr Novak oss.mli...@gmail.com wrote:
Hello,
if our Hadoop cluster is configured with HA and fs.defaultFS points to a
namespace instead of a namenode hostname -
This just shows the driver. Click the Executors tab in the Spark UI.
mn
On Sep 24, 2014, at 11:25 AM, Raghuveer Chanda raghuveer.cha...@gmail.com
wrote:
Hi,
I'm new to Spark and am having a problem running a job on a cluster using YARN.
Initially I ran jobs using spark master as --master
Hey,
Spark 1.1.0
Kafka 0.8.1.1
Hadoop (YARN/HDFS) 2.5.1
I have a five partition Kafka topic. I can create a single Kafka receiver via
KafkaUtils.createStream with five threads in the topic map and consume messages
fine. Sifting through the user list and Google, I see that it’s possible to
new Tuple2(tuple._2().getDeviceId(), 1);
}
});
… and further Spark functions …
On Sep 23, 2014, at 2:55 PM, Tim Smith secs...@gmail.com wrote:
Posting your code would be really helpful in figuring out gotchas.
On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr
I came across this: https://github.com/xerial/sbt-pack
Until I found this, I was simply using the sbt-assembly plugin (sbt clean
assembly)
mn
On Sep 4, 2014, at 2:46 PM, Aris arisofala...@gmail.com wrote:
Thanks for answering Daniil -
I have SBT version 0.13.5, is that an old version?
I’ve put my Spark JAR into HDFS, and specify the SPARK_JAR variable to point to
the HDFS location of the jar. I’m not using any specialized configuration
files (like spark-env.sh), but rather setting things either by environment
variable per node, passing application arguments to the job, or
Hello,
I’m using Spark streaming to aggregate data from a Kafka topic in sliding
windows. Usually we want to persist this aggregated data to a MongoDB cluster,
or republish to a different Kafka topic. When I include these 3rd party
drivers, I usually get a NotSerializableException due to the
serialization.
On Tue, Sep 2, 2014 at 5:45 PM, Matt Narrell matt.narr...@gmail.com wrote:
Hello,
I’m using Spark streaming to aggregate data from a Kafka topic in sliding
windows. Usually we want to persist this aggregated data to a MongoDB
cluster, or republish to a different Kafka topic
...@cloudera.com wrote:
Hi Matt,
I checked in the YARN code and I don't see any references to
yarn.resourcemanager.address. Have you made sure that your YARN client
configuration on the node you're launching from contains the right configs?
-Sandy
On Mon, Aug 18, 2014 at 4:07 PM, Matt Narrell
.
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
However, now the Spark jobs running in the ApplicationMaster on a given node
fail to find the active ResourceManager. Below is a log excerpt from one
of the assigned nodes. As all the jobs fail, eventually YARN will move
Ok Marcelo,
Thanks for the quick and thorough replies. I’ll keep an eye on these tickets
and the mailing list to see how things move along.
mn
On Aug 20, 2014, at 1:33 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi,
On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com
Hello,
I have an HA-enabled YARN cluster with two resource managers. When submitting
jobs via “spark-submit --master yarn-cluster”, it appears that the driver is
looking explicitly for the “yarn.resourcemanager.address” property rather than
round-robining through the resource managers via the
I’d suggest something like Apache YARN, or Apache Mesos with Marathon or
something similar to allow for management, in particular restart on failure.
mn
On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu