Re: Specify log4j properties file

2016-03-09 Thread Matt Narrell
You can also use --files, which doesn't require the file scheme. On Wed, Mar 9, 2016, Ashic Mahtab wrote: > Found it. You can pass in the JVM parameter log4j.configuration. The following works: -Dlog4j.configuration=file:path/to/log4j.properties. It doesn't …
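
For reference, the same setting can also be expressed through SparkConf rather than the command line. A minimal sketch; the properties path is a placeholder:

    import org.apache.spark.SparkConf

    // Equivalent to passing -Dlog4j.configuration on the driver command line;
    // the file path below is a placeholder.
    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",
        "-Dlog4j.configuration=file:/path/to/log4j.properties")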

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
… wrote: > Is there any specific reason for caching the RDD? How many passes do you make over the dataset? > Mohammed

Re: laziness in textFile reading from HDFS?

2015-10-06 Thread Matt Narrell
… calling just a map operation and then a save operation, I don't see how caching would help. > Mohammed

Re: laziness in textFile reading from HDFS?

2015-10-03 Thread Matt Narrell
Is there any more information or best practices here? I have the exact same issues when reading large data sets from HDFS (larger than available RAM): I cannot run without setting the RDD persistence level to MEMORY_AND_DISK_SER and using nearly all the cluster resources. Should I …
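
The persistence level mentioned above is set explicitly on the RDD. A minimal sketch, assuming sc is an existing SparkContext; the HDFS path is a placeholder:

    import org.apache.spark.storage.StorageLevel

    // Keep partitions serialized in memory and spill to disk when RAM is
    // exhausted, rather than recomputing from HDFS on each pass.
    val lines = sc.textFile("hdfs:///data/large-input")
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(lines.count())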

[Spark ML] HasInputCol, etc.

2015-07-28 Thread Matt Narrell
Hey, our ML ETL pipeline has several complex steps that I’d like to address with custom Transformers in an ML Pipeline. Looking at the Tokenizer and HashingTF transformers, I see these handy traits (HasInputCol, HasLabelCol, HasOutputCol, etc.), but they have strict access modifiers. How can I …
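
One public alternative, given that those Has* traits carry the strict access modifiers mentioned above: extend UnaryTransformer, which already wires up input and output columns. A sketch for the Spark 1.4-era API; the class name and transform function are illustrative:

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // Illustrative single-column transformer; plugs into an ML Pipeline the
    // same way Tokenizer or HashingTF does.
    class Cleanser(override val uid: String)
        extends UnaryTransformer[String, String, Cleanser] {
      def this() = this(Identifiable.randomUID("cleanser"))
      override protected def createTransformFunc: String => String =
        _.trim.toLowerCase
      override protected def outputDataType: DataType = StringType
    }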

Spark Streaming on YARN with loss of application master

2015-03-30 Thread Matt Narrell
I’m looking at various HA scenarios with Spark Streaming. We’re currently running a Spark Streaming job that is intended to be long-lived, 24/7. We see that if we kill node managers that are hosting Spark workers, new node managers assume execution of the jobs that were running on the …
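
For the driver / application-master side of this, the usual pattern is checkpoint-based recovery. A sketch, with the checkpoint path, app name, and batch interval as placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // On a restart of the application master, getOrCreate rebuilds the
    // StreamingContext from the checkpoint instead of calling createContext.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("streaming-ha")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs:///spark/checkpoints")
      // define the DStream graph here before returning
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs:///spark/checkpoints", createContext _)
    ssc.start()
    ssc.awaitTermination()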

Re: Spark Job submit

2014-12-01 Thread Matt Narrell
Or setting the HADOOP_CONF_DIR environment variable. Either way, you must have the YARN configuration available to the submitting application to allow for the use of “yarn-client” or “yarn-cluster”. The attached stack trace below doesn’t provide any information as to why the job failed. mn

Re: Execute Spark programs from local machine on Yarn-hadoop cluster

2014-11-23 Thread Matt Narrell
I think this IS possible. You must set the HADOOP_CONF_DIR variable on the machine where you’re running the Java process that creates the SparkContext. The Hadoop configuration specifies the YARN ResourceManager IPs, and Spark will use that configuration. mn
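
A minimal sketch of that setup; it assumes HADOOP_CONF_DIR is exported in the environment of the launching process, and the app name is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // With HADOOP_CONF_DIR pointing at the cluster's configuration, Spark
    // resolves the YARN ResourceManager from the yarn-site.xml it finds there.
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("remote-yarn-example")
    val sc = new SparkContext(conf)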

Re: spark-submit and logging

2014-11-20 Thread Matt Narrell
How do I configure the files to be uploaded to the YARN containers? So far, I’ve only seen --conf spark.yarn.jar=hdfs://…, which allows me to specify the HDFS location of the Spark JAR, but I’m not sure how to prescribe other files for uploading (e.g., spark-env.sh). mn
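
One answer that appears in related threads is --files on spark-submit, which corresponds to a configuration property as well. A sketch, with placeholder paths:

    import org.apache.spark.SparkConf

    // spark.yarn.jar locates the Spark assembly; spark.yarn.dist.files lists
    // extra files (comma separated) to ship to the YARN containers.
    val conf = new SparkConf()
      .set("spark.yarn.jar", "hdfs:///spark/lib/spark-assembly.jar")
      .set("spark.yarn.dist.files", "file:/etc/spark/log4j.properties")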

Re: Scala Spark IDE help

2014-10-28 Thread Matt Narrell
So, I’m using IntelliJ 13.x and Scala Spark jobs. Make sure you have singletons (objects, not classes), then simply debug the main function. You’ll need to set your master to some derivation of “local”, but that’s it. Spark Streaming is kinda wonky when debugging, but data-at-rest behaves …
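
A sketch of the shape that debugs cleanly from the IDE; the object name and job body are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // A singleton object (not a class) whose main() can be run or debugged
    // directly from the IDE against an in-process local master.
    object DebugJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("debug")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).sum())  // breakpoints work here
        sc.stop()
      }
    }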

Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately. I too would like to use the thrift server to provide JDBC-style access to datasets via Spark SQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use Spark SQL to query those tables while the …
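
One route that became the common answer is starting the thrift server on an existing context, so JDBC clients can see its temp tables. A sketch, assuming sc is an existing SparkContext; the data path and table name are placeholders:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Tables registered on this context stay queryable over JDBC for as long
    // as the context lives.
    val hiveContext = new HiveContext(sc)
    hiveContext.jsonFile("hdfs:///data/events.json").registerTempTable("events")
    HiveThriftServer2.startWithContext(hiveContext)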

Re: Submiting Spark application through code

2014-10-28 Thread Matt Narrell
Can this be done? Can I just spin up a SparkContext programmatically, point it at my YARN cluster, and have it work like spark-submit? Doesn’t (at least) the application JAR need to be distributed to the workers via HDFS or the like for the jobs to run? mn

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Matt Narrell
http://spark.apache.org/docs/latest/streaming-programming-guide.html foreachRDD is executed on the driver. mn On Oct 20, 2014, Gerard Maas wrote: > Pinging TD -- I'm sure you know :-)
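
A sketch of where each closure runs, assuming dstream is an existing DStream[String]:

    // The outer closure runs on the driver, once per batch; only the inner
    // closures are shipped to the executors.
    dstream.foreachRDD { rdd =>
      // driver side: e.g. look up batch-level state here
      rdd.foreachPartition { records =>
        // executor side: runs once per partition
        records.foreach(println)
      }
    }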

Re: SPARK UI - Details post job processiong

2014-09-26 Thread Matt Narrell
… to a unified place instead of to local disk on a random box on the cluster. On Thu, Sep 25, 2014, Matt Narrell wrote: > How does this work with a cluster manager like YARN? mn

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
… and JavaPairReceiverInputDStream? On Wed, Sep 24, 2014, Matt Narrell wrote: > The part that works is the commented-out, single-receiver stream below the loop. It seems that when I call KafkaUtils.createStream more than once, I don’t receive any messages. I’ll …

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
… there are none left to do the processing. If I drop the number of partitions/receivers down while still having multiple unioned receivers, I see messages. mn On Sep 25, 2014, Matt Narrell wrote: > I suppose I have other problems, as I can’t get the Scala example …

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
Additionally, if I dial up/down the number of executor cores, this does what I want. Thanks for the extra eyes! mn On Sep 25, 2014, Matt Narrell wrote: > Tim, I think I understand this now. I had a five-node Spark cluster and a five-partition topic …

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Matt Narrell
… on who's doing RM), you should be able to see if the receiver has any errors when trying to talk to Kafka. On Tue, Sep 23, 2014, Matt Narrell wrote: > To my eyes, these are functionally equivalent. I’ll try a Scala approach, but this may cause waves …

Re: Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Matt Narrell
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines. mn On Sep 24, 2014, Petr Novak wrote: > Hello, if our Hadoop cluster is configured with HA and fs.defaultFS points to a namespace instead of a namenode hostname …

Re: Spark with YARN

2014-09-24 Thread Matt Narrell
This just shows the driver. Click the Executors tab in the Spark UI. mn On Sep 24, 2014, Raghuveer Chanda wrote: > Hi, I'm new to Spark and facing a problem with running a job on a cluster using YARN. Initially I ran jobs using the Spark master as --master …

Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
Hey, Spark 1.1.0, Kafka 0.8.1.1, Hadoop (YARN/HDFS) 2.5.1. I have a five-partition Kafka topic. I can create a single Kafka receiver via KafkaUtils.createStream with five threads in the topic map and consume messages fine. Sifting through the user list and Google, I see that it’s possible to …
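
The multi-receiver pattern that eventually worked in this thread (see the follow-ups above) looks roughly like this; ssc, the ZooKeeper host, group, and topic are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One receiver per topic partition, unioned into a single DStream. Note
    // that each receiver pins an executor core, so the job needs more cores
    // than receivers or none remain for processing.
    val numReceivers = 5
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "consumer-group",
        Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    val messages = ssc.union(streams)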

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Matt Narrell
… new Tuple2(tuple._2().getDeviceId(), 1); } }); … and further Spark functions. On Sep 23, 2014, Tim Smith wrote: > Posting your code would be really helpful in figuring out gotchas.

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-08 Thread Matt Narrell
I came across this: https://github.com/xerial/sbt-pack. Until I found this, I was simply using the sbt-assembly plugin (sbt clean assembly). mn On Sep 4, 2014, Aris wrote: > Thanks for answering, Daniil. I have SBT version 0.13.5; is that an old version?
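
For reference, both plugins are wired in through project/plugins.sbt. A sketch; the version numbers are illustrative for that era:

    // sbt-assembly builds one fat jar; sbt-pack instead collects dependency
    // jars plus launch scripts, avoiding the slow assembly merge step.
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
    addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.5")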

Re: Spark on YARN question

2014-09-02 Thread Matt Narrell
I’ve put my Spark JAR into HDFS and set the SPARK_JAR variable to point to the HDFS location of the jar. I’m not using any specialized configuration files (like spark-env.sh), but rather setting things either by environment variable per node, by passing application arguments to the job, or …

Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
Hello, I’m using Spark Streaming to aggregate data from a Kafka topic in sliding windows. Usually we want to persist this aggregated data to a MongoDB cluster, or republish it to a different Kafka topic. When I include these 3rd-party drivers, I usually get a NotSerializableException due to the …
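
A common workaround, discussed in the replies, is to construct the driver object on the executors rather than closing over it. A sketch, assuming windowed is an existing DStream; MongoClient stands in for any non-serializable 3rd-party client, and the host is a placeholder:

    // Create the client inside foreachPartition so it is instantiated on the
    // executor and never serialized with the closure.
    windowed.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = new com.mongodb.MongoClient("mongo-host")
        records.foreach { record =>
          // write record through the client here
        }
        client.close()
      }
    }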

Re: Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
… serialization. On Tue, Sep 2, 2014, Matt Narrell wrote: > Hello, I’m using Spark Streaming to aggregate data from a Kafka topic in sliding windows. Usually we want to persist this aggregated data to a MongoDB cluster, or republish it to a different Kafka topic …

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
… wrote: > Hi Matt, I checked in the YARN code and I don't see any references to yarn.resourcemanager.address. Have you made sure that your YARN client configuration on the node you're launching from contains the right configs? -Sandy

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
… On Wed, Aug 20, 2014, Matt Narrell wrote: > However, now the Spark jobs running in the ApplicationMaster on a given node fail to find the active ResourceManager. Below is a log excerpt from one of the assigned nodes. As all the jobs fail, eventually YARN will move …

Re: spark-submit with HA YARN

2014-08-20 Thread Matt Narrell
OK Marcelo, thanks for the quick and thorough replies. I’ll keep an eye on these tickets and the mailing list to see how things move along. mn On Aug 20, 2014, Marcelo Vanzin wrote: > Hi, …

spark-submit with HA YARN

2014-08-18 Thread Matt Narrell
Hello, I have an HA-enabled YARN cluster with two resource managers. When submitting jobs via “spark-submit --master yarn-cluster”, it appears that the driver is looking explicitly for the yarn.resourcemanager.address property rather than round-robining through the resource managers via the …

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Matt Narrell
I’d suggest something like Apache YARN, or Apache Mesos with Marathon, or something similar to allow for management, in particular restart on failure. mn On Aug 13, 2014, Tobias Pfeiffer wrote: > Hi, …