Self contained Spark application with local master without spark-submit

2022-01-19 Thread Colin Williams
Hello, I noticed I can run spark applications with a local master via sbt run and also via the IDE. I'd like to run a single-threaded worker application as a self-contained jar. What does sbt run employ that allows it to run a local master? Can I build an uber jar and run without spark-submit?
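A minimal sketch of what such a self-contained application might look like: setting the master programmatically removes the need for spark-submit, assuming the Spark dependencies are bundled in the uber jar rather than marked "provided" (sbt run works precisely because it puts the full dependency classpath on the JVM, which a provided-scope assembly does not).

```scala
import org.apache.spark.sql.SparkSession

object LocalApp {
  def main(args: Array[String]): Unit = {
    // Setting the master here means no spark-submit is required.
    val spark = SparkSession.builder()
      .appName("local-app")
      .master("local[1]")   // single-threaded local master
      .getOrCreate()

    println(spark.range(10).count())
    spark.stop()
  }
}
```

With Spark on the assembly's compile classpath, something like `java -cp app-assembly.jar LocalApp` should then run it directly.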

Re: Corrupt record handling in spark structured streaming and from_json function

2018-12-31 Thread Colin Williams
o go about this, particularly without using multiple streams? On Wed, Dec 26, 2018 at 6:01 PM Colin Williams wrote: > > https://stackoverflow.com/questions/53938967/writing-corrupt-data-from-kafka-json-datasource-in-spark-structured-streaming > > On Wed, Dec 26, 2018 at 2:42 PM C

Re: Corrupt record handling in spark structured streaming and from_json function

2018-12-26 Thread Colin Williams
https://stackoverflow.com/questions/53938967/writing-corrupt-data-from-kafka-json-datasource-in-spark-structured-streaming On Wed, Dec 26, 2018 at 2:42 PM Colin Williams wrote: > > From my initial impression it looks like I'd need to create my own > `from_json` using `jsonT

Re: Corrupt record handling in spark structured streaming and from_json function

2018-12-26 Thread Colin Williams
From my initial impression it looks like I'd need to create my own `from_json` using `jsonToStructs` as a reference but try to handle `case _: BadRecordException => null` or similar to try to write the non-matching string to a corrupt records column On Wed, Dec 26, 2018 at 1:

Corrupt record handling in spark structured streaming and from_json function

2018-12-26 Thread Colin Williams
Hi, I'm trying to figure out how I can write records that don't match a json read schema via spark structured streaming to an output sink / parquet location. Previously I did this in batch via the corrupt column features of the batch reader. But in this spark structured streaming I'm reading from kafka a string a
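One hedged sketch of a workaround: keep the raw Kafka string next to the parsed struct, since `from_json` yields null for rows that don't match the schema, and route the nulls to a corrupt-records sink. All names below are assumptions, and each filtered branch still becomes its own streaming query unless something like `foreachBatch` (Spark 2.4+) writes both from one query.

```scala
import org.apache.spark.sql.functions.{col, from_json}

// `kafkaDf` is the Kafka source DataFrame and `schema` the expected
// read schema (a StructType); both are assumed for illustration.
val parsed = kafkaDf
  .selectExpr("CAST(value AS STRING) AS raw")
  .withColumn("data", from_json(col("raw"), schema))

// Rows that parsed cleanly vs. rows from_json could not match.
val good    = parsed.where(col("data").isNotNull).select("data.*")
val corrupt = parsed.where(col("data").isNull).select("raw")
```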

Re: Packaging kafka certificates in uber jar

2018-12-26 Thread Colin Williams
, 2018 at 5:26 AM Anastasios Zouzias wrote: > > Hi Colin, > > You can place your certificates under src/main/resources and include them in > the uber JAR, see e.g. : > https://stackoverflow.com/questions/40252652/access-files-in-resources-directory-in-jar-from-apache-spark-streamin
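A sketch of what that suggestion can look like in practice: Kafka's SSL options take filesystem paths rather than classpath URLs, so a truststore bundled under src/main/resources generally has to be copied out of the jar to a temp file first. The resource name below is hypothetical.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

def resourceToTempFile(name: String): String = {
  // Copy the bundled resource out of the uber jar to a real file,
  // because Kafka's ssl.* settings expect a filesystem path.
  val in  = getClass.getResourceAsStream(s"/$name")
  val tmp = File.createTempFile(name, null)
  tmp.deleteOnExit()
  Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
  tmp.getAbsolutePath
}

val truststore = resourceToTempFile("kafka.truststore.jks")
// e.g. .option("kafka.ssl.truststore.location", truststore)
```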

Packaging kafka certificates in uber jar

2018-12-24 Thread Colin Williams
I've been trying to read from kafka via a spark streaming client. I found out the spark cluster doesn't have certificates deployed. Then I tried using the same local certificates I've been testing with by packing them in an uber jar and getting a File handle from the Classloader resource. But I'm getti

Re: Casting nested columns and updated nested struct fields.

2018-11-23 Thread Colin Williams
Looks like it's been reported already. It's too bad it's been a year but should be released into spark 3: https://issues.apache.org/jira/browse/SPARK-22231 On Fri, Nov 23, 2018 at 8:42 AM Colin Williams wrote: > > Seems like it's worthy of filing a bug against withColum

Re: Casting nested columns and updated nested struct fields.

2018-11-23 Thread Colin Williams
Seems like it's worthy of filing a bug against withColumn On Wed, Nov 21, 2018, 6:25 PM Colin Williams < colin.williams.seat...@gmail.com wrote: > Hello, > > I'm currently trying to update the schema for a dataframe with nested > columns. I would either like to update

Casting nested columns and updated nested struct fields.

2018-11-21 Thread Colin Williams
Hello, I'm currently trying to update the schema for a dataframe with nested columns. I would either like to update the schema itself or cast the column without having to explicitly select all the columns just to cast one. In regards to updating the schema it looks like I would probably need to w
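For reference, a sketch of the workaround this thread is trying to avoid, as it stood before the nested-field improvements tracked in SPARK-22231: casting one nested field means rebuilding the whole enclosing struct by hand. Column and field names here are hypothetical.

```scala
import org.apache.spark.sql.functions.{col, struct}

val updated = df.withColumn("outer", struct(
  col("outer.a").as("a"),                 // untouched fields, re-listed
  col("outer.b").cast("long").as("b")))   // the one field being cast
```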

inferred schemas for spark streaming from a Kafka source

2018-11-13 Thread Colin Williams
Does anybody know how to use inferred schemas with structured streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets I have some code like : object StreamingApp { def launch(config: Config, spa
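A hedged sketch of the usual workaround: the Kafka source will not infer a JSON schema for a streaming query, so the schema is inferred once from a static sample and reused. The sample path, broker, and topic below are assumptions.

```scala
import org.apache.spark.sql.functions.{col, from_json}

// Infer the schema once from a static sample of the same data...
val sampleSchema = spark.read.json("/data/sample/").schema

// ...then reuse it for the streaming read.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw")
  .select(from_json(col("raw"), sampleSchema).as("data"))
```

(The `spark.sql.streaming.schemaInference` setting in the linked docs applies to file-based streaming sources, not the Kafka source.)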

Dataframe reader does not read microseconds, but TimestampType supports microseconds

2018-07-02 Thread Colin Williams
I'm confused as to why Spark's DataFrame reader does not support reading json or similar with microsecond timestamps at microsecond precision, but instead reads them into millis. This seems strange when TimestampType supports microseconds. For example create a schema for a json object with a column of Times
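A sketch of a commonly suggested workaround, under the assumption that the 2.x-era JSON reader's SimpleDateFormat-based parsing is what drops the sub-millisecond digits: read the column as a string and cast it afterwards, since the string-to-timestamp cast does handle microsecond fractions. The field and file names are hypothetical.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Read "ts" as a plain string first, then cast to TimestampType.
val schema = StructType(Seq(StructField("ts", StringType)))
val df = spark.read.schema(schema).json("events.json")
  .withColumn("ts", col("ts").cast(TimestampType))
```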

Specifying a custom Partitioner on RDD creation in Spark 2

2018-04-10 Thread Colin Williams
've found one such example https://stackoverflow.com/a/25204589 but it's from an older version of Spark. I'm hoping maybe there is something more recent and more in-depth. I don't mind references to books or otherwise. Best, Colin Williams
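A minimal sketch of the Spark 2 shape of this, which is unchanged from the older answer: the usual hook is partitionBy on a pair RDD right after creation. Int keys and `sc` are assumed.

```scala
import org.apache.spark.Partitioner

class ModPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    // Non-negative modulo, since Scala's % can return negative values.
    val m = key.asInstanceOf[Int] % numPartitions
    if (m < 0) m + numPartitions else m
  }
}

val rdd = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new ModPartitioner(4))
```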

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
") .load("src/test/resources/*.gz") df1.show(80) On Wed, Mar 28, 2018 at 5:10 PM, Colin Williams wrote: > I've had more success exporting the schema toJson and importing that. > Something like: > > > val df1: DataFrame = session.read > .format("json"
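For reference, a sketch of that round trip: `schema.json` serializes the schema and `DataType.fromJson` parses it back, whereas the `toString` form has no public parser, which is presumably why importing it failed.

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Export the schema as JSON, then parse it back into a StructType.
val exported: String = df1.schema.json
val imported = DataType.fromJson(exported).asInstanceOf[StructType]

val df2 = session.read.schema(imported).json("src/test/resources/*.gz")
```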

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
c/test/resources/*.gz") df1.show(80) On Wed, Mar 28, 2018 at 3:25 PM, Colin Williams wrote: > The to String representation look like where "someName" is unique: > > StructType(StructField("someName",StringType,true), > StructField("someName",St

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
ructField("someName",StringType,true), StructField("someName",StringType,true), StructField("someName",StringType,true)) The catalogString looks something like where SOME_TABLE_NAME is unique: struct, SOME_TABLE_NAME:struct,SOME_TABLE_NAME:struct, SOME_TABLE_NAME:struct,SOME_TAB

spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
I've been learning spark-sql and have been trying to export and import some of the generated schemas to edit them. I've been writing the schemas to strings like df1.schema.toString() and df.schema.catalogString But I've been having trouble loading the schemas created. Does anyone know if it's poss

Re: classpath conflict with spark internal libraries and the spark shell.

2016-09-09 Thread Colin Kincaid Williams
My bad, gothos on IRC pointed me to the docs: http://jhz.name/2016/01/10/spark-classpath.html Thanks Gothos! On Fri, Sep 9, 2016 at 9:23 PM, Colin Kincaid Williams wrote: > I'm using the spark shell v1.6.1. I have a classpath conflict, where I > have an external library ( not

classpath conflict with spark internal libraries and the spark shell.

2016-09-09 Thread Colin Kincaid Williams
I'm using the spark shell v1.6.1. I have a classpath conflict, where I have an external library ( not OSS either :( , can't rebuild it.) using httpclient-4.5.2.jar. I use spark-shell --jars file:/path/to/httpclient-4.5.2.jar. However spark is using httpclient-4.3 internally. Then when I try to use

How to get the parameters of bestmodel while using paramgrid and crossvalidator?

2016-08-08 Thread colin
I'm using CrossValidator and paramgrid to find the best parameters of my model. After crossvalidating, I got a CrossValidatorModel, but when I want to get the parameters of my pipeline, there's nothing in the parameter list of bestmodel. Here's the code running on jupyter notebook: sq=SQLContext(sc) d
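The lookup that usually resolves this, sketched in Scala for illustration: `bestModel` is the fitted pipeline, so the grid-searched values live on its stages rather than on the CrossValidatorModel itself. `cv` and `training` are assumed.

```scala
import org.apache.spark.ml.PipelineModel

val cvModel = cv.fit(training)
val best = cvModel.bestModel.asInstanceOf[PipelineModel]
// The tuned parameter values are on the fitted stages.
best.stages.foreach(stage => println(stage.extractParamMap()))
```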

"object cannot be cast to Double" using pipline with pyspark

2016-08-03 Thread colin
My colleagues use scala and I use python. They save a hive table, which has doubletype columns. However there's no double in python. When I use pipeline.fit(dataframe), there occurred an error: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.lang.Double. I guess i
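A generic sketch (in Scala) of the usual first step for this class of error: align the column types with what the pipeline stage expects before fitting. Whether a plain cast fixes this particular `[Ljava.lang.Object` case depends on the actual schema; the column name is hypothetical.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Ensure the column really is DoubleType before pipeline.fit(...).
val fixed = df.withColumn("feature", col("feature").cast(DoubleType))
```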

Re: Run times for Spark 1.6.2 compared to 2.1.0?

2016-07-28 Thread Colin Beckingham
On 27/07/16 16:31, Colin Beckingham wrote: I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It calculates a logistic model using MLlib. I compiled the 2.1 today from source and took the version 1 as a precompiled version with Hadoop. The odd thing is that on 1.6.2 the project

Run times for Spark 1.6.2 compared to 2.1.0?

2016-07-27 Thread Colin Beckingham
I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It calculates a logistic model using MLlib. I compiled the 2.1 today from source and took the version 1 as a precompiled version with Hadoop. The odd thing is that on 1.6.2 the project produces an answer in 350 sec and the 2.1.0 ta

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
Streaming UI tab showing empty events and very different metrics than on 1.5.2 On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams wrote: > After a bit of effort I moved from a Spark cluster running 1.5.2, to a > Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
sible my issues were related to running on the Spark 1.5.2 cluster. Also is the missing event count in the completed batches a bug? Should I file an issue? On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams wrote: > Thanks @Cody, I will try that out. In the interm, I tried to validate > my

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
ion and just measure what your read > performance is by doing something like > > createDirectStream(...).foreach(_.println) > > not take() or print() > > On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams > wrote: >> @Cody I was able to bring my processing ti
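A sketch of that baseline with the Spark 1.x direct-stream API, reading and counting with no HBase write so only raw Kafka throughput is measured. `ssc`, the brokers, and the topic are assumed.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic"))

// Count per batch instead of writing, to isolate read performance.
stream.foreachRDD(rdd => println(s"read ${rdd.count()} records"))
```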

Re: Improving performance of a kafka spark streaming app

2016-06-21 Thread Colin Kincaid Williams
looking for advice regarding # Kafka Topic Partitions / Streaming Duration / maxRatePerPartition / any other spark settings or code changes that I should make to try to get a better consumption rate. Thanks for all the help so far, this is the first Spark application I have written. On Mon, Jun 2

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Colin Kincaid Williams
ocessing time is > 1.16 seconds, you're always going to be falling behind. That would > explain why you've built up an hour of scheduling delay after eight > hours of running. > > On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams > wrote: >> Hi Mich again,

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
I'm attaching a picture from the streaming UI. On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams wrote: > There are 25 nodes in the spark cluster. > > On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh > wrote: >> how many nodes are in your cluster?

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
On 18 June 2016 at 20:40, Colin Kincaid Williams wrote: >> >> I updated my app to Spark 1

Re: Improving performance of a kafka spark streaming app

2016-06-18 Thread Colin Kincaid Williams
-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/* \ /home/colin.williams/kafka-hbase.jar "FromTable" "ToTable" "broker1:9092,broker2:9092" On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams wrote: > Thanks Cody, I can see that the partitions are well distributed... > Then

Re: Couldn't find leader offsets

2016-05-19 Thread Colin Hall
Hey Cody, thanks for the response. I looked at connection as a possibility based on your advice and after a lot of digging found a couple of things mentioned on SO and kafka lists about name resolution causing issues. I created an entry in /etc/hosts on the spark host to resolve the broker to

Mllib using model to predict probability

2016-05-04 Thread colin
In 2-class problems, when I use SVM, RandomForest models to do classifications, they predict "0" or "1". And when I use ROC to evaluate the model, sometimes I need a probability that a record belongs to "0" or "1". In scikit-learn, every model can do "predict" and "predict_prob", which the last one
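In the RDD-based MLlib of that era, the closest equivalent is hedged: clearing the model's threshold makes SVM (and logistic regression) return raw scores instead of 0/1 labels, while calibrated probabilities come from the spark.ml classifiers' probability column instead. A sketch, with `training` and `test` (RDDs of LabeledPoint) assumed:

```scala
import org.apache.spark.mllib.classification.SVMWithSGD

val model = SVMWithSGD.train(training, 100)
model.clearThreshold()   // predict() now returns raw margins, not 0/1
val scores = test.map(p => (model.predict(p.features), p.label))
```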

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
tributing across partitions evenly). > > On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams wrote: >> Thanks again Cody. Regarding the details 66 kafka partitions on 3 >> kafka servers, likely 8 core systems with 10 disks each. Maybe the >> issue with the receiver was the large n

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
> Really though, I'd try to start with spark 1.6 and direct streams, or > even just kafkacat, as a baseline. > > > > On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams wrote: >> Hello again. I searched for "backport kafka" in the list archives but

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
ing with 1.3. If you're stuck > on 1.2, I believe there have been some attempts to backport it, search > the mailing list archives. > > On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams > wrote: >> I've written an application to get content from a kafka topic w

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
m the docs. Then maybe I should try creating multiple streams to get more throughput? Thanks, Colin Williams On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger wrote: > Have you tested for read throughput (without writing to hbase, just > deserialize)? > > Are you limited to using

Re: Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
However I should give it a shot if I abandon Spark 1.2, and my current environment. Thanks, Colin Williams On Mon, May 2, 2016 at 6:06 PM, Krieg, David wrote: > Spark 1.2 is a little old and busted. I think most of the advice you'll get is > to try to use Spark 1.3 at least, which i

Improving performance of a kafka spark streaming app

2016-05-02 Thread Colin Kincaid Williams
I've written an application to get content from a kafka topic with 1.7 billion entries, get the protobuf serialized entries, and insert into hbase. Currently the environment that I'm running in is Spark 1.2. With 8 executors and 2 cores, and 2 jobs, I'm only getting between 0-2500 writes / second

MLlib ALS MatrixFactorizationModel.save fails consistently

2016-04-07 Thread Colin Woodbury
Hi all, I've implemented most of a content recommendation system for a client. However, whenever I attempt to save a MatrixFactorizationModel I've trained, I see one of four outcomes: 1. Despite "save" being wrapped in a "try" block, I see a massive stack trace quoting some java.io classes. The M

[MLlib - ALS] Merging two Models?

2016-03-10 Thread Colin Woodbury
their feature matrices be merged? For instance via: 1. Adding feature vectors together for user/product vectors that appear in both models 2. Averaging said vectors instead 3. Some other linear algebra operation Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS itself. Is what I'm asking possible? Thank you, Colin

Running out of memory locally launching multiple spark jobs using spark yarn / submit from shell script.

2016-01-17 Thread Colin Kincaid Williams
I launch around 30-60 of these jobs defined like start-job.sh in the background from a wrapper script. I wait about 30 seconds between launches, then the wrapper monitors yarn to determine when to launch more. There is a limit defined at around 60 jobs, but even if I set it to 30, I run out of memo

Compiling only MLlib?

2016-01-15 Thread Colin Woodbury
y laptop (4g of RAM, Arch Linux) when I try to build with Scala 2.11 support. No matter how I tweak JVM flags to reduce maximum RAM use, the build always crashes. When trying to build Spark 1.6.0 for Scala 2.10 just now, the build had compilation errors. Here is one, as a sample. I've saved

Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Colin Alstad
e is that there is an order of magnitude difference between the count of the join DataFrame and the persisted join DataFrame. Secondly, persisting the same DataFrame into 2 different formats yields different results. Does anyone have any idea on what could be going on here? -- Colin Alstad

SparkR: exported functions

2015-08-25 Thread Colin Gillespie
NAMESPACE file. This is obviously due to the #' missing in the roxygen2 directives. Is this intentional? Thanks Colin

Issue with PySpark UDF on a column of Vectors

2015-06-17 Thread Colin Alstad
kException: Job aborted due to stage failure: Task 5 in stage 31.0 failed 1 times, most recent failure: Lost task 5.0 in stage 31.0 (TID 95, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/colin/src/spark/python/lib/pyspark.zip/py

Re: KafkaUtils and specifying a specific partition

2015-03-12 Thread Colin McQueen
Thanks! :) Colin McQueen *Software Developer* On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele wrote: > Hi Colin, > my understanding is that this is currently not possible with KafkaUtils. > You would have to write a custom receiver using Kafka's SimpleConsumer API. > > http

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Colin Kincaid Williams
he info in one place. > > On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams > wrote: > >> Looks like in my tired state, I didn't mention spark the whole time. >> However, it might be implied by the application log above. Spark log >> aggregation appears to b

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Colin Kincaid Williams
your yarn history server? If so can you share any spark settings? On Tue, Feb 24, 2015 at 8:48 AM, Christophe Préaud < christophe.pre...@kelkoo.com> wrote: > Hi Colin, > > Here is how I have configured my hadoop cluster to have yarn logs > available through both the yarn CLI a

How to get yarn logs to display in the spark or yarn history-server?

2015-02-23 Thread Colin Kincaid Williams
Hi, I have been trying to get my yarn logs to display in the spark history-server or yarn history-server. I can see the log information yarn logs -applicationId application_1424740955620_0009 15/02/23 22:15:14 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to us3sm2hbqa04r07-comp-pr

Re: Short Circuit Local Reads

2014-10-01 Thread Colin McCabe
call for short-circuit reads on cdh5. Part of the reason that hasn't been implemented yet is that one of the main advantages of short-circuit is reduced CPU consumption, and we felt spawning more threads might cut into that. We could implement it pretty easily if people wanted it, but th