Hello,
I noticed I can run Spark applications with a local master via sbt run
and also via the IDE. I'd like to run a single-threaded worker
application as a self-contained jar.
What does sbt run employ that allows it to run a local master?
Can I build an uber jar and run without spark-submit?
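For context, sbt run isn't doing anything Spark-specific: it puts the
project's runtime classpath (which here evidently includes the Spark
jars) on the JVM and invokes the main class, and a local master then runs
entirely inside that JVM, so spark-submit is not required. A minimal
sketch of a self-contained app, assuming Spark 2.x and the spark-sql
dependency left at compile scope (not "provided") so it lands in the uber
jar; names and paths are placeholders:

  import org.apache.spark.sql.SparkSession

  object LocalApp {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[1]")   // single-threaded local master, no cluster or spark-submit
        .appName("self-contained-example")
        .getOrCreate()

      spark.range(10).show()
      spark.stop()
    }
  }

After sbt assembly, something like
  java -cp target/scala-2.11/yourapp-assembly.jar LocalApp
should work; you may need an assemblyMergeStrategy for the META-INF
service files and reference.conf that Spark and its dependencies ship.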
o go about this, particularly without using multiple
streams?
On Wed, Dec 26, 2018 at 6:01 PM Colin Williams
wrote:
>
> https://stackoverflow.com/questions/53938967/writing-corrupt-data-from-kafka-json-datasource-in-spark-structured-streaming
>
> On Wed, Dec 26, 2018 at 2:42 PM C
https://stackoverflow.com/questions/53938967/writing-corrupt-data-from-kafka-json-datasource-in-spark-structured-streaming
On Wed, Dec 26, 2018 at 2:42 PM Colin Williams
wrote:
>
> From my initial impression it looks like I'd need to create my own
> `from_json` using `jsonT
From my initial impression it looks like I'd need to create my own
`from_json` using `jsonToStructs` as a reference, but try to handle
`case _: BadRecordException => null` or similar to try to write the
non-matching string to a corrupt-records column
On Wed, Dec 26, 2018 at 1:
Hi,
I'm trying to figure out how I can write records that don't match a
JSON read schema via Spark Structured Streaming to an output sink /
parquet location. Previously I did this in batch via the corrupt-record
column feature of the batch reader. But in this Spark Structured
Streaming job I'm reading from Kafka a string a
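A minimal single-stream sketch of the simpler alternative to a custom
from_json: since from_json returns null for records that don't match the
schema, keeping the raw Kafka value next to the parsed struct gives a
corrupt-record column much like the batch reader's. Broker, topic, schema
and paths are placeholders, and spark-sql-kafka-0-10 is assumed on the
classpath:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, from_json, when}
  import org.apache.spark.sql.types.{LongType, StringType, StructType}

  object CorruptRecordColumn {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("corrupt-record-column").getOrCreate()
      import spark.implicits._

      // Placeholder schema for the expected JSON payload.
      val schema = new StructType().add("id", LongType).add("payload", StringType)

      val parsed = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "some_topic")
        .load()
        .select(col("value").cast("string").as("raw"))
        .withColumn("parsed", from_json($"raw", schema))
        // from_json is null when parsing fails, so keep the raw string for those rows only.
        .withColumn("_corrupt_record", when($"parsed".isNull, $"raw"))
        .select($"parsed.*", $"_corrupt_record")

      parsed.writeStream
        .format("parquet")
        .option("path", "/tmp/out")
        .option("checkpointLocation", "/tmp/out_chk")
        .start()
        .awaitTermination()
    }
  }

Rows that failed to parse come out with null data columns and the
original string in _corrupt_record, so they can be filtered out of the
same parquet location afterwards.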
, 2018 at 5:26 AM Anastasios Zouzias wrote:
>
> Hi Colin,
>
> You can place your certificates under src/main/resources and include them in
> the uber JAR, see e.g. :
> https://stackoverflow.com/questions/40252652/access-files-in-resources-directory-in-jar-from-apache-spark-streamin
I've been trying to read from Kafka via a Spark streaming client. I
found out the Spark cluster doesn't have the certificates deployed. Then I
tried using the same local certificates I've been testing with by
packing them in an uber jar and getting a File handle from the
Classloader resource. But I'm getti
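If it is the usual "cannot open the truststore" failure: an entry inside a
jar is not a java.io.File on disk, so a common workaround is to copy the
classpath resource to a temp file and hand that path to the Kafka SSL
options. A hedged sketch; the resource and option names are illustrative,
and on a cluster the copy has to happen wherever the consumer actually
runs:

  import java.io.File
  import java.nio.file.{Files, StandardCopyOption}

  object ResourceToFile {
    // Copies a resource bundled in the uber jar (e.g. src/main/resources/kafka.truststore.jks)
    // to a temporary file and returns its absolute path.
    def materialize(resource: String): String = {
      val in = getClass.getResourceAsStream(s"/$resource")
      require(in != null, s"resource $resource not found on the classpath")
      val tmp = File.createTempFile("truststore-", ".jks")
      tmp.deleteOnExit()
      Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
      in.close()
      tmp.getAbsolutePath
    }
  }

  // e.g. .option("kafka.ssl.truststore.location", ResourceToFile.materialize("kafka.truststore.jks"))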
Looks like it's been reported already. It's too bad it's been a year,
but it should be released in Spark 3:
https://issues.apache.org/jira/browse/SPARK-22231
On Fri, Nov 23, 2018 at 8:42 AM Colin Williams
wrote:
>
> Seems like it's worthy of filing a bug against withColum
Seems like it's worthy of filing a bug against withColumn
On Wed, Nov 21, 2018, 6:25 PM Colin Williams <
colin.williams.seat...@gmail.com> wrote:
> Hello,
>
> I'm currently trying to update the schema for a dataframe with nested
> columns. I would either like to update
Hello,
I'm currently trying to update the schema for a dataframe with nested
columns. I would either like to update the schema itself or cast the
column without having to explicitly select all the columns just to
cast one.
Regarding updating the schema, it looks like I would probably need
to w
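One trick that avoids re-selecting every column, offered as a sketch
rather than the only way: cast the whole struct column to the target
struct type and the nested fields are cast in place. Column and field
names below are made up:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  object NestedCast {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[1]").appName("nested-cast").getOrCreate()
      import spark.implicits._

      // toDF on a tuple gives "address" as struct<_1:string,_2:string>.
      val df = Seq((1, ("5th Ave", "10001"))).toDF("id", "address")

      // Cast the entire struct to the desired struct type; only "address" changes,
      // every other column keeps its place without being listed.
      val updated = df.withColumn("address",
        col("address").cast("struct<_1:string,_2:long>"))
      updated.printSchema()
      spark.stop()
    }
  }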
Does anybody know how to use inferred schemas with structured
streaming:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets
I have some code like :
object StreamingApp {
def launch(config: Config, spa
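For the file-based sources described on that page, inference is off by
default and is switched on with spark.sql.streaming.schemaInference; a
small sketch (the input path is a placeholder):

  import org.apache.spark.sql.SparkSession

  object InferredStreamingSchema {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("schema-inference").getOrCreate()

      // Applies to file sources (json, csv, ...); off by default for streaming reads.
      spark.conf.set("spark.sql.streaming.schemaInference", "true")

      val stream = spark.readStream.json("/data/incoming/")
      stream.printSchema()
    }
  }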
I'm confused as to why Spark's DataFrame reader does not support reading
JSON (or similar) with microsecond timestamps at microsecond precision,
but instead reads them into millis.
This seems strange when the TimestampType supports microseconds.
For example create a schema for a json object with a column of
Times
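A possible workaround, hedged because I have not verified it on every
version (spark.read.json on a Dataset[String] needs 2.2+): declare the
field as StringType in the read schema and cast afterwards; the
string-to-timestamp cast goes through Spark's internal parser, which
keeps up to six fractional digits. The field name is a placeholder:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col
  import org.apache.spark.sql.types.{StringType, StructType}

  object MicrosecondTimestamps {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[1]").appName("micros").getOrCreate()
      import spark.implicits._

      // Read the timestamp column as a plain string first ...
      val schema = new StructType().add("event_time", StringType)
      val raw = Seq("""{"event_time": "2018-01-01 12:00:00.123456"}""").toDS()
      val df = spark.read.schema(schema).json(raw)

      // ... then cast; the microseconds survive the cast to TimestampType.
      df.withColumn("event_time", col("event_time").cast("timestamp")).show(false)
      spark.stop()
    }
  }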
I've found
one such example https://stackoverflow.com/a/25204589 but it's from an
older version of Spark.
I'm hoping maybe there is something more recent and more in-depth. I
don't mind references to books or otherwise.
Best,
Colin Williams
.format("json")
.load("src/test/resources/*.gz")
df1.show(80)
On Wed, Mar 28, 2018 at 5:10 PM, Colin Williams
wrote:
> I've had more success exporting the schema toJson and importing that.
> Something like:
>
>
> val df1: DataFrame = session.read
> .format("json")
.load("src/test/resources/*.gz")
df1.show(80)
On Wed, Mar 28, 2018 at 3:25 PM, Colin Williams
wrote:
> The toString representation looks like the following, where "someName" is unique:
>
> StructType(StructField("someName",StringType,true),
> StructField("someName",St
StructField("someName",StringType,true),
StructField("someName",StringType,true),
StructField("someName",StringType,true))
The catalogString looks something like the following, where SOME_TABLE_NAME is unique:
struct,
SOME_TABLE_NAME:struct,SOME_TABLE_NAME:struct,
SOME_TABLE_NAME:struct,SOME_TAB
I've been learning spark-sql and have been trying to export and import
some of the generated schemas to edit them. I've been writing the
schemas to strings like df1.schema.toString() and
df.schema.catalogString.
But I've been having trouble loading the schemas I created. Does anyone
know if it's poss
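For the archives, the round trip that does work (mentioned up-thread) is
the JSON form of the schema; toString and catalogString are display
formats that are not meant to be read back in. A short sketch reusing the
paths from the earlier messages:

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.types.{DataType, StructType}

  object SchemaRoundTrip {
    def main(args: Array[String]): Unit = {
      val session = SparkSession.builder().master("local[1]").appName("schema-roundtrip").getOrCreate()

      val df1: DataFrame = session.read.format("json").load("src/test/resources/*.gz")

      // schema.json (or prettyJson) is designed to be parsed back.
      val exported: String = df1.schema.json
      val restored: StructType = DataType.fromJson(exported).asInstanceOf[StructType]

      // Re-read with the restored schema, e.g. after editing the JSON by hand.
      val df2 = session.read.format("json").schema(restored).load("src/test/resources/*.gz")
      df2.show(80)
    }
  }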
My bad, Gothos on IRC pointed me to the docs:
http://jhz.name/2016/01/10/spark-classpath.html
Thanks Gothos!
On Fri, Sep 9, 2016 at 9:23 PM, Colin Kincaid Williams wrote:
> I'm using the spark-shell v1.6.1. I have a classpath conflict, where I
> have an external library ( not
I'm using the spark-shell v1.6.1. I have a classpath conflict, where I
have an external library (not OSS either :( , so I can't rebuild it)
that uses httpclient-4.5.2.jar. I use spark-shell --jars
file:/path/to/httpclient-4.5.2.jar
However, Spark is using httpclient-4.3 internally. Then when I try to
use
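For the archives, one common fix for this kind of conflict is to put the
newer jar ahead of Spark's own copy on the driver and executor
classpaths, since extraClassPath entries are prepended; the paths below
are placeholders, and spark.driver.userClassPathFirst is an alternative
with its own caveats:

  spark-shell --jars file:/path/to/httpclient-4.5.2.jar \
    --conf spark.driver.extraClassPath=/path/to/httpclient-4.5.2.jar \
    --conf spark.executor.extraClassPath=/path/to/httpclient-4.5.2.jar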
I'm using CrossValidator and a param grid to find the best parameters of my
model.
After cross-validation I got a CrossValidatorModel, but when I want to get the
parameters of my pipeline, there's nothing in the parameter list of
bestModel.
Here's the code, running in a Jupyter notebook:
sq=SQLContext(sc)
d
My colleagues use Scala and I use Python.
They save a Hive table, which has DoubleType columns. However there's no
double in Python.
When I use pipeline.fit(dataframe), an error occurred:
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
java.lang.Double.
I guess i
On 27/07/16 16:31, Colin Beckingham wrote:
I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It
calculates a logistic model using MLlib. I compiled 2.1.0 today from
source and used version 1.6.2 as a precompiled build with Hadoop.
The odd thing is that on 1.6.2 the project
I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It
calculates a logistic model using MLlib. I compiled 2.1.0 today from
source and used version 1.6.2 as a precompiled build with Hadoop. The
odd thing is that on 1.6.2 the project produces an answer in 350 sec and
the 2.1.0 ta
Streaming UI tab showing empty events and very different metrics than on 1.5.2
On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams wrote:
> After a bit of effort I moved from a Spark cluster running 1.5.2, to a
> Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The
sible my issues were related to running on the Spark
1.5.2 cluster. Also is the missing event count in the completed
batches a bug? Should I file an issue?
On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams wrote:
> Thanks @Cody, I will try that out. In the interim, I tried to validate
> my
ion and just measure what your read
> performance is by doing something like
>
> createDirectStream(...).foreach(_.println)
>
> not take() or print()
>
> On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams
> wrote:
>> @Cody I was able to bring my processing ti
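Spelling out that read-only baseline as a sketch (direct stream,
deserialize, count, nothing else), with placeholder broker list and topic
name, and assuming the spark-streaming-kafka 0.8 artifact:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object ReadOnlyBaseline {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(new SparkConf().setAppName("read-only-baseline"), Seconds(10))

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("some_topic"))

      // Count per batch only -- no HBase writes -- to see the raw read rate.
      stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

      ssc.start()
      ssc.awaitTermination()
    }
  }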
looking for advice regarding # Kafka Topic Partitions / Streaming
Duration / maxRatePerPartition / any other spark settings or code
changes that I should make to try to get a better consumption rate.
Thanks for all the help so far, this is the first Spark application I
have written.
On Mon, Jun 2
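For anyone searching later, the per-partition cap (the maxRPP mentioned
earlier) and the backpressure flag that usually goes with it are plain
SparkConf settings; the values here are placeholders, not
recommendations:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kafka-consumer")
    // Upper bound on records read per second from each Kafka partition.
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")
    // Lets Spark (1.5+) adapt the actual ingestion rate to batch processing times.
    .set("spark.streaming.backpressure.enabled", "true")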
ocessing time is
> 1.16 seconds, you're always going to be falling behind. That would
> explain why you've built up an hour of scheduling delay after eight
> hours of running.
>
> On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams
> wrote:
>> Hi Mich again,
c?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
I'm attaching a picture from the streaming UI.
On Sat, Jun 18, 2016 at 7:59 PM, Colin Kincaid Williams wrote:
> There are 25 nodes in the spark cluster.
>
> On Sat, Jun 18, 2016 at 7:53 PM, Mich Talebzadeh
> wrote:
>> how many nodes are in your cluster?
>>
> On 18 June 2016 at 20:40, Colin Kincaid Williams wrote:
>>
>> I updated my app to Spark 1
-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/*
\
/home/colin.williams/kafka-hbase.jar "FromTable" "ToTable"
"broker1:9092,broker2:9092"
On Tue, May 3, 2016 at 8:20 PM, Colin Kincaid Williams wrote:
> Thanks Cody, I can see that the partitions are well distributed...
> Then
Hey Cody, thanks for the response. I looked at the connection as a possibility
based on your advice, and after a lot of digging found a couple of things
mentioned on SO and the Kafka lists about name resolution causing issues. I
created an entry in /etc/hosts on the Spark host to resolve the broker to
In 2-class problems, when I use SVM or RandomForest models to do
classification, they predict "0" or "1".
And when I use ROC to evaluate the model, sometimes I need the probability
that a record belongs to "0" or "1".
In scikit-learn, every model can do "predict" and "predict_proba", where the
latter one
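In the DataFrame-based spark.ml API this is exposed directly: transform()
adds a "probability" vector column next to the 0/1 "prediction", which is
the analogue of scikit-learn's predict_proba (the RDD-based mllib SVM
only gives raw margins via clearThreshold(), not calibrated
probabilities). A small self-contained sketch with toy data:

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.VectorAssembler
  import org.apache.spark.sql.SparkSession

  object PredictProba {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().master("local[1]").appName("predict-proba").getOrCreate()
      import spark.implicits._

      // Toy 2-class data set.
      val df = Seq((0.0, 1.0, 0.2), (1.0, 0.1, 0.9), (0.0, 0.8, 0.3), (1.0, 0.2, 0.7))
        .toDF("label", "f1", "f2")
      val assembled = new VectorAssembler()
        .setInputCols(Array("f1", "f2")).setOutputCol("features")
        .transform(df)

      // RandomForestClassifier and the other ml classifiers expose the same columns.
      val model = new LogisticRegression()
        .setLabelCol("label").setFeaturesCol("features")
        .fit(assembled)

      // "probability" holds the per-class probabilities, "prediction" the 0/1 label.
      model.transform(assembled).select("label", "prediction", "probability").show(false)
      spark.stop()
    }
  }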
tributing across partitions evenly).
>
> On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams wrote:
>> Thanks again Cody. Regarding the details 66 kafka partitions on 3
>> kafka servers, likely 8 core systems with 10 disks each. Maybe the
>> issue with the receiver was the large n
> Really though, I'd try to start with spark 1.6 and direct streams, or
> even just kafkacat, as a baseline.
>
>
>
> On Mon, May 2, 2016 at 7:01 PM, Colin Kincaid Williams wrote:
>> Hello again. I searched for "backport kafka" in the list archives but
ing with 1.3. If you're stuck
> on 1.2, I believe there have been some attempts to backport it, search
> the mailing list archives.
>
> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams
> wrote:
>> I've written an application to get content from a kafka topic w
m the docs. Then maybe I should try creating multiple streams to
get more throughput?
Thanks,
Colin Williams
On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger wrote:
> Have you tested for read throughput (without writing to hbase, just
> deserialize)?
>
> Are you limited to using
However I should give it a shot if I abandon Spark 1.2, and my current
environment.
Thanks,
Colin Williams
On Mon, May 2, 2016 at 6:06 PM, Krieg, David
wrote:
> Spark 1.2 is a little old and busted. I think most of the advice you'll get is
> to try to use Spark 1.3 at least, which i
I've written an application to get content from a Kafka topic with 1.7
billion entries, get the protobuf-serialized entries, and insert them into
HBase. Currently the environment that I'm running in is Spark 1.2.
With 8 executors and 2 cores, and 2 jobs, I'm only getting between
0-2500 writes / second
Hi all,
I've implemented most of a content recommendation system for a client.
However, whenever I attempt to save a MatrixFactorizationModel I've
trained, I see one of four outcomes:
1. Despite "save" being wrapped in a "try" block, I see a massive stack
trace quoting some java.io classes. The M
their feature
matrices be merged? For instance via:
1. Adding feature vectors together for user/product vectors that appear in
both models
2. Averaging said vectors instead
3. Some other linear algebra operation
Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS
itself. Is what I'm asking possible?
Thank you,
Colin
I launch around 30-60 of these jobs defined like start-job.sh in the
background from a wrapper script. I wait about 30 seconds between launches,
then the wrapper monitors yarn to determine when to launch more. There is a
limit defined at around 60 jobs, but even if I set it to 30, I run out of
memo
y laptop (4g of RAM, Arch Linux) when
I try to build with Scala 2.11 support. No matter how I tweak JVM flags to
reduce maximum RAM use, the build always crashes.
When trying to build Spark 1.6.0 for Scala 2.10 just now, the build had
compilation errors. Here is one, as a sample. I've saved
e is that there is an order of magnitude difference
between the count of the join DataFrame and the persisted join DataFrame.
Secondly, persisting the same DataFrame into 2 different formats yields
different results.
Does anyone have any idea on what could be going on here?
--
Colin Alstad
NAMESPACE file. This is obviously
due to the ' missing in the roxygen2 directives.
Is this intentional?
Thanks
Colin
kException: Job aborted due to stage failure: Task 5
in stage 31.0 failed 1 times, most recent failure: Lost task 5.0 in stage
31.0 (TID 95, localhost): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
File "/Users/colin/src/spark/python/lib/pyspark.zip/py
Thanks! :)
Colin McQueen
Software Developer
On Thu, Mar 12, 2015 at 3:05 PM, Jeffrey Jedele
wrote:
> Hi Colin,
> my understanding is that this is currently not possible with KafkaUtils.
> You would have to write a custom receiver using Kafka's SimpleConsumer API.
>
> http
he info in one place.
>
> On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams
> wrote:
>
>> Looks like in my tired state, I didn't mention spark the whole time.
>> However, it might be implied by the application log above. Spark log
>> aggregation appears to b
your
yarn history server? If so can you share any spark settings?
On Tue, Feb 24, 2015 at 8:48 AM, Christophe Préaud <
christophe.pre...@kelkoo.com> wrote:
> Hi Colin,
>
> Here is how I have configured my hadoop cluster to have yarn logs
> available through both the yarn CLI a
Hi,
I have been trying to get my yarn logs to display in the spark
history-server or yarn history-server. I can see the log information with
yarn logs -applicationId application_1424740955620_0009
15/02/23 22:15:14 INFO client.ConfiguredRMFailoverProxyProvider: Failing
over to us3sm2hbqa04r07-comp-pr
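For reference (not the exact settings from Christophe's cluster, and the
hosts and paths are placeholders), the pieces usually involved are the
Spark event logs for the Spark history server and YARN log aggregation
for the container logs that "yarn logs" prints:

  # spark-defaults.conf
  spark.eventLog.enabled            true
  spark.eventLog.dir                hdfs:///user/spark/applicationHistory
  spark.history.fs.logDirectory     hdfs:///user/spark/applicationHistory
  spark.yarn.historyServer.address  historyserver-host:18080

  # yarn-site.xml (shown here as properties)
  yarn.log-aggregation-enable       true
  yarn.log.server.url               http://jobhistory-host:19888/jobhistory/logs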
call for short-circuit reads on
cdh5. Part of the reason that hasn't been implemented yet is that one
of the main advantages of short-circuit is reduced CPU consumption,
and we felt spawning more threads might cut into that. We could
implement it pretty easily if people wanted it, but th