Hi,
I have a continuous REST API stream which keeps emitting data in the form of JSON.
I access the stream using Python's requests.get(url, stream=True, headers=headers).
I want to receive the records in Spark and do further processing, but I am not sure
which is the best way to receive them in Spark.
What are
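One possible way to receive such a stream (sketched here in Scala, since most snippets in this digest are Scala; the endpoint URL is hypothetical, and this is only one option next to, say, pushing the records onto Kafka first) is a custom Spark Streaming Receiver that reads the HTTP response line by line:

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal sketch: keep the streaming HTTP connection open and store each
// line (one JSON record per line is assumed) as an element of the DStream.
class HttpStreamReceiver(url: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("http-stream-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {}  // the reading thread exits once isStopped() is true

  private def receive(): Unit = {
    val reader = new BufferedReader(
      new InputStreamReader(new URL(url).openStream(), "UTF-8"))
    try {
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)                 // hand one JSON record to Spark
        line = reader.readLine()
      }
    } finally {
      reader.close()
      restart("Stream ended, reconnecting")
    }
  }
}

// Usage: ssc.receiverStream(new HttpStreamReceiver("http://example.com/stream")),
// then parse the JSON with your library of choice and process as usual.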
Dears,
How can I download Spark 1.2.1? I tried the Spark site, but version 1.2.1
is not available there.
I need Spark 1.2.1 for CCA-175 exam.
--
*Regards,*
*Irfan Sayyed*
Unsubscribe
On Oct 12, 2016, 11:26 PM, "Reynold Xin" wrote:
I took a look at all the public APIs we expose in o.a.spark.sql tonight,
and realized we still have a large number of APIs that are marked
experimental. Most of these haven't really changed, except in 2.0 we merged
DataFrame and Dataset. I think it's long overdue to mark them stable.
I'm tracking
Hi,
We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2 (CentOS 7.2). We
wrote some new code using the Spark DataFrame/Dataset APIs, but are
noticing incorrect results on a join after writing and then reading data to
Windows Azure Storage Blobs (the default HDFS location). I've been abl
Look at the presentation and blog post linked from
https://github.com/koeninger/kafka-exactly-once
They refer to the Kafka 0.8 version of the direct stream, but the basic
idea is the same.
On Wed, Oct 12, 2016 at 7:35 PM, Haopu Wang wrote:
> Cody, thanks for the response.
>
>
>
> So Kafka direct
Cody, thanks for the response.
So the Kafka direct stream actually has a consumer on both the driver and the executors?
Can you please provide more details? Thank you very much!
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: October 12, 2016, 20:10
To: Haopu Wang
Hi,
I was able to resolve the issue by increasing the timeout, reducing the
number of executors, and increasing the number of cores per executor.
The issue is resolved, but I am still not sure why reducing the number of
executors and increasing the number of cores per executor fixed issues related
to
That fixed it! I still had the serializer registered as a workaround
for SPARK-12591.
Thanks so much for your help, Ryan!
-Joey
On Wed, Oct 12, 2016 at 2:16 PM, Shixiong(Ryan) Zhu
wrote:
> Oh, OpenHashMapBasedStateMap is serialized using Kryo's
> "com.esotericsoftware.kryo.serializers.JavaSeria
Yes Neil, I use Windows 8.1.
Oh, OpenHashMapBasedStateMap is serialized using Kryo's
"com.esotericsoftware.kryo.serializers.JavaSerializer". Did you set it for
OpenHashMapBasedStateMap? You don't need to set anything for Spark's
classes in 1.6.2.
On Wed, Oct 12, 2016 at 7:11 AM, Joey Echeverria wrote:
> I tried with 1.6.2
Are you using Windows? Switching over to a Linux environment made that error go
away for me.
I am getting excessive memory-leak warnings when running multiple mappings and
aggregations using Datasets. Is there anything I should be looking at
to resolve this, or is this a known issue?
WARN [Executor task launch worker-0]
org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory f
See https://issues.apache.org/jira/browse/SPARK-17588
On Wed, Oct 12, 2016 at 9:07 PM Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:
> If I drop the last feature on the third model, the error seems to go away.
>
> On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath <
> mee...@services
If I drop the last feature on the third model, the error seems to go away.
On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:
> Hello,
>
> I have some code trying to compare linear regression coefficients with
> three sets of features, as shown below. On th
Hello,
I have some code trying to compare linear regression coefficients with
three sets of features, as shown below. On the third one, I get an
assertion error.
This is the code:
object MultipleRegression extends App {
  val spark = SparkSession.builder().appName("Regression Model Builder").
Hello,
Does anyone have examples of doing Matrix operations (multiplication,
transpose, inverse etc) using the Spark ML API?
Many thanks
--
*Meeraj Kunnumpurath*
*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*
*00 971 50 409 0169*
*mee...@servicesymphony.com*
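For what it's worth, as of Spark 2.0 the distributed matrix types still live in the older mllib package rather than in ml. A minimal sketch of multiplication and transpose with BlockMatrix is below (an inverse is not provided by the distributed API, so it would need a local matrix plus e.g. Breeze; the matrix values are made up, and sc is assumed to be an existing SparkContext):

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Build two small distributed matrices from (row, col, value) entries.
// sc is an existing SparkContext.
val entriesA = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 0, 5.0), MatrixEntry(1, 1, 6.0)))

val a: BlockMatrix = new CoordinateMatrix(entriesA).toBlockMatrix().cache()
val b: BlockMatrix = new CoordinateMatrix(entriesB).toBlockMatrix().cache()

val product    = a.multiply(b)       // distributed matrix multiplication
val transposed = a.transpose         // distributed transpose

println(product.toLocalMatrix())     // only collect small results to the driver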
If using RDDs, you can use saveAsHadoopFile or saveAsNewAPIHadoopFile
with a conf passed in that overrides the keys you need.
For example, you can do:
val saveConf = new Configuration(sc.hadoopConfiguration)
// configure saveConf with overridden s3 config
rdd.saveAsNewAPIHadoopFile(..., conf
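A slightly fuller sketch of that pattern (the bucket name, credentials and output classes below are placeholders, and it assumes rdd is an RDD[String]):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Copy the job-wide Hadoop configuration and override only the keys needed
// for the write, leaving the configuration used for reads untouched.
val saveConf = new Configuration(sc.hadoopConfiguration)
saveConf.set("fs.s3n.awsAccessKeyId", "<write-access-key>")
saveConf.set("fs.s3n.awsSecretAccessKey", "<write-secret-key>")

rdd.map(line => (NullWritable.get(), new Text(line)))
  .saveAsNewAPIHadoopFile(
    "s3n://output-bucket/path",      // placeholder destination
    classOf[NullWritable],
    classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]],
    saveConf)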
Hi folks,
I am a complete beginner with Spark and the Python environment.
I have installed Spark and would like to run a code snippet named
"SpyderSetupForSpark.py":
# -*- coding: utf-8 -*-
"""
Make sure you give execute privileges
--
This is what I do at the moment,
def build(path: String, spark: SparkSession) = {
  val toDouble = udf((x: String) => x.toDouble)
  val df = spark.read.
    option("header", "true").
    csv(path).
    withColumn("sqft_living", toDouble('sqft_living)).
    withColumn("price", toDouble('price)).
Hello,
How do I write a UDF that operates on two columns? For example, how do I
introduce a new column which is the product of two columns already on the
DataFrame?
Many thanks
Meeraj
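A minimal sketch of one way to do this (df and the column names are hypothetical): define a two-argument udf, or use the built-in column arithmetic if a plain product is all that's needed.

import org.apache.spark.sql.functions.{col, udf}

// A UDF taking two columns and returning their product.
val product = udf((a: Double, b: Double) => a * b)

val withProduct = df.withColumn("total", product(col("price"), col("sqft_living")))

// For a simple arithmetic product the built-in operators avoid the UDF entirely:
// df.withColumn("total", col("price") * col("sqft_living"))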
Cool, just wanted to make sure.
To answer your question about
> Isn't "spark.streaming.backpressure.initialRate" supposed to do this?
that configuration was added well after the integration of the direct
stream with the backpressure code, and was added only to the receiver
code, which the direct
Hi,
I'd like a specific job to fail if there's another instance of it already
running on the cluster (Spark Standalone in my case).
How can I achieve this?
Thank you.
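One way people have approached this with a standalone master (sketch below; the master host and application name are hypothetical, and the check relies on the master web UI's JSON view, so verify the endpoint and field names against your Spark version) is to look up the list of active applications before starting work and exit if an app with the same name is already registered:

import scala.io.Source

// Crude sketch: fetch the standalone master's JSON status page (default UI
// port 8080) and bail out if the application name already appears in it.
// A real implementation should parse the "activeapps" array properly.
def alreadyRunning(masterUiHost: String, appName: String): Boolean = {
  val json = Source.fromURL(s"http://$masterUiHost:8080/json").mkString
  json.contains("\"" + appName + "\"")
}

if (alreadyRunning("spark-master.example.com", "my-batch-job")) {
  System.err.println("Another instance of my-batch-job is already running; exiting.")
  sys.exit(1)
}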
I am 100% sure.
println(conf.get("spark.streaming.backpressure.enabled")) prints true.
On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sure, did you set
spark.streaming.backpressure.enabled
to true?
On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane wrote:
On 10/12/2016 04:40
What I am trying to achieve is:
trigger a query to get numbers (i.e., 1, 2, 3, ... n);
for every number, I have to trigger another 3 queries.
Thanks,
selvam R
On Wed, Oct 12, 2016 at 4:10 PM, Selvam Raman wrote:
> Hi ,
>
> I am reading parquet file and creating temp table. when i am trying to
> execute q
Hi,
I am reading a Parquet file and creating a temp table. When I execute the
query outside of the foreach function it works fine, but it throws a
NullPointerException within the DataFrame.foreach function.
Code snippet:
String CITATION_QUERY = "select c.citation_num, c.title, c.publisher from
test
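The usual cause of that NullPointerException is that the SparkSession and DataFrames can only be used on the driver, not inside foreach, which runs on the executors. A common workaround is to collect the small outer result to the driver and loop over it there; a minimal sketch follows (hypothetical table and column names, written in Scala like the other snippets here even though the original code is Java, and assuming spark is the existing SparkSession):

// Collect the small outer result to the driver first; the follow-up queries
// are then issued from the driver, where the SparkSession is available.
val numbers = spark.sql("select citation_num from citations")   // hypothetical outer query
  .collect()
  .map(_.getAs[Int]("citation_num"))

numbers.foreach { n =>
  // each follow-up query still runs as a normal distributed job
  spark.sql(s"select title, publisher from citations where citation_num = $n").show()
}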
Hi all,
I am using Spark 2.0 to read a CSV file into a Dataset in Java. This works
fine if I define the StructType with the StructField array ordered by hand.
What I would like to do is use a bean class for both the schema and Dataset row
type. For example,
Dataset beanDS = spark.read().sch
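For reference, a minimal sketch of deriving the schema from a class instead of hand-building the StructField array (shown in Scala with a case class and Encoders.product to stay consistent with the other snippets; in Java the equivalent would be a bean class with Encoders.bean. The class, file path and columns are hypothetical, and the derived field order must match the column order in the CSV file):

import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Int)   // hypothetical record type

val spark = SparkSession.builder().appName("CsvAsBean").getOrCreate()

val enc = Encoders.product[Person]
val people = spark.read
  .option("header", "true")
  .schema(enc.schema)     // schema derived from the class rather than written by hand
  .csv("people.csv")      // hypothetical path
  .as[Person](enc)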
How would backpressure know anything about the capacity of your system
on the very first batch?
You should be able to set maxRatePerPartition at a value that makes
sure your first batch doesn't blow things up, and let backpressure
scale from there.
On Wed, Oct 12, 2016 at 8:53 AM, Samy Dindane w
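For reference, a minimal sketch of combining the two settings discussed above (the cap value is hypothetical):

import org.apache.spark.SparkConf

// Cap the very first batches with maxRatePerPartition and let the
// backpressure estimator adjust the rate from there.
val conf = new SparkConf()
  .setAppName("kafka-direct-stream")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // records/sec/partition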
Yes, partitionBy will shuffle unless it happens to be partitioning
with the exact same partitioner the parent rdd had.
On Wed, Oct 12, 2016 at 8:34 AM, Samy Dindane wrote:
> Hey Cody,
>
> I ended up choosing a different way to do things, which is using Kafka to
> commit my offsets.
> It works fin
I tried with 1.6.2 and saw the same behavior.
-Joey
On Tue, Oct 11, 2016 at 5:18 PM, Shixiong(Ryan) Zhu
wrote:
> There are some known issues in 1.6.0, e.g.,
> https://issues.apache.org/jira/browse/SPARK-12591
>
> Could you try 1.6.1?
>
> On Tue, Oct 11, 2016 at 9:55 AM, Joey Echeverria wrote:
>
That's what I was looking for, thank you.
Unfortunately, none of
* spark.streaming.backpressure.initialRate
* spark.streaming.backpressure.enabled
* spark.streaming.receiver.maxRate
* spark.streaming.receiver.initialRate
changes how many records I get (I tried many different combinations).
The
Hey Cody,
I ended up choosing a different way to do things, which is using Kafka to
commit my offsets.
It works fine, except it stores the offset in ZK instead of a Kafka topic
(investigating this right now).
I understand your explanations, thank you, but I have one question: when you say
"Re
It's set to none for the executors, because otherwise they won't do exactly
what the driver told them to do.
You should be able to set up the driver consumer to determine batches
however you want, though.
On Wednesday, October 12, 2016, Haopu Wang wrote:
> Hi,
>
>
>
> I want to read the existing
Which Spark version?
Are you using RDDs? Or datasets?
What type are the features? If string how large?
Is it spark standalone?
How do you train/configure the algorithm? How do you initially parse the data?
The standard driver and executor logs could be helpful.
> On 12 Oct 2016, at 09:24, 陈哲
On 12 Oct 2016, at 10:49, Aseem Bansal <asmbans...@gmail.com> wrote:
Hi
I want to read CSV from one bucket, do some processing and write to a different
bucket. I know the way to set S3 credentials using
jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
jssc.hadoo
Hi
We faced a similar issue. Our solution was to load the model, convert it to
an mllib model, cache it, and then use that instead of the ml model.
On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen wrote:
> I don't believe it will ever scale to spin up a whole distributed job to
> serve one reque
Hi
I want to read CSV from one bucket, do some processing and write to a
different bucket. I know the way to set S3 credentials using
jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
But the prob
Hi,
I want to read the existing Kafka messages and then subscribe to new stream
messages.
But I find the "auto.offset.reset" property is always set to "none" in
KafkaUtils. Does that mean I cannot specify the "earliest" property value
when creating a direct stream?
Thank you!
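For reference, a minimal sketch of passing "earliest" in the driver-side consumer parameters of the 0.10 direct stream (the broker address, group id and topic are hypothetical, and ssc is assumed to be an existing StreamingContext); as the reply elsewhere in this digest explains, the executor consumers are still forced to "none" internally:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Driver-side consumer parameters; "earliest" makes the stream start from the
// beginning of each partition when the group has no committed offsets yet.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",          // hypothetical
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                        // hypothetical
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                             // existing StreamingContext
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams))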
Hi
I'm using Spark ML to train a RandomForest model. There are over 200,000
lines in the training data file and about 100 features. I'm running Spark
in local mode with JAVA_OPTS like -Xms1024m -Xmx10296m -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps, but OOM errors keep coming out.
I t