Hi,
I have a continuous REST API stream which keeps emitting data in the form of JSON.
I access the stream using Python's requests.get(url, stream=True, headers=headers).
I want to receive the records in Spark and do further processing, but I am not sure
which is the best way to receive them in Spark.
What are
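One possible way to receive such a stream (sketched here in Scala, since most snippets in this digest are Scala; the endpoint URL is hypothetical, and this is only one option next to, say, pushing the records onto Kafka first) is a custom Spark Streaming Receiver that reads the HTTP response line by line:

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal sketch: keep the streaming HTTP connection open and store each
// line (one JSON record per line is assumed) as an element of the DStream.
class HttpStreamReceiver(url: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("http-stream-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {}  // the reading thread exits once isStopped() is true

  private def receive(): Unit = {
    val reader = new BufferedReader(
      new InputStreamReader(new URL(url).openStream(), "UTF-8"))
    try {
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)                 // hand one JSON record to Spark
        line = reader.readLine()
      }
    } finally {
      reader.close()
      restart("Stream ended, reconnecting")
    }
  }
}

// Usage: ssc.receiverStream(new HttpStreamReceiver("http://example.com/stream")),
// then parse the JSON with your library of choice and process as usual.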
Dears,
How can I download Spark 1.2.1? I tried the Spark site, but version 1.2.1
is not available there.
I need Spark 1.2.1 for CCA-175 exam.
--
*Regards,*
*Irfan Sayyed*
Unsubscribe
On Oct 12, 2016, 11:26 PM, "Reynold Xin" wrote:
I took a look at all the public APIs we expose in o.a.spark.sql tonight,
and realized we still have a large number of APIs that are marked
experimental. Most of these haven't really changed, except in 2.0 we merged
DataFrame and Dataset. I think it's long overdue to mark them stable.
I'm tracking
Hi,
We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2 (CentOS 7.2). We
wrote some new code using the Spark DataFrame/Dataset APIs, but are
noticing incorrect results on a join after writing and then reading data to
Windows Azure Storage Blobs (the default HDFS location). I've been abl
Look at the presentation and blog post linked from
https://github.com/koeninger/kafka-exactly-once
They refer to the Kafka 0.8 version of the direct stream, but the basic
idea is the same.
On Wed, Oct 12, 2016 at 7:35 PM, Haopu Wang wrote:
> Cody, thanks for the response.
>
>
>
> So Kafka direct
Cody, thanks for the response.
So the Kafka direct stream actually has a consumer on both the driver and the executors?
Can you please provide more details? Thank you very much!
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: October 12, 2016, 20:10
To: Haopu Wang
Hi,
I was able to resolve the issue by increasing the timeout, reducing the
number of executors, and increasing the number of cores per executor.
The issue is resolved, but I am still not sure why reducing the number of
executors and increasing the number of cores per executor fixed issues related
to
That fixed it! I still had the serializer registered as a workaround
for SPARK-12591.
Thanks so much for your help, Ryan!
-Joey
On Wed, Oct 12, 2016 at 2:16 PM, Shixiong(Ryan) Zhu
wrote:
> Oh, OpenHashMapBasedStateMap is serialized using Kryo's
> "com.esotericsoftware.kryo.serializers.JavaSeria
Yes Neil, I use Windows 8.1.
Oh, OpenHashMapBasedStateMap is serialized using Kryo's
"com.esotericsoftware.kryo.serializers.JavaSerializer". Did you set it for
OpenHashMapBasedStateMap? You don't need to set anything for Spark's
classes in 1.6.2.
On Wed, Oct 12, 2016 at 7:11 AM, Joey Echeverria wrote:
> I tried with 1.6.2
Are you using Windows? Switching over to a Linux environment made that error go
away for me.
I am getting excessive memory-leak warnings when running multiple mappings and
aggregations using Datasets. Is there anything I should be looking at
to resolve this, or is this a known issue?
WARN [Executor task launch worker-0]
org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory f
See https://issues.apache.org/jira/browse/SPARK-17588
On Wed, Oct 12, 2016 at 9:07 PM Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:
> If I drop the last feature on the third model, the error seems to go away.
>
> On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath <
> mee...@services
If I drop the last feature on the third model, the error seems to go away.
On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:
> Hello,
>
> I have some code trying to compare linear regression coefficients with
> three sets of features, as shown below. On th
Hello,
I have some code trying to compare linear regression coefficients with
three sets of features, as shown below. On the third one, I get an
assertion error.
This is the code:
object MultipleRegression extends App {
  val spark = SparkSession.builder().appName("Regression Model Builder").
Hello,
Does anyone have examples of doing Matrix operations (multiplication,
transpose, inverse etc) using the Spark ML API?
Many thanks
--
*Meeraj Kunnumpurath*
*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*
*00 971 50 409 0169*
*mee...@servicesymphony.com*
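For what it's worth, as of Spark 2.0 the distributed matrix types still live in the older mllib package rather than in ml. A minimal sketch of multiplication and transpose with BlockMatrix is below (an inverse is not provided by the distributed API, so it would need a local matrix plus e.g. Breeze; the matrix values are made up, and sc is assumed to be an existing SparkContext):

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Build two small distributed matrices from (row, col, value) entries.
// sc is an existing SparkContext.
val entriesA = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 0, 5.0), MatrixEntry(1, 1, 6.0)))

val a: BlockMatrix = new CoordinateMatrix(entriesA).toBlockMatrix().cache()
val b: BlockMatrix = new CoordinateMatrix(entriesB).toBlockMatrix().cache()

val product    = a.multiply(b)       // distributed matrix multiplication
val transposed = a.transpose         // distributed transpose

println(product.toLocalMatrix())     // only collect small results to the driver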
If using RDDs, you can use saveAsHadoopFile or saveAsNewAPIHadoopFile
with a conf passed in that overrides the keys you need.
For example, you can do:
val saveConf = new Configuration(sc.hadoopConfiguration)
// configure saveConf with overridden s3 config
rdd.saveAsNewAPIHadoopFile(..., conf
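A slightly fuller sketch of that pattern (the bucket name, credentials and output classes below are placeholders, and it assumes rdd is an RDD[String]):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Copy the job-wide Hadoop configuration and override only the keys needed
// for the write, leaving the configuration used for reads untouched.
val saveConf = new Configuration(sc.hadoopConfiguration)
saveConf.set("fs.s3n.awsAccessKeyId", "<write-access-key>")
saveConf.set("fs.s3n.awsSecretAccessKey", "<write-secret-key>")

rdd.map(line => (NullWritable.get(), new Text(line)))
  .saveAsNewAPIHadoopFile(
    "s3n://output-bucket/path",      // placeholder destination
    classOf[NullWritable],
    classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]],
    saveConf)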
Hi folks,
I am a complete beginner with Spark and the Python environment.
I have installed Spark and would like to run a code snippet named
"SpyderSetupForSpark.py":
# -*- coding: utf-8 -*-
"""
Make sure you give execute privileges
--
This is what I do at the moment,
def build(path: String, spark: SparkSession) = {
  val toDouble = udf((x: String) => x.toDouble)
  val df = spark.read.
    option("header", "true").
    csv(path).
    withColumn("sqft_living", toDouble('sqft_living)).
    withColumn("price", toDouble('price)).
Hello,
How do I write a UDF that operates on two columns? For example, how do I
introduce a new column which is the product of two columns already on the
DataFrame?
Many thanks
Meeraj
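A minimal sketch of one way to do this (df and the column names are hypothetical): define a two-argument udf, or use the built-in column arithmetic if a plain product is all that's needed.

import org.apache.spark.sql.functions.{col, udf}

// A UDF taking two columns and returning their product.
val product = udf((a: Double, b: Double) => a * b)

val withProduct = df.withColumn("total", product(col("price"), col("sqft_living")))

// For a simple arithmetic product the built-in operators avoid the UDF entirely:
// df.withColumn("total", col("price") * col("sqft_living"))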
Cool, just wanted to make sure.
To answer your question about
> Isn't "spark.streaming.backpressure.initialRate" supposed to do this?
that configuration was added well after the integration of the direct
stream with the backpressure code, and was added only to the receiver
code, which the direct
Hi,
I'd like a specific job to fail if there's another instance of it already
running on the cluster (Spark Standalone in my case).
How can I achieve this?
Thank you.
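One way people have approached this with a standalone master (sketch below; the master host and application name are hypothetical, and the check relies on the master web UI's JSON view, so verify the endpoint and field names against your Spark version) is to look up the list of active applications before starting work and exit if an app with the same name is already registered:

import scala.io.Source

// Crude sketch: fetch the standalone master's JSON status page (default UI
// port 8080) and bail out if the application name already appears in it.
// A real implementation should parse the "activeapps" array properly.
def alreadyRunning(masterUiHost: String, appName: String): Boolean = {
  val json = Source.fromURL(s"http://$masterUiHost:8080/json").mkString
  json.contains("\"" + appName + "\"")
}

if (alreadyRunning("spark-master.example.com", "my-batch-job")) {
  System.err.println("Another instance of my-batch-job is already running; exiting.")
  sys.exit(1)
}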
I am 100% sure.
println(conf.get("spark.streaming.backpressure.enabled")) prints true.
On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sure, did you set
spark.streaming.backpressure.enabled
to true?
On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane wrote:
On 10/12/2016 04:40
What I am trying to achieve is:
trigger a query to get numbers (i.e., 1, 2, 3, ... n);
for every number, I have to trigger another 3 queries.
Thanks,
selvam R
On Wed, Oct 12, 2016 at 4:10 PM, Selvam Raman wrote:
> Hi ,
>
> I am reading parquet file and creating temp table. when i am trying to
> execute q
Hi,
I am reading a Parquet file and creating a temp table. When I execute the
query outside of the foreach function it works fine, but it throws a
NullPointerException within the DataFrame.foreach function.
Code snippet:
String CITATION_QUERY = "select c.citation_num, c.title, c.publisher from
test
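The usual cause of that NullPointerException is that the SparkSession and DataFrames can only be used on the driver, not inside foreach, which runs on the executors. A common workaround is to collect the small outer result to the driver and loop over it there; a minimal sketch follows (hypothetical table and column names, written in Scala like the other snippets here even though the original code is Java, and assuming spark is the existing SparkSession):

// Collect the small outer result to the driver first; the follow-up queries
// are then issued from the driver, where the SparkSession is available.
val numbers = spark.sql("select citation_num from citations")   // hypothetical outer query
  .collect()
  .map(_.getAs[Int]("citation_num"))

numbers.foreach { n =>
  // each follow-up query still runs as a normal distributed job
  spark.sql(s"select title, publisher from citations where citation_num = $n").show()
}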
Hi all,
I am using Spark 2.0 to read a CSV file into a Dataset in Java. This works
fine if I define the StructType with the StructField array ordered by hand.
What I would like to do is use a bean class for both the schema and Dataset row
type. For example,
Dataset beanDS = spark.read().sch
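For reference, a minimal sketch of deriving the schema from a class instead of hand-building the StructField array (shown in Scala with a case class and Encoders.product to stay consistent with the other snippets; in Java the equivalent would be a bean class with Encoders.bean. The class, file path and columns are hypothetical, and the derived field order must match the column order in the CSV file):

import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Int)   // hypothetical record type

val spark = SparkSession.builder().appName("CsvAsBean").getOrCreate()

val enc = Encoders.product[Person]
val people = spark.read
  .option("header", "true")
  .schema(enc.schema)     // schema derived from the class rather than written by hand
  .csv("people.csv")      // hypothetical path
  .as[Person](enc)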
How would backpressure know anything about the capacity of your system
on the very first batch?
You should be able to set maxRatePerPartition at a value that makes
sure your first batch doesn't blow things up, and let backpressure
scale from there.
On Wed, Oct 12, 2016 at 8:53 AM, Samy Dindane w
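For reference, a minimal sketch of combining the two settings discussed above (the cap value is hypothetical):

import org.apache.spark.SparkConf

// Cap the very first batches with maxRatePerPartition and let the
// backpressure estimator adjust the rate from there.
val conf = new SparkConf()
  .setAppName("kafka-direct-stream")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // records/sec/partition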
Yes, partitionBy will shuffle unless it happens to be partitioning
with the exact same partitioner the parent rdd had.
On Wed, Oct 12, 2016 at 8:34 AM, Samy Dindane wrote:
> Hey Cody,
>
> I ended up choosing a different way to do things, which is using Kafka to
> commit my offsets.
> It works fin
I tried with 1.6.2 and saw the same behavior.
-Joey
On Tue, Oct 11, 2016 at 5:18 PM, Shixiong(Ryan) Zhu
wrote:
> There are some known issues in 1.6.0, e.g.,
> https://issues.apache.org/jira/browse/SPARK-12591
>
> Could you try 1.6.1?
>
> On Tue, Oct 11, 2016 at 9:55 AM, Joey Echeverria wrote:
>
That's what I was looking for, thank you.
Unfortunately, none of
* spark.streaming.backpressure.initialRate
* spark.streaming.backpressure.enabled
* spark.streaming.receiver.maxRate
* spark.streaming.receiver.initialRate
changes how many records I get (I tried many different combinations).
The
Hey Cody,
I ended up choosing a different way to do things, which is using Kafka to
commit my offsets.
It works fine, except it stores the offset in ZK instead of a Kafka topic
(investigating this right now).
I understand your explanations, thank you, but I have one question: when you say
"Re
It's set to none for the executors, because otherwise they won't do exactly
what the driver told them to do.
You should be able to set up the driver consumer to determine batches
however you want, though.
On Wednesday, October 12, 2016, Haopu Wang wrote:
> Hi,
>
>
>
> I want to read the existing
Which Spark version?
Are you using RDDs? Or datasets?
What type are the features? If string how large?
Is it spark standalone?
How do you train/configure the algorithm? How do you initially parse the data?
The standard driver and executor logs could be helpful.
> On 12 Oct 2016, at 09:24, 陈哲
On 12 Oct 2016, at 10:49, Aseem Bansal <asmbans...@gmail.com> wrote:
Hi
I want to read CSV from one bucket, do some processing and write to a different
bucket. I know the way to set S3 credentials using
jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
jssc.hadoo
Hi
We faced a similar issue. Our solution was to load the model, convert it to
an mllib model, cache it, and then use that instead of the ml model.
On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen wrote:
> I don't believe it will ever scale to spin up a whole distributed job to
> serve one reque
Hi
I want to read CSV from one bucket, do some processing and write to a
different bucket. I know the way to set S3 credentials using
jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
But the prob
Hi,
I want to read the existing Kafka messages and then subscribe to new stream
messages.
But I find the "auto.offset.reset" property is always set to "none" in
KafkaUtils. Does that mean I cannot specify the "earliest" property value
when creating a direct stream?
Thank you!
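For reference, a minimal sketch of passing "earliest" in the driver-side consumer parameters of the 0.10 direct stream (the broker address, group id and topic are hypothetical, and ssc is assumed to be an existing StreamingContext); as the reply elsewhere in this digest explains, the executor consumers are still forced to "none" internally:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Driver-side consumer parameters; "earliest" makes the stream start from the
// beginning of each partition when the group has no committed offsets yet.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",          // hypothetical
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",                        // hypothetical
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                             // existing StreamingContext
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams))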
Hi
I'm using Spark ML to train a RandomForest model. There are over 200,000
lines in the training data file and about 100 features. I'm running Spark
in local mode with JAVA_OPTS like -Xms1024m -Xmx10296m -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps, but OOM errors keep coming out.
I t