Re: Why this code is errorfull

2017-12-13 Thread Soheil Pourbafrani
Also, I should mention that the `stream` DStream definition is: JavaInputDStream<...> stream = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.Subscribe(TOPIC, kafkaParams));
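For reference, a minimal Scala sketch of an equivalent direct-stream setup with the spark-streaming-kafka-0-10 API; the deserializers, broker address, group id and topic name are placeholders, since the original post does not show them, and ssc is an existing StreamingContext:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Placeholder Kafka settings.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group"
    )
    val topics = Array("TOPIC")

    val stream: InputDStream[ConsumerRecord[String, String]] =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        PreferConsistent,
        Subscribe[String, String](topics, kafkaParams)
      )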

Why this code is errorfull

2017-12-13 Thread Soheil Pourbafrani
The following code is in Spark Streaming: JavaInputDStream results = stream.map(record -> SparkTest.getTime(record.value()) + ":" + Long.toString(System.currentTimeMillis()) + ":" + Arrays.deepToString(SparkTest.finall(record.value())) + ":" +
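If the snippet is complete as shown, one likely cause of the compile error is the result type: map() on an input DStream returns a plain mapped DStream, so the result cannot be declared as a JavaInputDStream. A hedged Scala sketch of that point, reusing the names from the post (`stream`, SparkTest.getTime and SparkTest.finall are assumed to exist as in the original code):

    import java.util.Arrays
    import org.apache.spark.streaming.dstream.DStream

    // Mapping the Kafka input stream yields a plain DStream[String];
    // that is the type the result should be declared as.
    val results: DStream[String] = stream.map { record =>
      SparkTest.getTime(record.value()) + ":" +
        System.currentTimeMillis().toString + ":" +
        Arrays.deepToString(SparkTest.finall(record.value()))
    }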

bulk upsert data batch from Kafka dstream into Postgres db

2017-12-13 Thread salemi
Hi all, we are consuming messages from Kafka using a Spark DStream. Once the processing is done, we would like to update/insert the data in bulk into the database. I was wondering what the best solution for this might be. Our Postgres database table is not partitioned. Thank you, Ali
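One common pattern (a sketch, not taken from this thread) is to write each partition in a single JDBC batch and let Postgres do the upsert with INSERT ... ON CONFLICT; the table name, columns, credentials and the element type of the stream are all placeholders:

    import java.sql.DriverManager
    import org.apache.spark.streaming.dstream.DStream

    // Assumes a DStream of (id, payload) pairs; adjust to the real record type.
    def upsertStream(dstream: DStream[(Long, String)]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // One connection and one batched statement per partition, not per row.
          val conn = DriverManager.getConnection(
            "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
          conn.setAutoCommit(false)
          val stmt = conn.prepareStatement(
            "INSERT INTO events (id, payload) VALUES (?, ?) " +
              "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")
          try {
            partition.foreach { case (id, payload) =>
              stmt.setLong(1, id)
              stmt.setString(2, payload)
              stmt.addBatch()
            }
            stmt.executeBatch()
            conn.commit()
          } finally {
            stmt.close()
            conn.close()
          }
        }
      }
    }

For very large batches, writing into a staging table with COPY and merging from there is another option.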

spark streaming with flume: cannot assign requested address error

2017-12-13 Thread Junfeng Chen
I am trying to connect Spark Streaming with Flume in pull mode. I have three machines, and each one runs Spark and a Flume agent at the same time; they are master, slave1, and slave2. I have set the Flume sink to slave1 on port 6689. Telnet to slave1 6689 from the other two machines works well. In my code, I
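For reference, a minimal sketch of the pull-based receiver described above (spark-streaming-flume); ssc is an existing StreamingContext, and the host/port must match the Spark sink configured in the Flume agent on slave1:

    import org.apache.spark.streaming.flume.FlumeUtils

    // Polling (pull) mode: Spark connects to the Flume SparkSink on slave1:6689.
    val flumeStream = FlumeUtils.createPollingStream(ssc, "slave1", 6689)
    flumeStream
      .map(event => new String(event.event.getBody.array()))
      .print()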

How to control logging in testing package com.holdenkarau.spark.testing.

2017-12-13 Thread Marco Mistroni
Hi all, could anyone advise on how to control logging in com.holdenkarau.spark.testing? There are loads of Spark logging statements every time I run a test. I tried to disable Spark logging using the statements below, but with no success: import org.apache.log4j.Logger import
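One approach that usually works, sketched below (not specific to spark-testing-base): raise the log4j level for the noisy packages before the SparkContext is created, for example in the suite's beforeAll(); a log4j.properties under src/test/resources achieves the same.

    import org.apache.log4j.{Level, Logger}

    // Silence Spark's INFO chatter during tests; the package names are the
    // usual noisy ones and can be extended as needed.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.spark_project").setLevel(Level.WARN)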

is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-13 Thread kant kodali
Hi all, I have messages in a queue that might be coming in with a few different schemas, e.g. msg1 schema1, msg2 schema2, msg3 schema3, msg4 schema1. I want to put all of this in one DataFrame. Is it possible with Structured Streaming? I am using Spark 2.2.0. Thanks!
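A hedged sketch of one way this is often handled: read the raw Kafka values once, parse them against each schema with from_json, project a common set of columns, and union the streaming DataFrames (union of streaming Datasets is supported in 2.2.0). The topic, schemas and column names below are made up:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("multi-schema").getOrCreate()
    import spark.implicits._

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "mytopic")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Two made-up schemas sharing an `id` field.
    val schema1 = new StructType().add("id", LongType).add("a", StringType)
    val schema2 = new StructType().add("id", LongType).add("b", DoubleType)

    // Parse with each schema and keep a common projection before the union.
    val parsed1 = raw.select(from_json($"json", schema1).alias("m")).select($"m.id")
    val parsed2 = raw.select(from_json($"json", schema2).alias("m")).select($"m.id")

    val all = parsed1.union(parsed2)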

Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Scott Reynolds
The train method is on the companion object: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans$ Here is a decent resource on companion object usage: https://docs.scala-lang.org/tour/singleton-objects.html
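A minimal sketch of calling train on the KMeans companion object (the RDD-based mllib API); the data and parameters below are made up, and sc is an existing SparkContext:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Toy data: two obvious clusters.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ))

    // train is defined on the companion object, not on a KMeans instance.
    val model = KMeans.train(data, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println)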

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread sanat kumar Patnaik
It should be within your yarn-site.xml config file. The parameter name is yarn.resourcemanager.am.max-attempts. The directory should be /usr/lib/spark/conf/yarn-conf. Try to find this directory on your gateway node if you are using a Cloudera distribution.
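As a per-application alternative to the global YARN setting above, Spark also honours spark.yarn.maxAppAttempts (it must not exceed the YARN maximum); a small sketch with a made-up application name, equally settable via --conf on spark-submit:

    import org.apache.spark.SparkConf

    // Cap attempts for this application only; YARN's
    // yarn.resourcemanager.am.max-attempts remains the global upper bound.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.yarn.maxAppAttempts", "1")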

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Subhash Sriram
There are some more properties specifically for YARN here: http://spark.apache.org/docs/latest/running-on-yarn.html Thanks, Subhash On Wed, Dec 13, 2017 at 2:32 PM, Subhash Sriram wrote: > http://spark.apache.org/docs/latest/configuration.html

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Subhash Sriram
http://spark.apache.org/docs/latest/configuration.html On Wed, Dec 13, 2017 at 2:31 PM, Toy wrote: > Hi, > > Can you point me to the config for that please?

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Toy
Hi, Can you point me to the config for that please? On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin wrote: > On Wed, Dec 13, 2017 at 11:21 AM, Toy wrote: > > I'm wondering why am I seeing 5 attempts for my Spark application? Does Spark application

Re: Why do I see five attempts on my Spark application

2017-12-13 Thread Marcelo Vanzin
On Wed, Dec 13, 2017 at 11:21 AM, Toy wrote: > I'm wondering why am I seeing 5 attempts for my Spark application? Does Spark application restart itself? It restarts itself if it fails (up to a limit that can be configured either per Spark application or globally in

Why do I see five attempts on my Spark application

2017-12-13 Thread Toy
Hi, I'm wondering why I am seeing 5 attempts for my Spark application. Does the Spark application restart itself? [image: Screen Shot 2017-12-13 at 2.18.03 PM.png]

Different behaviour when querying a spark DataFrame from dynamodb

2017-12-13 Thread Bogdan Cojocar
I am reading some data into a DataFrame from a DynamoDB table: val data = spark.read.dynamodb("table"); data.filter($"field1".like("%hello%")).createOrReplaceTempView("temp"); spark.sql("select * from temp").show() When I run the last statement, I get results. If, however, I try to do:
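The message is cut off, so the failing variant is unknown; as a guess, the sketch below contrasts the working form with the other common way of expressing the same query, i.e. registering the raw table and pushing the LIKE predicate into SQL (the dynamodb connector import, spark.implicits._ and the table name are assumed as in the post):

    // Form reported to work: filter via the DataFrame API, then register the view.
    val data = spark.read.dynamodb("table")
    data.filter($"field1".like("%hello%")).createOrReplaceTempView("temp")
    spark.sql("select * from temp").show()

    // A guess at the failing form: register the unfiltered table and filter in SQL.
    data.createOrReplaceTempView("raw")
    spark.sql("select * from raw where field1 like '%hello%'").show()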

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi, I just came across this while looking at the docs on how to use Spark's KMeans clustering. Note: this appears to be true in both the 2.1 and 2.2 documentation. The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means Here, the example contains the following line:

Re: Spark Streaming with Confluent

2017-12-13 Thread Gerard Maas
Hi Arkadiusz, Try 'rooting' your import. It looks like the import is being interpreted as relative to the previous one. 'Rooting' is done by adding the '_root_' prefix to your import: import org.apache.spark.streaming.kafka.KafkaUtils import
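Concretely, the suggested fix looks like this (sketch):

    // Anchoring the import at the root package prevents it from being resolved
    // relative to an enclosing package such as org.apache.
    import _root_.org.apache.spark.streaming.kafka.KafkaUtils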

Spark Streaming with Confluent

2017-12-13 Thread Arkadiusz Bicz
Hi, I am trying to test Spark Streaming 2.2.0 with Confluent 3.3.0, and I get a lot of errors during compilation. This is my sbt: lazy val sparkstreaming = (project in file(".")) .settings( name := "sparkstreaming", organization := "org.arek", version := "0.1-SNAPSHOT", scalaVersion
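For comparison, a hedged sketch of the sbt pieces usually needed for this combination; the Scala version and dependency list are assumptions (Spark 2.2.x is built for Scala 2.11, and the Confluent artifacts are only published in Confluent's own Maven repository, which is a frequent source of resolution errors):

    lazy val sparkstreaming = (project in file("."))
      .settings(
        name := "sparkstreaming",
        organization := "org.arek",
        version := "0.1-SNAPSHOT",
        scalaVersion := "2.11.11",
        resolvers += "confluent" at "http://packages.confluent.io/maven/",
        libraryDependencies ++= Seq(
          "org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
          "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
          "io.confluent" % "kafka-avro-serializer" % "3.3.0"
        )
      )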

Determine Cook's distance / influential data points

2017-12-13 Thread Richard Siebeling
Hi, would it be possible to determine Cook's distance using Spark? Thanks, Richard
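For reference, the standard definition that any implementation would need to compute per observation (e_i = residual, h_ii = leverage from the hat matrix, p = number of model parameters, s^2 = mean squared error):

    D_i = \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}

The residuals are available from a fitted linear regression model in Spark; the leverages h_ii require the hat-matrix diagonal, which is the part that needs custom work.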

Re: Spark loads data from HDFS or S3

2017-12-13 Thread Jörn Franke
S3 can be operated more cheaply than HDFS on Amazon. As you correctly describe, it does not support data locality; the data is distributed to the workers. Depending on your use case, it can make sense to have HDFS as a temporary “cache” for S3 data.

Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
> When Spark loads data from S3 (sc.textFile('s3://...')), how will all the data be spread over the workers? The data is read by the workers. Just make sure that the data is splittable, either by using a splittable format or by passing a list of files, e.g. sc.textFile('s3://.../*.txt'), to achieve full parallelism.
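A small sketch of that advice; the bucket, path and partition count are placeholders, and sc is an existing SparkContext:

    // A glob (or a comma-separated list of paths) lets every worker read a share
    // of the files; minPartitions can raise the parallelism further.
    val lines = sc.textFile("s3://my-bucket/data/*.txt", minPartitions = 64)
    println(lines.getNumPartitions)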

Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi, I have a few questions about the structure of HDFS and S3 when Spark loads data from these two storage systems. Generally, when Spark loads data from HDFS, HDFS supports data locality and already has the files distributed across datanodes, right? Spark can then just process the data on the workers. What about S3?