Re: TallSkinnyQR

2016-11-07 Thread Sean Owen
Rather than post a large section of code, please post a small example of the input matrix and its decomposition, to illustrate what you're saying is out of order. On Tue, Nov 8, 2016 at 3:50 AM im281 wrote: > I am getting the correct rows but they are out of order. Is

Why active tasks is bigger than cores?

2016-11-07 Thread 涂小刚
Hi all, I run a Spark Streaming application, but the UI shows that the number of active tasks is bigger than the number of cores. To my knowledge of Spark, one task occupies one core when "spark.task.cpus" is set to 1. Have I misunderstood something?

Re: spark streaming with kinesis

2016-11-07 Thread Takeshi Yamamuro
I'm not familiar with the Kafka implementation, but a Kinesis receiver runs in a thread on the executors. You can set any value for the interval, but frequent checkpoints cause excess load on DynamoDB. See: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html#kinesis-checkpointing
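
For reference, a minimal sketch of setting the checkpoint interval when creating the receiver, following the linked docs (app/stream names, endpoint, and region are placeholders; a longer interval means less DynamoDB load):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val stream = KinesisUtils.createStream(
      ssc, "myKinesisApp", "myStream",
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST,
      Seconds(60),                       // checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)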

[ANNOUNCE] Announcing Apache Spark 1.6.3

2016-11-07 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.3! This maintenance release includes fixes across several areas of Spark, and we encourage users on the 1.6.x line to upgrade to 1.6.3. Head to the project's download page to download the new version: http://spark.apache.org/downloads.html

TallSkinnyQR

2016-11-07 Thread im281
I am getting the correct rows but they are out of order. Is this a bug or am I doing something wrong? public class CoordinateMatrixDemo { public static void main(String[] args) { //boiler plate needed to run locally SparkConf conf = new
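
For context, a minimal sketch of the relevant mllib API (coordMat is a hypothetical CoordinateMatrix built elsewhere). Note that converting to a RowMatrix drops the row indices, so the rows of Q are not guaranteed to come back in the original order:

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}

    val rowMat: RowMatrix = coordMat.toRowMatrix()   // row indices are discarded here
    val qr = rowMat.tallSkinnyQR(computeQ = true)
    qr.Q.rows.collect().foreach(println)             // order follows partition layout, not row index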

Re: Structured Streaming with Cassandra, Is it supported??

2016-11-07 Thread Tathagata Das
Spark 2.0 supports writing out to files, and you can also run custom foreach code. We haven't yet officially released a Sink API for custom connectors to be implemented against, but hopefully we will be able to do so soon. That said, I will not rule out the possibility of connectors written using internal,
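
For the foreach route, a minimal sketch using Spark 2.0's ForeachWriter (the Cassandra session handling is only stubbed in comments; writeToCassandra is a hypothetical helper):

    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      override def open(partitionId: Long, version: Long): Boolean = {
        // open a Cassandra session here, e.g. via the DataStax Java driver
        true
      }
      override def process(row: Row): Unit = {
        // writeToCassandra(row)  // hypothetical helper issuing the INSERT
      }
      override def close(errorOrNull: Throwable): Unit = {
        // close the session
      }
    }

    streamingDF.writeStream.foreach(writer).start()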

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-07 Thread Tathagata Das
For WAL in Spark to work with HDFS, the HDFS version you are running must support file appends. Contact your HDFS package/installation provider to figure out whether this is supported by your HDFS installation. On Mon, Nov 7, 2016 at 2:04 PM, Arijit wrote: > Hello All, > > >
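
For reference, a minimal sketch of enabling the receiver WAL (checkpoint path is a placeholder), which is what makes the appends requirement relevant:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("wal-demo")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    // the WAL lives under the checkpoint directory, so that HDFS must support appends
    ssc.checkpoint("hdfs:///user/spark/checkpoints")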

Re: Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-07 Thread Ajay Chander
Did anyone use https://www.codatlas.com/github.com/apache/spark/HEAD/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala to interact with secured Hadoop from Spark? Thanks, Ajay On Mon, Nov 7, 2016 at 4:37 PM, Ajay Chander wrote: > > Hi Everyone, > > I am

Re: Structured Streaming with Cassandra, Is it supported??

2016-11-07 Thread shyla deshpande
I am using spark-cassandra-connector_2.11. On Mon, Nov 7, 2016 at 3:33 PM, shyla deshpande wrote: > Hi , > > I am trying to do structured streaming with the wonderful SparkSession, > but cannot save the streaming data to Cassandra. > > If anyone has got this working,

VectorUDT and ml.Vector for SVD

2016-11-07 Thread ganeshkrishnan
I am trying to run an SVD on a dataframe, and I have used the ml TF-IDF, which has created a dataframe. Now, for Singular Value Decomposition, I am trying to use RowMatrix, which takes in an RDD of mllib.Vector, so I have to convert this dataframe with what I assumed was ml.Vector. However the conversion val
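
A minimal sketch of the conversion being attempted (assuming a Spark 2.x DataFrame tfidfDF with an ml.Vector "features" column; k = 20 is arbitrary):

    import org.apache.spark.ml.linalg.{Vector => MLVector}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.sql.Row

    val rows = tfidfDF.select("features").rdd.map {
      case Row(v: MLVector) => Vectors.fromML(v)   // ml.Vector -> mllib.Vector
    }
    val svd = new RowMatrix(rows).computeSVD(20, computeU = true)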

Re: Correct SparkLauncher usage

2016-11-07 Thread Marcelo Vanzin
Then you need to look at your logs to figure out why the child app is not working. "startApplication" will by default redirect the child's output to the parent's logs. On Mon, Nov 7, 2016 at 3:42 PM, Mohammad Tariq wrote: > Hi Marcelo, > > Thank you for the prompt response.

Re: Correct SparkLauncher usage

2016-11-07 Thread Mohammad Tariq
Hi Marcelo, Thank you for the prompt response. I tried adding listeners as well; that didn't work either. Looks like it isn't starting the job at all. Tariq, Mohammad about.me/mti

Re: Correct SparkLauncher usage

2016-11-07 Thread Marcelo Vanzin
On Mon, Nov 7, 2016 at 3:29 PM, Mohammad Tariq wrote: > I have been trying to use SparkLauncher.startApplication() to launch a Spark > app from within java code, but unable to do so. However, same piece of code > is working if I use SparkLauncher.launch(). > > Here are the

Structured Streaming with Cassandra, Is it supported??

2016-11-07 Thread shyla deshpande
Hi, I am trying to do structured streaming with the wonderful SparkSession, but cannot save the streaming data to Cassandra. If anyone has got this working, please help. Thanks

Correct SparkLauncher usage

2016-11-07 Thread Mohammad Tariq
Dear fellow Spark users, I have been trying to use *SparkLauncher.startApplication()* to launch a Spark app from within Java code, but have been unable to do so. However, the same piece of code works if I use *SparkLauncher.launch()*. Here are the corresponding code snippets: *SparkAppHandle handle =
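
A minimal sketch of startApplication() with a state listener (jar path, main class, and master are placeholders), roughly the pattern discussed in this thread:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // placeholder
      .setMainClass("com.example.Main")     // placeholder
      .setMaster("yarn")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })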

RE: Anomalous Spark RDD persistence behavior

2016-11-07 Thread Shreya Agarwal
I don’t think this is correct, unless you are serializing when caching to memory but not when persisting to disk. Can you check? Also, I have seen the behavior where if I have a 100 GB in-memory cache and I use 60 GB to persist something (MEMORY_AND_DISK), then try to persist another

Re: Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-07 Thread Cody Koeninger
I may be misunderstanding, but you need to take each Kafka message and turn it into multiple items in the transformed RDD? So something like (pseudocode): stream.flatMap { message => val items = new ArrayBuffer var parser = null message.split("\n").foreach { line => if // it's a
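
Filled out, that pseudocode might look like the sketch below (Item, Parser, isHeader and makeParser are hypothetical stand-ins for the poster's log format):

    stream.flatMap { message =>
      val items = scala.collection.mutable.ArrayBuffer.empty[Item]
      var parser: Parser = null
      message.split("\n").foreach { line =>
        if (isHeader(line)) {
          parser = makeParser(line)      // the header decides how to interpret what follows
        } else if (parser != null) {
          items += parser.parse(line)    // parse data lines with the current header's parser
        }
      }
      items
    }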

Spark ML - Naive Bayes - how to select Threshold values

2016-11-07 Thread Nirav Patel
A few questions about the `thresholds` parameter. This is what the doc says: "Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p

Does DeserializeToObject mean that a Row is deserialized to Java objects?

2016-11-07 Thread Benyi Wang
Below is my test code using Spark 2.0.1. DeserializeToObject doesn’t appear for filter() but does for map(). Does it mean map() is not a Tungsten operation? case class Event(id: Long) val e1 = Seq(Event(1L), Event(2L)).toDS val e2 = Seq(Event(2L), Event(3L)).toDS e1.filter(e => e.id < 10 && e.id >
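
One way to check is to compare the physical plans of the typed and untyped variants (a sketch; the Column version is an alternative that can stay in the serialized format — look for DeserializeToObject/SerializeFromObject nodes in the output):

    e1.filter(e => e.id < 10 && e.id > 0).explain(true)   // typed lambda filter
    e1.filter($"id" < 10 && $"id" > 0).explain(true)      // Column-expression filter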

Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-07 Thread coolgar
I'm using Apache Spark Streaming with the Kafka direct consumer. The data stream I'm receiving is log data that includes a header with each block of messages. Each DStream can therefore have many blocks of messages, each with its own header. The header is used to know how to interpret the

Anomalous Spark RDD persistence behavior

2016-11-07 Thread Dave Jaffe
I’ve been studying Spark RDD persistence with spark-perf (https://github.com/databricks/spark-perf), especially when the dataset size starts to exceed available memory. I’m running Spark 1.6.0 on YARN with CDH 5.7. I have 10 NodeManager nodes, each with 16 vcores and 32 GB of container memory.

Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-07 Thread Arijit
Hello All, We are using Spark 1.6.2 with WAL enabled and encountering data loss when the following exception/warning happens. We are using HDFS as our checkpoint directory. Questions are: 1. Is this a bug in Spark or an issue with our configuration? Source looks like the following. Which

Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-07 Thread Ajay Chander
Hi Everyone, I am trying to develop a simple codebase on my machine to read data from a secured Hadoop cluster. We have a development cluster which is secured through Kerberos, and I want to run a Spark job from my IntelliJ to read some sample data from the cluster. Has anyone done this before? Can
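
A minimal sketch of logging in from a keytab with the Hadoop UGI API before touching the cluster (principal and keytab path are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")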

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Cody Koeninger
There definitely is Kafka documentation indicating that you should use a different consumer group for logically different subscribers; this is really basic to Kafka: http://kafka.apache.org/documentation#intro_consumers As for your comment that "commit async after each RDD, which is not really
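
For reference, a minimal sketch of the commitAsync pattern from the kafka-0-10 integration under discussion (stream creation elided; each logically separate subscriber gets its own "group.id" in its kafkaParams):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }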

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Ivan von Nagy
With our stream version, we update the offsets for only the partition we are operating on. We even break the partition down into smaller batches and then update the offsets after each batch within the partition. With Spark 1.6 and Kafka 0.8.x this was not an issue, and as Sean pointed out, this is not

Re: Newbie question - Best way to bootstrap with Spark

2016-11-07 Thread Raghav
Thanks a ton, guys. On Sun, Nov 6, 2016 at 4:57 PM, raghav wrote: > I am newbie in the world of big data analytics, and I want to teach myself > Apache Spark, and want to be able to write scripts to tinker with data. > > I have some understanding of Map Reduce but have

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Cody Koeninger
Someone can correct me, but I'm pretty sure Spark dstreams (in general, not just kafka) have been progressing on to the next batch after a given batch aborts for quite some time now. Yet another reason I put offsets in my database transactionally. My jobs throw exceptions if the offset in the DB
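
A minimal sketch of that transactional-offsets pattern (jdbcUrl and the SQL are placeholders; the writes and the offset update share one transaction):

    import java.sql.DriverManager
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr = offsetRanges(TaskContext.get.partitionId)
        val conn = DriverManager.getConnection(jdbcUrl)   // jdbcUrl is a placeholder
        conn.setAutoCommit(false)
        try {
          // 1) write this partition's results
          // 2) update the stored offset for (osr.topic, osr.partition),
          //    throwing if the stored value doesn't match osr.fromOffset
          conn.commit()
        } finally {
          conn.close()
        }
      }
    }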

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-07 Thread Sean McKibben
I've been encountering the same kinds of timeout issues as Ivan, using the same "Kafka Stream" approach, except I'm storing my offsets manually from the driver to ZooKeeper in the Kafka 8 format. I haven't yet implemented the KafkaRDD approach, and therefore don't have the

RE: Out of memory at 60GB free memory.

2016-11-07 Thread Kürşat Kurt
Spark is not running on Mesos, only running in client mode. From: Rodrick Brown [mailto:rodr...@orchardplatform.com] Sent: Monday, November 7, 2016 8:15 PM To: Kürşat Kurt Cc: Sean Owen ; User Subject: Re: Out of memory at 60GB

Re: SparkLauncer 2.0.1 version working incosistently in yarn-client mode

2016-11-07 Thread Marcelo Vanzin
On Sat, Nov 5, 2016 at 2:54 AM, Elkhan Dadashov wrote: > while (appHandle.getState() == null || !appHandle.getState().isFinal()) { > if (appHandle.getState() != null) { > log.info("while: Spark job state is : " + appHandle.getState()); > if

Re: How sensitive is Spark to Swap?

2016-11-07 Thread Sean Owen
Swapping is pretty bad here, especially because a JVM-based application won't even feel the memory pressure and try to GC or shrink the heap when the OS faces memory pressure. It's probably relatively worse than in M/R because Spark uses memory more. Enough grinding in swap will cause tasks to fail due to

Re: Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Michael Segel
Spark inherits its security from the underlying mechanisms in either YARN or Mesos (whichever environment you are launching your cluster/jobs in). That said… there is limited support from Ranger. There are three parts to this… 1) Ranger being called when the job is launched… 2) Ranger being

How sensitive is Spark to Swap?

2016-11-07 Thread Michael Segel
This may seem like a silly question, but it really isn’t. In terms of Map/Reduce, it’s possible to oversubscribe the cluster because there is a lack of sensitivity when the servers swap memory to disk. In terms of HBase, which is very sensitive, swap doesn’t just kill performance, but also can

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Carlo . Allocca
I found it by just googling: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html Thanks. Carlo On 7 Nov 2016, at 17:12, carlo allocca wrote: Hi Masood, Thank you very much for your insight. I am going to scale all my features as you

Re: Out of memory at 60GB free memory.

2016-11-07 Thread Rodrick Brown
You should also set memory overhead i.e. --conf spark.mesos.executor.memoryOverhead=${EXECUTOR_MEM} * .10 On Mon, Nov 7, 2016 at 6:51 AM, Kürşat Kurt wrote: > I understand that i shoud set the executor memory. I tried with the > parameters below but OOM still occures... >
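
As a concrete sketch (the values are illustrative), the same overhead can also be set on the SparkConf:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "20g")
      .set("spark.mesos.executor.memoryOverhead", "2048")   // in MB, roughly 10% of executor memory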

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Carlo . Allocca
Hi Masood, Thank you very much for your insight. I am going to scale all my features as you described. As I am a beginner, is there any paper/book that explains the suggested approaches? I would love to read it. Many Thanks, Best Regards, Carlo On 7 Nov 2016, at 16:27, Masood Krohy

Re: Spark with Ranger

2016-11-07 Thread Jan Hentschel
Hi Mudit, As far as I know, Ranger does not provide security for Spark as a repository itself, but it does for most of the resources Spark can have access to, such as Hive, Kafka, HBase or HDFS. Best, Jan From: Mudit Kumar Date: Monday, November 7, 2016 at 4:23 PM To:

Re: Upgrading to Spark 2.0.1 broke array in parquet DataFrame

2016-11-07 Thread Michael Armbrust
If you can reproduce the issue with Spark 2.0.2 I'd suggest opening a JIRA. On Fri, Nov 4, 2016 at 5:11 PM, Sam Goodwin wrote: > I have a table with a few columns, some of which are arrays. Since > upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always

Re: NoSuchElementException

2016-11-07 Thread Michael Armbrust
What are you trying to do? It looks like you are mixing multiple SparkContexts together. On Fri, Nov 4, 2016 at 5:15 PM, Lev Tsentsiper wrote: > My code throws an exception when I am trying to create new DataSet from > within SteamWriter sink > > Simplified version

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Robin East
If you have to use SGD then scaling will usually help your algorithm to converge quicker. If possible you should try using Linear Regression in the newer ml library: http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression
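
A minimal sketch of the newer ml API (a trainingDF with "label"/"features" columns is assumed; hyperparameters are arbitrary):

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.1)
      .setElasticNetParam(0.0)      // pure L2 regularization
    val model = lr.fit(trainingDF)
    println(model.coefficients)     // per-feature weights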

Re: expected behavior of Kafka dynamic topic subscription

2016-11-07 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-18272 I couldn't speculate on what the issue might be without more info. If you have time to write a test for that ticket, I'd encourage you to do so, I'm not certain how soon I'll be able to get to it. On Sun, Nov 6, 2016 at 7:31 PM, Haopu Wang

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Masood Krohy
If you go down this route (look at actual coefficients/weights), then make sure your features are scaled first and have more or less the same mean when feeding them into the algo. If not, then actual coefficients/weights wouldn't tell you much. In any case, SGD performs badly with unscaled
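
A minimal sketch of scaling with ml's StandardScaler before feeding features to SGD (column names are assumed):

    import org.apache.spark.ml.feature.StandardScaler

    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)   // zero mean; note this densifies sparse vectors
      .setWithStd(true)    // unit standard deviation
    val scaled = scaler.fit(df).transform(df)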

Spark with Ranger

2016-11-07 Thread Mudit Kumar
Hi, Does Ranger provide security to Spark (Spark Thrift Server/Spark SQL)? If yes, then in what capacity? Thanks, Mudit

Spark master shows 0 cores for executors

2016-11-07 Thread Rohit Verma
Facing a strange issue with Spark 2.0.1. When creating a Spark session with executor properties like 'spark.executor.memory':'3g',\ 'spark.executor.cores':'12',\ the Spark master shows 0 cores for executors. I found a similar issue on Stack Overflow as

VectorUDT and ml.Vector

2016-11-07 Thread Ganesh
I am trying to run an SVD on a dataframe, and I have used the ml TF-IDF, which has created a dataframe. Now, for Singular Value Decomposition, I am trying to use RowMatrix, which takes in an RDD of mllib.Vector, so I have to convert this dataframe with what I assumed was ml.Vector. However the conversion

RE: Out of memory at 60GB free memory.

2016-11-07 Thread Kürşat Kurt
I understand that I should set the executor memory. I tried with the parameters below, but the OOM still occurs... ./spark-submit --class main.scala.Test1 --master local[8] --driver-memory 20g --executor-memory 20g From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, November 7, 2016

Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Mudit Kumar
Hi, Does Ranger provide security to Spark? If yes, then in what capacity? Thanks, Mudit

mapWithState with a big initial RDD gets OOM'ed

2016-11-07 Thread Daniel Haviv
Hi, I have a stateful streaming app where I pass a rather large initialState RDD at the beginning. No matter how many partitions I divide the stateful stream into, I keep failing with OOM / Java heap space errors. Is there a way to make it more resilient? How can I control its storage level? This is
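
A minimal sketch of the knobs StateSpec does expose (mappingFunc and initialRDD are assumed to exist):

    import org.apache.spark.streaming.StateSpec

    val spec = StateSpec
      .function(mappingFunc _)     // (key, Option[value], State[S]) => mapped output
      .initialState(initialRDD)    // RDD[(KeyType, StateType)]
      .numPartitions(200)          // spread the state over more partitions
    val stateStream = keyedStream.mapWithState(spec)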

spark optimization

2016-11-07 Thread maitraythaker
Why are those two stages in Apache Spark computing the same thing? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-optimization-tp28034.html