Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Afshartous, Nick
Hello, We have a streaming job that consistently fails with the trace below. This is on an AWS EMR 4.2/Spark 1.5.2 cluster. This ticket looks related: SPARK-8112 ("Received block event count through the StreamingListener can be negative"), although it appears to have been fixed in 1.5.

Re: Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Afshartous, Nick
This seems to be a problem with Kafka brokers being in a bad state. We're restarting Kafka to resolve. -- Nick From: Ted Yu Sent: Friday, January 22, 2016 10:38 AM To: Afshartous, Nick Cc: user@spark.apache.org Subject: Re: Spark

Re: StackOverflow when computing MatrixFactorizationModel.recommendProductsForUsers

2016-01-22 Thread Ram VISWANADHA
Any help? Not sure what I am doing wrong. Best Regards, Ram From: Ram VISWANADHA > Date: Friday, January 22, 2016 at 10:25 AM To: user > Subject: StackOverflow when

Re: Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Cody Koeninger
Yes, you should query Kafka if you want to know the latest available offsets. There's code to make this straightforward in KafkaCluster.scala, but the interface isn't public. There's an outstanding pull request to expose the API at https://issues.apache.org/jira/browse/SPARK-10963 but frankly
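
For reference, a batch read against explicit offsets is possible through the public KafkaUtils.createRDD API. A minimal sketch, assuming a SparkContext sc and placeholder topic, broker, and offset values:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Placeholder broker list; the until-offsets would normally come from
    // querying Kafka for the latest available offsets per partition.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val offsetRanges = Array(
      OffsetRange("mytopic", 0, fromOffset = 0L, untilOffset = 1000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)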

First job is extremely slow due to executor heartbeat timeout (yarn-client)

2016-01-22 Thread Zhong Wang
Hi, I am deploying Spark 1.6.0 using yarn-client mode in our yarn cluster. Everything works fine, except the first job is extremely slow due to executor heartbeat RPC timeout: WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat I think this might be related to our

Re: Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Charles Chao
Thanks a lot for the help! I'll definitely check out KafkaCluster.scala. I'll probably first try to use that API from Java, and later try to build the subproject. thanks, Charles On Fri, Jan 22, 2016 at 12:26 PM, Cody Koeninger wrote: > Yes, you should query Kafka if you

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni
Thanks for the tip. I will try it. But this is the kind of thing Spark is supposed to figure out and handle, or at least not get stuck on forever. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Muthu Jayakumar Date: 01/22/2016

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
Does increasing the number of partitions help? You could try something 3 times what you currently have. Another trick I used was to partition the problem into multiple dataframes, run them sequentially, persist the results, and then run a union on the results. Hope this helps. On Fri,
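
A rough sketch of the split-and-union trick described above, assuming a DataFrame df and a hypothetical per-piece computation process:

    // Split the input into smaller DataFrames, process each sequentially,
    // persist the intermediate results, then union them back together.
    val pieces = df.randomSplit(Array(0.25, 0.25, 0.25, 0.25))
    val results = pieces.map(piece => process(piece).persist())
    val combined = results.reduce(_ unionAll _)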

Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Charles Chao
Hi, I have been using DirectKafkaInputDStream in Spark Streaming to consume Kafka messages and it's been working very well. Now I have the need to batch process messages from Kafka, for example, retrieve all messages every hour and process them, outputting to destinations like Hive or HDFS. I

Re: Disable speculative retry only for specific stages?

2016-01-22 Thread Ted Yu
Looked at: https://spark.apache.org/docs/latest/configuration.html I don't think Spark supports per-stage speculation. On Fri, Jan 22, 2016 at 10:15 AM, Adam McElwee wrote: > I've used speculative execution a couple times in the past w/ good > results, but I have one stage in
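
For reference, the speculation settings that do exist apply globally, on the SparkConf; a minimal sketch with illustrative values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")
      // Fraction of tasks that must finish before speculation kicks in.
      .set("spark.speculation.quantile", "0.90")
      // How many times slower than the median a task must be to be re-launched.
      .set("spark.speculation.multiplier", "1.5")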

Disable speculative retry only for specific stages?

2016-01-22 Thread Adam McElwee
I've used speculative execution a couple times in the past w/ good results, but I have one stage in my job with a non-idempotent operation in a `forEachPartition` block. I don't see a way to disable speculative retry on certain stages, but does anyone know of any tricks to help out here? Spark

StackOverflow when computing MatrixFactorizationModel.recommendProductsForUsers

2016-01-22 Thread Ram VISWANADHA
Hi, I am getting this StackOverflowError when fetching recommendations from ALS. Any help is much appreciated. int features = 100; double alpha = 0.1; double lambda = 0.001; boolean implicit = true; int iterations = 10; ALS als = new ALS()
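
One common remedy for StackOverflowError in iterative MLlib jobs is checkpointing, which truncates the RDD lineage that grows with each iteration. A sketch in Scala, not a confirmed fix for this report, with a placeholder checkpoint path and the parameters from the snippet above:

    import org.apache.spark.mllib.recommendation.ALS

    // Checkpointing breaks the long lineage chain built up across iterations.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val als = new ALS()
      .setRank(100)
      .setIterations(10)
      .setLambda(0.001)
      .setAlpha(0.1)
      .setImplicitPrefs(true)
      .setCheckpointInterval(5)
    // val model = als.run(ratings)  // ratings: an RDD[Rating], assumed to exist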

SparkR works from command line but not from rstudio

2016-01-22 Thread Sandeep Khurana
Hello, I installed Spark in a folder. I start bin/sparkR in a console. Then I execute the commands below and everything works fine. I can see the data as well. hivecontext <<- sparkRHive.init(sc) ; df <- loadDF(hivecontext, "/someHdfsPath", "orc") showDF(df) But when I run the same from RStudio, it throws the

spark-streaming with checkpointing: error with sparkOnHBase lib

2016-01-22 Thread vinay gupta
Hi, I have a spark-streaming application which uses the sparkOnHBase lib to do streamBulkPut(). Without checkpointing everything works fine, but recently upon enabling checkpointing I got the following exception: 16/01/22 01:32:35 ERROR executor.Executor: Exception in task 0.0 in stage 39.0 (TID

Re: storing query object

2016-01-22 Thread Ted Yu
There have been optimizations in this area, such as: https://issues.apache.org/jira/browse/SPARK-8125 You can also look at the parent issue. Which Spark release are you using? > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta > wrote: > > > Hi, > > I have a SPARK

Re: storing query object

2016-01-22 Thread Gourav Sengupta
Hi Ted, I am using SPARK 1.5.2 as available currently in AWS EMR 4x. The data is in TSV format. I do not see any effect of the work already done on this for data stored in HIVE, as it takes around 50 mins just to collect the table metadata over a 40 node cluster, and the time is much the same

Re: Spark Streaming - Custom ReceiverInputDStream ( Custom Source) In java

2016-01-22 Thread Nagu Kothapalli
Hi, anyone have any idea on ClassTag in Spark context? On Fri, Jan 22, 2016 at 12:42 PM, Nagu Kothapalli wrote: > Hi All > > Facing an issue with a CustomInputDStream object in Java: > > public CustomInputDStream(StreamingContext ssc_, ClassTag classTag)

Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-22 Thread Joshua TAYLOR
(Apologies if this comes through twice; I sent it once before I'd confirmed my mailing list subscription.) I've been having lots of trouble with DataFrames whose columns have dots in their names today. I know that in many places, backticks can be used to quote column names, but the problem I'm

Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-22 Thread Joshua TAYLOR
I've been having lots of trouble with DataFrames whose columns have dots in their names today. I know that in many places, backticks can be used to quote column names, but the problem I'm running into now is that I can't drop a column that has *no* dots in its name when there are *other* columns

Re: Spark Streaming: BatchDuration and Processing time

2016-01-22 Thread Lin Zhao
Hi Silvio, Can you go into a little detail about how the back pressure works? Does it block the receiver? Or does it temporarily save the incoming messages to memory/disk? I have a custom actor receiver that uses store() to save data to Spark. Would the back pressure make the store() call block? On 1/17/16,
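
For reference, back pressure (Spark 1.5+) is switched on with a single setting; a minimal sketch:

    import org.apache.spark.SparkConf

    // With back pressure enabled, Spark adjusts the ingestion rate based on
    // batch scheduling delay and processing time.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")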

Fwd: storing query object

2016-01-22 Thread Gourav Sengupta
Hi, I have a SPARK table (created from hiveContext) with a couple of hundred partitions and a few thousand files. When I run a query on the table, Spark spends a lot of time (as seen in the pyspark output) collecting these files from the several partitions. After this the query starts running. Is

[Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-22 Thread Raju Bairishetti
Hi, I am very new to Spark & Spark Streaming. I am planning to use Spark Streaming for real time processing. I have created a streaming context and am checkpointing to an HDFS directory for recovery purposes in case of executor and driver failures. I am creating the DStream with an offset map

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni
Me too. I had to shrink my dataset to get it to work. For us at least Spark seems to have scaling issues. Sent from my Verizon Wireless 4G LTE smartphone Original message From: "Sanders, Isaac B" Date: 01/21/2016 11:18 PM (GMT-05:00) To:

spark streaming input rate strange

2016-01-22 Thread patcharee
Hi, I have a streaming application with: - 1 sec interval - accepts data from a simulation through a MulticastSocket The simulation sends out data using multiple clients/threads every 1 sec interval. The input rate accepted by the streaming application looks strange. - When clients = 10,000 the event rate

Re: storing query object

2016-01-22 Thread Ted Yu
In SQLConf.scala, I found this: val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf( key = "spark.sql.sources.parallelPartitionDiscovery.threshold", defaultValue = Some(32), doc = "The degree of parallelism for schema merging and partition discovery of " + "Parquet data
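
Based on the config quoted above, the threshold can be adjusted per session; a sketch with an illustrative value:

    // If the number of input paths exceeds this threshold, discovery can run
    // as a distributed job instead of sequentially on the driver.
    sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")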

Re: retrieve cell value from a rowMatrix.

2016-01-22 Thread zhangjp
Hi Srini, If you want to get a value, use something like the following example in Scala (other languages are similar): " mat.rows.collect().apply(i) val cov = mat.computeCovariance() cov.apply(i, j) " where mat is of type RowMatrix and cov is of type Matrix.

Application SUCCESS/FAILURE status using spark API

2016-01-22 Thread Raghvendra Singh
Hi, Does anybody know how we can get the application status of a Spark app using the API? Currently it's giving only the completed status as true/false. I am trying to build an application manager kind of thing where one can see the apps deployed and their status, and do some actions based on

Re: Concurrent Spark jobs

2016-01-22 Thread Eugene Morozov
Emlyn, Have you considered using pools? http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools I haven't tried it myself, but it looks like the pool setting is applied per thread, which means it's possible to configure the fair scheduler so that more than one job is on a
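
A minimal sketch of the per-thread pool assignment the docs describe (the pool name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to "pool1".
    sc.setLocalProperty("spark.scheduler.pool", "pool1")
    // ... run actions here ...
    sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool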

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ted Yu
The class paths on the driver and the executors are formed differently. Cheers On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale wrote: > Is this issue only when the computations are in distributed mode? > If I do (pseudo code): > rdd.collect.call_to_hbase I don't get this error,

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
I tried --jars, which supposedly does that, but it did not work. On Fri, Jan 22, 2016 at 4:33 PM Ajinkya Kale wrote: > Hi Ted, > Is there a way for the executors to have the hbase-protocol jar on their > classpath ? > > On Fri, Jan 22, 2016 at 4:00 PM Ted Yu
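
One approach worth trying when --jars isn't enough is to put the jar on both classpaths explicitly, e.g. in spark-defaults.conf or via --conf on spark-submit; a sketch with placeholder paths:

    # hbase-protocol must be visible to both the driver and the executors;
    # the paths below are placeholders for wherever the jar lives on the nodes.
    spark.driver.extraClassPath    /opt/hbase/lib/hbase-protocol.jar
    spark.executor.extraClassPath  /opt/hbase/lib/hbase-protocol.jar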

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
Is this issue only when the computations are in distributed mode? If I do (pseudo code): rdd.collect.call_to_hbase I don't get this error, but if I do: rdd.call_to_hbase.collect it throws this error. On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale wrote: > Unfortunately

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
Hi Ted, Is there a way for the executors to have the hbase-protocol jar on their classpath? On Fri, Jan 22, 2016 at 4:00 PM Ted Yu wrote: > The class path formations on driver and executors are different. > > Cheers > > On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale

RE: Spark Cassandra clusters

2016-01-22 Thread Mohammed Guller
Vivek, By default, Cassandra uses ¼ of the system memory, so in your case it will be around 8GB, which is fine. If you have more Cassandra-related questions, it is better to post them on the Cassandra mailing list. Also feel free to email me directly. Mohammed Author: Big Data Analytics with

How to send a file to database using spark streaming

2016-01-22 Thread Sree Eedupuganti
New to Spark Streaming. My question is: I want to load XML files into a database [Cassandra] using Spark Streaming. Any suggestions, please? Thanks in advance. -- Best Regards, Sreeharsha Eedupuganti Data Engineer innData Analytics Private Limited
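
One possible shape for this, assuming the DataStax spark-cassandra-connector, a directory-watching file stream, and a hypothetical XML parser and schema; a sketch, not a drop-in solution:

    import com.datastax.spark.connector.streaming._
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Record(id: String, payload: String)

    // Hypothetical parser mapping one line of XML to a row.
    def parseXml(line: String): Record = ???

    val conf = new SparkConf().set("spark.cassandra.connection.host", "10.0.0.1")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Watch a directory for newly arrived XML files.
    val xmlLines = ssc.textFileStream("hdfs:///incoming/xml")
    xmlLines.map(parseXml).saveToCassandra("my_keyspace", "my_table")

    ssc.start()
    ssc.awaitTermination()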

looking for a spark admin consultant/contractor

2016-01-22 Thread Andy Davidson
I am working on a proof of concept using Spark. I set up a small cluster on AWS. I am looking for some part-time help with administration. Kind Regards Andy

Re: SparkR works from command line but not from rstudio

2016-01-22 Thread Sandeep Khurana
This problem is fixed by restarting R from RStudio. Now I see: 16/01/22 08:08:38 INFO HiveMetaStore: No user is added in admin role, since config is empty 16/01/22 08:08:38 ERROR RBackendHandler: on org.apache.spark.sql.hive.HiveContext failed Error in value[[3L]](cond) : Spark SQL is not built with

Re: TaskCommitDenied (Driver denied task commit)

2016-01-22 Thread Arun Luthra
Correction. I have to use spark.yarn.am.memoryOverhead because I'm in YARN client mode. I set it to 13% of the executor memory. Also quite helpful was increasing the total overall executor memory. It will be great when the Tungsten enhancements make their way into RDDs. Thanks! Arun On Thu, Jan
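
For reference, the overhead settings mentioned (values in MB, illustrative only):

    # In yarn-client mode the application master is separate from the driver,
    # so its overhead is set via spark.yarn.am.memoryOverhead;
    # executors use spark.yarn.executor.memoryOverhead.
    spark.yarn.am.memoryOverhead        1024
    spark.yarn.executor.memoryOverhead  2048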

RE: Date / time stuff with spark.

2016-01-22 Thread Spencer, Alex (Santander)
Hi Andy, Sorry this is in Scala, but you may be able to do something similar. I use Joda's DateTime class. I ran into a lot of difficulties with the serializer, but if you are an admin on the box you'll have fewer issues by adding in some Kryo serializers. import org.joda.time val
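
A minimal sketch of the Kryo route (registering the class alone may or may not be sufficient; some setups also need a dedicated Joda serializer):

    import org.apache.spark.SparkConf
    import org.joda.time.DateTime

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration avoids serializing the full class name with every record.
      .registerKryoClasses(Array(classOf[DateTime]))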

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-22 Thread Cody Koeninger
Offsets are stored in the checkpoint. If you want to manage offsets yourself, don't restart from the checkpoint, specify the starting offsets when you create the stream. Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once Regarding the small files
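
A sketch of passing explicit starting offsets to the direct stream, assuming a StreamingContext ssc and kafkaParams are in scope (topic, partition, and offset are placeholders):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Offsets previously saved by the application, keyed by topic/partition.
    val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 12345L)

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))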

Re: Date / time stuff with spark.

2016-01-22 Thread Durgesh Verma
Option B is good too: have the date as a timestamp and format it later. Thanks, -Durgesh > On Jan 22, 2016, at 9:50 AM, Spencer, Alex (Santander) > wrote: > > Hi Andy, > > Sorry this is in Scala but you may be able to do something similar? I use > Joda's

Re: Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Ted Yu
Is it possible to reproduce the condition below with test code ? Thanks On Fri, Jan 22, 2016 at 7:31 AM, Afshartous, Nick wrote: > > Hello, > > > We have a streaming job that consistently fails with the trace below. > This is on an AWS EMR 4.2/Spark 1.5.2 cluster. > >

Re: Date / time stuff with spark.

2016-01-22 Thread Ted Yu
Related thread: http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime FYI On Fri, Jan 22, 2016 at 6:50 AM, Spencer, Alex (Santander) < alex.spen...@santander.co.uk.invalid> wrote: > Hi Andy, > > Sorry this is in Scala but you may be able to do something similar? I use >

Re: has any one implemented TF_IDF using ML transformers?

2016-01-22 Thread Andy Davidson
Hi Yanbo, I recently coded up the trivial example from http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html I do not get the same results. I'll put my code up on github over the weekend if anyone is interested. Andy From: Yanbo Liang Date:

Tool for Visualization /Plotting of K means cluster

2016-01-22 Thread Ashutosh Kumar
I am looking for an easy-to-use visualization tool for the KMeansModel produced as a result of clustering. Thanks Ashutosh

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
If you turn on config like "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps", you will be able to see why some jobs run for a long time. The tuning guide (http://spark.apache.org/docs/latest/tuning.html) provides some insight on this. Setting up explicit partitioning helped in my case when I was using
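
For reference, those flags can be passed to the executor JVMs through extraJavaOptions; a minimal sketch:

    import org.apache.spark.SparkConf

    // GC details then appear in each executor's stdout/stderr logs.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")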

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
Vivek: I searched for 'cassandra gc pause' and found a few hits, e.g.: http://search-hadoop.com/m/qZFqM1c5nrn1Ihwf6=Re+GC+pauses+affecting+entire+cluster+ Keep in mind the effect of GC on shared nodes. FYI On Fri, Jan 22, 2016 at 7:09 PM, Mohammed Guller wrote: > For

Spark LDA

2016-01-22 Thread Ilya Ganelin
Hi all - I'm running the Spark LDA algorithm on a dataset of roughly 3 million terms, with a resulting RDD of approximately 20 GB, on a 5 node cluster with 10 executors (3 cores each) and 14GB of memory per executor. As the application runs, I'm seeing progressively longer execution times for the
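
One thing worth checking here, offered as a guess rather than a diagnosis: progressively longer iterations can come from lineage growing without bound, and MLlib's LDA supports periodic checkpointing (path and interval below are placeholders):

    import org.apache.spark.mllib.clustering.LDA

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val lda = new LDA()
      .setK(100)
      // Checkpoint every 10 iterations to keep the lineage short.
      .setCheckpointInterval(10)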

Re: Spark Cassandra clusters

2016-01-22 Thread Durgesh Verma
This may be useful, you can try connectors. https://academy.datastax.com/demos/getting-started-apache-spark-and-cassandra https://spark-summit.org/2015/events/cassandra-and-spark-optimizing-for-data-locality/ Thanks, -Durgesh > On Jan 22, 2016, at 8:37 PM, >

Re: Spark Cassandra clusters

2016-01-22 Thread vivek.meghanathan
+ spark standalone cluster On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) > wrote: We have the setup on Google Cloud Platform. Each node has 8 CPUs + 30GB memory. 10 nodes for Spark, another 9 nodes for Cassandra.

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
From your description, putting the Cassandra daemon on the Spark cluster should be feasible. One aspect to be measured is how much locality can be achieved in this setup - Cassandra is a distributed NoSQL store. Cheers On Fri, Jan 22, 2016 at 6:13 PM, wrote: > + spark

Re: Spark Cassandra clusters

2016-01-22 Thread vivek.meghanathan
Thanks Ted. Also, what is the suggested memory setting for the Cassandra process? Regards Vivek On Sat, Jan 23, 2016 at 7:57 am, Ted Yu > wrote: >From your description, putting Cassandra daemon on Spark cluster should be >feasible. One aspect to be

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
I am not a Cassandra developer :-) Can you use http://search-hadoop.com/ or ask on the Cassandra mailing list? Cheers On Fri, Jan 22, 2016 at 6:35 PM, wrote: > Thanks Ted, also what is the suggested memory setting for Cassandra > process? > > Regards > Vivek > On Sat,

RE: Date / time stuff with spark.

2016-01-22 Thread Mohammed Guller
Hi Andrew, Here is another option. You can define a custom schema to specify the correct type for the time column, as shown below: import org.apache.spark.sql.types._ val customSchema = StructType( StructField("a", IntegerType, false) :: StructField("b", LongType, false) ::
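
A fuller version of that idea, sketched with hypothetical column names and assuming the spark-csv package for the load:

    import org.apache.spark.sql.types._

    // "time" is typed as TimestampType up front instead of being parsed later.
    val customSchema = StructType(
      StructField("a", IntegerType, false) ::
      StructField("b", LongType, false) ::
      StructField("time", TimestampType, false) :: Nil)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("data.csv")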

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
Can you give us a bit more information? How much memory does each node have? What's the current heap allocation for the Cassandra process and the executors? Which Spark / Cassandra releases are you using? Thanks On Fri, Jan 22, 2016 at 5:37 PM, wrote: > Hi All, > What is the

Spark Cassandra clusters

2016-01-22 Thread vivek.meghanathan
Hi All, What is the right Spark + Cassandra cluster setup: having the Cassandra cluster and the Spark cluster on different nodes, or should they be on the same nodes? We have them on different nodes, and performance tests show very bad results for the Spark streaming jobs. Please let us know. Regards