Sparse vs. Dense vector memory usage

2021-08-02 Thread Gerard Maas
Dear Spark folks, Is there a guideline somewhere on the density tipping point at which it makes more sense to use a Spark ML dense vector vs. a sparse vector, with regard to memory usage, on fairly large (image processing) vectors? My google-fu didn't turn up anything useful. Thanks in

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Gerard Maas
Hi Srinivas, Reading from different brokers is possible but you need to connect to each Kafka cluster separately. Trying to mix connections to two different Kafka clusters in one subscriber is not supported. (I'm sure that it would give all kinds of weird errors) The "kafka.bootstrap.servers"

Re: Spark Streaming not working

2020-04-14 Thread Gerard Maas
Hi, Could you share the code that you're using to configure the connection to the Kafka broker? This is a bread-and-butter feature. My first thought is that there's something in your particular setup that prevents this from working. kind regards, Gerard. On Fri, Apr 10, 2020 at 7:34 PM

Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Ooops - linked the wrong JIRA ticket: (that other one is related) https://issues.apache.org/jira/browse/SPARK-28025 On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote: > Hi! > I would like to socialize this issue we are currently facing: > The Structured Streaming default CheckpointFi

[StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Hi! I would like to socialize this issue we are currently facing: The Structured Streaming default CheckpointFileManager leaks .crc files by leaving them behind after users of this class (like HDFSBackedStateStoreProvider) apply their cleanup methods. This results in an unbounded creation of tiny

Re: The following Java MR code works for small dataset but throws(arrayindexoutofBound) error for large dataset

2019-05-09 Thread Gerard Maas
Hi, I'm afraid you sent this email to the wrong Mailing list. This is the Spark users mailing list. We could probably tell you how to do this with Spark, but I think that's not your intention :) kr, Gerard. On Thu, May 9, 2019 at 11:03 AM Balakumar iyer S wrote: > Hi All, > > I am trying to

Re: Convert RDD[Iterrable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread Gerard Maas
James, How do you create an instance of `RDD[Iterable[MyCaseClass]]` ? Is it in that first code snippet? > new SparkContext(sc).parallelize(seq)? kr, Gerard On Fri, Nov 30, 2018 at 3:02 PM James Starks wrote: > When processing data, I create an instance of RDD[Iterable[MyCaseClass]] > and

Re: Re: About the question of Spark Structured Streaming window output

2018-08-27 Thread Gerard Maas
08-27 09:53:00"? > When I define the window, the starttime is not set. > 2、why the agg result of time "2018-08-27 09:53:00 " is not output when > the batch1 data is comming? > > Thanks a lot! > > > > -- > z...@zjdex.com >

Re: How to convert Spark Streaming to Static Dataframe on the fly and pass it to a ML Model as batch

2018-08-14 Thread Gerard Maas
Hi Aakash, In Spark Streaming, forEachRDD provides you access to the data in each micro batch. You can transform that RDD into a DataFrame and implement the flow you describe. eg.: var historyRDD:RDD[mytype] = sparkContext.emptyRDD // create Kafka Dstream ... dstream.foreachRDD{ rdd => val
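A minimal sketch of that flow, assuming an existing SparkSession ('spark'), a DStream of already-parsed MyType records ('dstream') and a hypothetical pre-trained model; all names are placeholders, not the original poster's code:

  import org.apache.spark.rdd.RDD

  case class MyType(id: String, value: Double)

  var historyRDD: RDD[MyType] = spark.sparkContext.emptyRDD[MyType]

  dstream.foreachRDD { rdd =>
    import spark.implicits._
    val batchDF = rdd.toDF()                 // the micro-batch as a static DataFrame
    // val scored = model.transform(batchDF) // feed the batch to the (hypothetical) ML model here
    historyRDD = historyRDD.union(rdd)       // accumulate history across batches
  }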

Re: [STRUCTURED STREAM] Join static dataset in state function (flatMapGroupsWithState)

2018-07-19 Thread Gerard Maas
Hi Chris, Could you show the code you are using? When you mention "I like to use a static datasource (JDBC) in the state function" do you refer to a DataFrame from a JDBC source or an independent JDBC connection? The key point to consider is that the flatMapGroupsWithState function must be

Re: How to reduceByKeyAndWindow in Structured Streaming?

2018-06-28 Thread Gerard Maas
Hi, In Structured Streaming lingo, "ReduceByKeyAndWindow" would be a window aggregation with a composite key. Something like: stream.groupBy($"key", window($"timestamp", "5 minutes")) .agg(sum($"value") as "total") The aggregate could be any supported SQL function. Is this what you
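A runnable sketch of that aggregation, using the built-in rate source as a stand-in for the real input (the column names are assumptions):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{window, sum}

  val spark = SparkSession.builder.appName("windowed-agg").getOrCreate()
  import spark.implicits._

  // stand-in stream with 'key', 'timestamp' and 'value' columns
  val stream = spark.readStream.format("rate").load()
    .selectExpr("CAST(value % 10 AS STRING) AS key", "timestamp", "value")

  val totals = stream
    .groupBy($"key", window($"timestamp", "5 minutes"))
    .agg(sum($"value") as "total")

  totals.writeStream.outputMode("update").format("console").start().awaitTermination()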

Re: [Spark Streaming] Measure latency

2018-06-26 Thread Gerard Maas
Hi Daniele, A pragmatic approach to do that would be to execute the computations in the scope of a foreachRDD, surrounded by wall-clock timers. For example: dstream.foreachRDD{ rdd => val t0 = System.currentTimeMillis() val aggregates = rdd. // make sure you get a result here, not
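Completing the timing sketch above; the count is just a stand-in action that forces the computation to run inside the timed block:

  dstream.foreachRDD { rdd =>
    val t0 = System.currentTimeMillis()
    val processed = rdd.count()                  // any action that materializes the result
    val elapsed = System.currentTimeMillis() - t0
    println(s"batch of $processed records took $elapsed ms")
  }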

Re: Advice on multiple streaming job

2018-05-07 Thread Gerard Maas
Dhaval, Which Streaming API are you using? In Structured Streaming, you are able to start several streaming queries within the same context. kind regards, Gerard. On Sun, May 6, 2018 at 7:59 PM, Dhaval Modi wrote: > Hi Susan, > > Thanks for your response. > > Will try
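A minimal Structured Streaming sketch of several queries started in the same session, with the built-in rate source as a stand-in input:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("multi-query").getOrCreate()

  val source = spark.readStream.format("rate").load()

  val q1 = source.writeStream.format("console").queryName("raw").start()
  val q2 = source.groupBy("value").count()
    .writeStream.outputMode("complete").format("console").queryName("counts").start()

  spark.streams.awaitAnyTermination()   // don't block on a single query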

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Gerard Maas
Aakash, There are two issues here. The issue with the code on the first question is that the first query blocks and the code for the second does not get executed. Panagiotis pointed this out correctly. In the updated code, the issue is related to netcat (nc) and the way structured streaming

[Structured Streaming] File source, Parquet format: use of the mergeSchema option.

2018-04-12 Thread Gerard Maas
Hi, I'm looking into the Parquet format support for the File source in Structured Streaming. The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1] What would be the practical use of that in a streaming context? In its batch counterpart,

Re: Scala - Spark for beginners

2018-03-18 Thread Gerard Maas
This is a good start: https://github.com/deanwampler/JustEnoughScalaForSpark And the corresponding talk: https://www.youtube.com/watch?v=LBoSgiLV_NQ There're many more resources if you search for it. -kr, Gerard. On Sun, Mar 18, 2018 at 11:15 AM, Mahender Sarangam <

Re: Spark StreamingContext Question

2018-03-07 Thread Gerard Maas
Hi, You can run as many jobs in your cluster as you want, provided you have enough capacity. The one-streaming-context constraint is per job. You can submit several jobs for Flume and some others for Twitter, Kafka, etc... If you are getting started with Streaming with Spark, I'd recommend you to

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gerard Maas
Hi, You can monitor a filesystem directory as streaming source as long as the files placed there are atomically copied/moved into the directory. Updating the files is not supported. kr, Gerard. On Mon, Jan 15, 2018 at 11:41 PM, kant kodali wrote: > Hi All, > > I am
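A minimal sketch of a directory used as a streaming source in Structured Streaming; the path is a placeholder, and files must be moved into it atomically as noted above:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("dir-source").getOrCreate()

  val lines = spark.readStream.textFile("hdfs:///data/incoming")

  lines.writeStream.format("console").start().awaitTermination()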

Re: Spark Streaming with Confluent

2017-12-13 Thread Gerard Maas
Hi Arkadiusz, Try 'rooting' your import. It looks like the import is being interpreted as being relative to the previous. 'rooting; is done by adding the '_root_' prefix to your import: import org.apache.spark.streaming.kafka.KafkaUtils import
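An illustration of the rooted import; the surrounding package name is hypothetical and only there to show how a relative import can be shadowed:

  package org.mycompany.jobs            // 'org' here is what can shadow the Spark import

  import _root_.org.apache.spark.streaming.kafka.KafkaUtils

  object RootedImportExample {
    // KafkaUtils now resolves from the root package, not relative to org.mycompany
  }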

Re: Union of RDDs Hung

2017-12-12 Thread Gerard Maas
Can you show us the code? On Tue, Dec 12, 2017 at 9:02 AM, Vikash Pareek wrote: > Hi All, > > I am unioning 2 rdds(each of them having 2 records) but this union it is > getting hang. > I found a solution to this that is caching both the rdds before performing > union

Re: Do I need to do .collect inside forEachRDD

2017-12-06 Thread Gerard Maas
executor log but let me run again to > make sure. > > @Gerard Thanks much! but would your answer on .collect() change depending > on running the spark app in client vs cluster mode? > > Thanks! > > On Tue, Dec 5, 2017 at 1:54 PM, Gerard Maas <gerard.m...@gmail.com> wrot

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Gerard Maas
The general answer to your initial question is that "it depends". If the operation in the rdd.foreach() closure can be parallelized, then you don't need to collect first. If it needs some local context (e.g. a socket connection), then you need to do rdd.collect first to bring the data locally,
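Both variants as sketches; 'process' and 'driverSideSink' are hypothetical placeholders for the actual per-record work:

  // 1) per-record work that parallelizes: keep it on the executors
  dstream.foreachRDD { rdd =>
    rdd.foreach(record => process(record))      // 'process' must be serializable
  }

  // 2) work that needs driver-local context (e.g. a socket held by the driver)
  dstream.foreachRDD { rdd =>
    val batch = rdd.collect()                   // beware: the whole batch must fit in driver memory
    batch.foreach(record => driverSideSink.write(record))
  }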

Re: Issue Storing offset in Kafka for Spark Streaming Application

2017-10-13 Thread Gerard Maas
Hi Arpan, The error suggests that the streaming context has been started with streamingContext.start() and after that statement, some other dstream operations have been attempted. A suggested pattern to manage the offsets is the following: var offsetRanges: Array[OffsetRange] = _ //create
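A sketch of that offset pattern based on the documented direct-stream approach (spark-streaming-kafka-0-10); 'stream' is assumed to be the direct Kafka input DStream and the actual sink is omitted:

  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

  var offsetRanges = Array.empty[OffsetRange]

  stream.foreachRDD { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // capture this batch's offsets

    // ... process and write the batch to the sink here ...

    // commit back to Kafka only after the output succeeded
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }

  ssc.start()                 // start only after all DStream operations are declared
  ssc.awaitTermination()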

Re: Spark Streaming - Multiple Spark Contexts (SparkSQL) Performance

2017-10-01 Thread Gerard Maas
Hammad, The recommended way to implement this logic would be to: Create a SparkSession. Create a Streaming Context using the SparkContext embedded in the SparkSession Use the single SparkSession instance for the SQL operations within the foreachRDD. It's important to note that spark operations
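A sketch of that structure, with a socket source as a placeholder input:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val spark = SparkSession.builder.appName("streaming-sql").getOrCreate()
  val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

  val dstream = ssc.socketTextStream("localhost", 9999)

  dstream.foreachRDD { rdd =>
    import spark.implicits._
    val df = rdd.toDF("line")          // reuse the single SparkSession for all SQL work
    df.createOrReplaceTempView("lines")
    spark.sql("SELECT count(*) FROM lines").show()
  }

  ssc.start()
  ssc.awaitTermination()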

Re: [StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-13 Thread Gerard Maas
> An alternative to the socket source issue would be to open a new free >> socket, but then the user has to figure out where the source is listening. >> >> I second Gerard's request for additional information, and confirmation of >> my theories! >> >> Thanks,

[StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-11 Thread Gerard Maas
Hi, I've been investigating this SO question: https://stackoverflow.com/questions/45618489/executing-separate-streaming-queries-in-spark-structured-streaming TL;DR: when using the Socket source, trying to create multiple queries does not work properly, only the first query in the start order

Re: Need Spark(Scala) Performance Tuning tips

2017-06-09 Thread Gerard Maas
also, read the newest book of Holden on High-Performance Spark: http://shop.oreilly.com/product/0636920046967.do On Fri, Jun 9, 2017 at 5:38 PM, Alonso Isidoro Roman wrote: > a quick search on google: > > https://www.cloudera.com/documentation/enterprise/5-9- >

Re: How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread Gerard Maas
It looks like the clean up should go into the foreachRDD function: stateUpdateStream.foreachRdd(...) { rdd => // do stuff with the rdd stateUpdater.cleanupExternalService// should work in this position } Code within the foreachRDD(*) executes on the driver, so you can keep the state of

[StackOverflow] Size exceeds Integer.MAX_VALUE When Joining 2 Large DFs

2016-11-25 Thread Gerard Maas
This question seems to deserve an escalation from Stack Overflow: http://stackoverflow.com/questions/40803969/spark-size-exceeds-integer-max-value-when-joining-2-large-dfs Looks like an important limitation. -kr, Gerard. Meta PS: What do you think would be the best way to escalate from SO?

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Gerard Maas
din.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 26 May 2016 at 19:09, Ge

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
t could be is Hive dependency collisions from the >> classpath, but that shouldn’t be an issue since you said it’s standalone >> (not a Hadoop distro right?). >> >> >> >> Thanks, >> >> Silvio >> >> >> >> *From: *Gerard Maas <

HiveContext standalone => without a Hive metastore

2016-05-26 Thread Gerard Maas
Hi, I'm helping some folks set up an analytics cluster with Spark. They want to use the HiveContext to enable the Window functions on DataFrames(*) but they don't have any Hive installation, nor do they need one at the moment (if not necessary for this feature) When we try to create a Hive

Re: Create one DB connection per executor

2016-03-24 Thread Gerard Maas
Hi Manas, The approach is correct, with one caveat: You may have several tasks executing in parallel in one executor. Having one single connection per JVM will either fail, if the connection is not thread-safe or become a bottleneck b/c all task will be competing for the same resource. The best
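A sketch of one shared connection factory per executor JVM; the JDBC URL is a placeholder, and a real pool (e.g. HikariCP) is preferable since several tasks can run concurrently in one executor:

  import java.sql.{Connection, DriverManager}

  object Connections {                                   // one instance per executor JVM
    val url = "jdbc:postgresql://db-host:5432/mydb"      // placeholder connection string
    def get(): Connection = DriverManager.getConnection(url)
  }

  rdd.foreachPartition { records =>
    val conn = Connections.get()                         // one connection per partition/task
    try records.foreach { r =>
      // write r using conn here (placeholder for the actual persistence logic)
    }
    finally conn.close()
  }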

Re: Evaluating spark streaming use case

2016-02-21 Thread Gerard Maas
It sounds like another window operation on top of the 30-min window will achieve the desired objective. Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl) to a long enough value and you will require enough resources (mem & disk) to keep the required data. -kr, Gerard.

Hadoop credentials missing in some tasks?

2016-02-05 Thread Gerard Maas
Hi, We're facing a situation where simple queries to parquet files stored in Swift through a Hive Metastore sometimes fail with this exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 58.0 failed 4 times, most recent failure: Lost task 6.3 in stage 58.0

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Gerard Maas
What are you trying to achieve? Looks like you want to provide offsets but you're not managing them and I'm assuming you're using the direct stream approach. In that case, use the simpler constructor that takes the kafka config and the topics. Let it figure out the offsets (it will contact

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Gerard Maas
he partition > and provide those to direct API. > > So my question is should i consider passing all the partition from command > line and query kafka and find and provide , what is the correct approach. > > Ashish > > On Mon, Jan 25, 2016 at 11:38 AM, Gerard Maas <gerard.m...@gmail

Re: Inconsistent data in Cassandra

2015-12-13 Thread Gerard Maas
Hi Padma, Have you considered reducing the dataset before writing it to Cassandra? Looks like this consistency problem could be avoided by cleaning the dataset of unnecessary records before persisting it: val onlyMax = rddByPrimaryKey.reduceByKey{case (x,y) => Max(x,y)} // your max function
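Completing the reduction hinted at above; 'rddByPrimaryKey' is assumed to be an RDD of (key, Long) pairs, so adapt the max to your value type:

  val onlyMax = rddByPrimaryKey.reduceByKey((x, y) => math.max(x, y))
  // onlyMax.saveToCassandra("my_keyspace", "my_table")   // spark-cassandra-connector call, placeholder names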

Re: flatMap function in Spark

2015-12-08 Thread Gerard Maas
http://stackoverflow.com/search?q=%5Bapache-spark%5D+flatmap -kr, Gerard. On Tue, Dec 8, 2015 at 12:04 PM, Sateesh Karuturi < sateesh.karutu...@gmail.com> wrote: > Guys... I am new to Spark.. > Please anyone please explain me how flatMap function works with a little > sample example... > Thanks

Re: Spark DStream Data stored out of order in Cassandra

2015-11-30 Thread Gerard Maas
Spark Streaming will consume and process data in parallel. So the order of the output will depend not only on the order of the input but also on the time it takes for each task to process. Different options, like repartitions, sorts and shuffles at Spark level will also affect ordering, so the

Re: streaming: missing data. does saveAsTextFile() append or replace?

2015-11-08 Thread Gerard Maas
Andy, Using the rdd.saveAsTextFile(...) will overwrite the data if your target is the same file. If you want to save to HDFS, DStream offers dstream.saveAsTextFiles(prefix, suffix) where a new file will be written at each streaming interval. Note that this will result in a saved file for each

Re: How to check whether the RDD is empty or not

2015-10-21 Thread Gerard Maas
As TD mentions, there's no such thing as an 'empty DStream'. Some intervals of a DStream could be empty, in which case the related RDD will be empty. This means that you should express such condition based on the RDD's of the DStream. Translated in code: dstream.foreachRDD{ rdd => if
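Completing the condition above; the text-file sink is just a placeholder output:

  dstream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // only do the (potentially expensive) output when the interval actually produced data
      rdd.saveAsTextFile(s"/data/out/batch-${System.currentTimeMillis()}")
    }
  }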

Re: Is there a way to create multiple streams in spark streaming?

2015-10-20 Thread Gerard Maas
You can create as many functional derivates of your original stream by using transformations. That's exactly the model that Spark Streaming offers. In your example, that would become something like: val stream = ssc.socketTextStream("localhost", ) val stream1 = stream.map(fun1) val stream2 =
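Completing that snippet; the port and fun1/fun2 are placeholders for your own transformations:

  val stream  = ssc.socketTextStream("localhost", 9999)
  val stream1 = stream.map(fun1)           // one logical stream per transformation
  val stream2 = stream.filter(fun2)

  stream1.print()                          // each derived stream gets its own output operation
  stream2.saveAsTextFiles("hdfs:///data/stream2/part")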

Node afinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
In the receiver-based kafka streaming model, given that each receiver starts as a long-running task, one can rely on a certain degree of data locality based on the kafka partitioning: Data published on a given topic/partition will land on the same spark streaming receiving node until the receiver

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
e not very reliable, regardless of > what consumer you use. Even if you have locality preferences, and locality > wait turned up really high, you still have to account for losing executors. > > On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas <gerard.m...@gmail.com> > wrote: > >&g

Re: Node afinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
to the Kafka partitions host . So if Kafka and Spark are co hosted >> probably this will work. If not, I am not sure how to get data locality for >> a partition. >> Others, >> correct me if there is a way. >> >> On Wed, Oct 14, 2015 at 3:08 PM, Gerard Maas <gerard.m...@g

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-07 Thread Gerard Maas
cs, nothing currently. There is an info-level log >>>> message every time a kafka rdd iterator is instantiated, >>>> >>>> log.info(s"Computing topic ${part.topic}, partition >>>> ${part.partition} " + >>>> >>>

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Gerard Maas
ions. > > On Tue, Oct 6, 2015 at 11:45 AM, Gerard Maas <gerard.m...@gmail.com> > wrote: > >> Hi, >> >> We recently migrated our streaming jobs to the direct kafka receiver. Our >> initial migration went quite fine but now we are seeing a weird zig-zag >

Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Gerard Maas
Hi, We recently migrated our streaming jobs to the direct kafka receiver. Our initial migration went quite fine but now we are seeing a weird zig-zag performance pattern we cannot explain. In alternating fashion, one task takes about 1 second to finish and the next takes 7sec for a stable

Re: Kafka Direct Stream

2015-10-03 Thread Gerard Maas
dd = rdd.collect { case (t, data) if t == topic => data } > CassandraHelper.saveDataToCassandra(topic, filteredRdd) > } > updateOffsetsinZk(rdd) > } > > } > > On Fri, Oct 2, 2015 at 11:58 PM, Gerard Maas <gerard.m...@gmail.com> > wrote: > >> So

Re: Kafka Direct Stream

2015-10-02 Thread Gerard Maas
Something like this? I'm making the assumption that your topic name equals your keyspace for this filtering example. dstream.foreachRDD{rdd => val topics = rdd.map(_._1).distinct.collect topics.foreach{topic => val filteredRdd = rdd.collect{case (t, data) if t == topic => data}.

Re: unoin streams not working for streams > 3

2015-09-14 Thread Gerard Maas
How many cores are you assigning to your spark streaming job? On Mon, Sep 14, 2015 at 10:33 PM, Василец Дмитрий wrote: > hello > I have 4 streams from kafka and streaming is not working, > without any errors or logs, > but with 3 streams everything is perfect. > make sense

Re: [streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Gerard Maas
You need to take into consideration 'where' things are executing. The closure of the 'forEachRDD' executes in the driver. Therefore, the log statements printed during the execution of that part will be found in the driver logs. In contrast, the foreachPartition closure executes on the worker

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Gerard Maas
(removing dev from the to: as not relevant) it would be good to see some sample data and the cassandra schema to have a more concrete idea of the problem space. Some thoughts: reduceByKey could still be used to 'pick' one element. example of arbitrarily choosing the first one: reduceByKey{case
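Completing the reduceByKey hint; 'keyedRdd' stands in for the data keyed by its Cassandra primary key:

  val deduped = keyedRdd.reduceByKey { case (first, _) => first }   // arbitrarily keep one element per key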

Re: Spark Streaming

2015-07-29 Thread Gerard Maas
A side question: Any reason why you're using window(Seconds(10), Seconds(10)) instead of new StreamingContext(conf, Seconds(10)) ? Making the micro-batch interval 10 seconds instead of 1 will provide you the same 10-second window with less complexity. Of course, this might just be a test for the

Re:

2015-07-07 Thread Gerard Maas
Anand, AFAIK, you will need to change two settings: spark.streaming.unpersist = false // in order for SStreaming to not drop the raw RDD data spark.cleaner.ttl = some reasonable value in seconds Also be aware that the lineage of your union RDD will grow with each batch interval. You will need
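The two settings as a SparkConf sketch (the values shown are examples, not recommendations):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("streaming-union")
    .set("spark.streaming.unpersist", "false")   // keep the raw RDD data around
    .set("spark.cleaner.ttl", "3600")            // seconds; pick a value larger than your retention needs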

Re:

2015-07-07 Thread Gerard Maas
to the Batch RDD and then using dstream.window() equal to that frequency Can you also elaborate why you consider the dstream.window approach more “reliable” *From:* Gerard Maas [mailto:gerard.m...@gmail.com] *Sent:* Tuesday, July 7, 2015 12:56 PM *To:* Anand Nalya *Cc:* spark users

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Gerard Maas
Are you sharing the SimpleDateFormat instance? This looks a lot more like the non-thread-safe behaviour of SimpleDateFormat (that has claimed many unsuspecting victims over the years), than any 'ugly' Spark Streaming. Try writing the timestamps in millis to Kafka and compare. -kr, Gerard. On
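Two sketches of the safer alternatives: a thread-local formatter, or simply shipping epoch millis:

  import java.text.SimpleDateFormat
  import java.util.Date

  // one formatter per thread instead of one shared, non-thread-safe instance
  val fmt = new ThreadLocal[SimpleDateFormat] {
    override def initialValue() = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  }
  def format(millis: Long): String = fmt.get.format(new Date(millis))

  // or side-step formatting entirely and write epoch millis to Kafka
  val payload = System.currentTimeMillis().toString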

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-12 Thread Gerard Maas
Would using the socketTextStream and `yourApp | nc -lk port` work?? Not sure how resilient the socket receiver is though. I've been playing with it for a little demo and I don't understand yet its reconnection behavior. Although I would think that putting some elastic buffer in between would be a

Re: Cassandra Submit

2015-06-08 Thread Gerard Maas
? = ip address of your cassandra host On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com wrote: Hi , How can I find spark.cassandra.connection.host? And what should I change ? Should I change cassandra.yaml ? Error says me *Exception in thread main java.io.IOException:

Re: [Streaming] Configure executor logging on Mesos

2015-05-29 Thread Gerard Maas
wrote: -- Forwarded message -- From: Tim Chen t...@mesosphere.io Date: Thu, May 28, 2015 at 10:49 AM Subject: Re: [Streaming] Configure executor logging on Mesos To: Gerard Maas gerard.m...@gmail.com Hi Gerard, The log line you referred to is not Spark logging but Mesos own

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Gerard Maas
Hi, tl;dr At the moment (with a BIG disclaimer *) elastic scaling of spark streaming processes is not supported. *Longer version.* I assume that you are talking about Spark Streaming as the discussion is about handling Kafka streaming data. Then you have two things to consider: the Streaming

[Streaming] Configure executor logging on Mesos

2015-05-28 Thread Gerard Maas
Hi, I'm trying to control the verbosity of the logs on the Mesos executors with no luck so far. The default behaviour is INFO on stderr dump with an unbounded growth that gets too big at some point. I noticed that when the executor is instantiated, it locates a default log configuration in the

Re: Spark streaming closes with Cassandra Conector

2015-05-10 Thread Gerard Maas
is the phrase “error during Transport Initialization” – so all these stuff points out in the direction of Infrastructure or Configuration issues – check you Casandra service and how you connect to it etc mate *From:* Gerard Maas [mailto:gerard.m...@gmail.com] *Sent:* Sunday, May 10, 2015 11:33

Re: Spark streaming closes with Cassandra Conector

2015-05-10 Thread Gerard Maas
) at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) ... 3 more 15/05/10 12:20:08 ERROR ControlConnection: [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds Thanks! 2015-05-10 0:58 GMT+02:00 Gerard Maas gerard.m...@gmail.com: Hola

Re: Map one RDD into two RDD

2015-05-07 Thread Gerard Maas
Hi Bill, I just found it weird that one would use parallel threads to 'filter', as filter is lazy in Spark, and multithreading wouldn't have any effect unless the action triggering the execution of the lineage containing such filter is executed on a separate thread. One must have very specific

Re: Map one RDD into two RDD

2015-05-07 Thread Gerard Maas
Hi Bill, Could you show a snippet of code to illustrate your choice? -Gerard. On Thu, May 7, 2015 at 5:55 PM, Bill Q bill.q@gmail.com wrote: Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far, it seems to be

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-04 Thread Gerard Maas
I'm not familiar with the Solr API but provided that ' SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to: SolrIndexerDriver.solrInputDocumentList.add(elem) is happening on different singleton instances of the SolrIndexerDriver on

Re: How to do dispatching in Streaming?

2015-04-16 Thread Gerard Maas
From experience, I'd recommend using the dstream.foreachRDD method and doing the filtering within that context. Extending the example of TD, something like this: dstream.foreachRDD { rdd => rdd.cache() messageType.foreach (msgTyp => val selection = rdd.filter(msgTyp.match(_))
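A reconstruction of that dispatching pattern as a complete block; 'messageTypes' and its matches/dispatch methods are placeholders for the job's own types and sinks:

  dstream.foreachRDD { rdd =>
    rdd.cache()                                    // the RDD is reused once per message type
    messageTypes.foreach { msgType =>
      val selection = rdd.filter(msgType.matches)  // keep only the records of this type
      msgType.dispatch(selection)                  // e.g. write to the type-specific destination
    }
    rdd.unpersist()
  }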

Re: Writing Spark Streaming Programs

2015-03-19 Thread Gerard Maas
Try writing this Spark Streaming idiom in Java and you'll choose Scala soon enough: dstream.foreachRDD{rdd => rdd.foreachPartition( partition => ... ) } When deciding between Java and Scala for Spark, IMHO Scala has the upper hand. If you're concerned with readability, have a look at the Scala
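The same idiom spelled out with a per-partition resource; 'openWriter' is a hypothetical sink factory:

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val writer = openWriter()                // one connection/writer per partition, created on the executor
      try partition.foreach(writer.write)
      finally writer.close()
    }
  }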

Re: Partitioning

2015-03-13 Thread Gerard Maas
In spark-streaming, the consumers will fetch data and put it into 'blocks'. Each block becomes a partition of the rdd generated during that batch interval. The size of each is block controlled by the conf: 'spark.streaming.blockInterval'. That is, the amount of data the consumer can collect in
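The resulting arithmetic for receiver-based streaming as a small worked example, using the 200 ms default for spark.streaming.blockInterval:

  val batchIntervalMs = 2000                        // 2-second batches
  val blockIntervalMs = 200                         // spark.streaming.blockInterval default
  val receivers       = 2
  val partitionsPerBatch = (batchIntervalMs / blockIntervalMs) * receivers   // = 20 partitions per batch RDD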

Re: Unable to saveToCassandra while cassandraTable works fine

2015-03-12 Thread Gerard Maas
This: java.lang.NoSuchMethodError almost always indicates a version conflict somewhere. It looks like you are using Spark 1.1.1 with the cassandra-spark connector 1.2.0. Try aligning those. Those metrics were introduced recently in the 1.2.0 branch of the cassandra connector. Either upgrade your

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Gerard Maas
+1 for TypeSafe config Our practice is to include all spark properties under a 'spark' entry in the config file alongside job-specific configuration: A config file would look like: spark { master = cleaner.ttl = 123456 ... } job { context { src = foo action =
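A sketch of loading such a file with Typesafe Config and copying the 'spark' section into a SparkConf; the key names follow the layout above:

  import com.typesafe.config.ConfigFactory
  import org.apache.spark.SparkConf
  import scala.collection.JavaConverters._

  val config    = ConfigFactory.load()              // application.conf or -Dconfig.file=...
  val sparkConf = new SparkConf()
  config.getConfig("spark").entrySet.asScala.foreach { e =>
    sparkConf.set(s"spark.${e.getKey}", e.getValue.unwrapped.toString)
  }
  val src = config.getString("job.context.src")     // job-specific settings stay under 'job'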

Re: Writing RDD to a csv file

2015-02-03 Thread Gerard Maas
this is more of a Scala question, so next time you might want to address a Scala forum instead, e.g. http://stackoverflow.com/questions/tagged/scala val optArrStr: Option[Array[String]] = ??? optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or whatever default value you have for this.

Re: Spark (Streaming?) holding on to Mesos resources

2015-01-29 Thread Gerard Maas
://issues.apache.org/jira/browse/MESOS-1688 On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs

Re: how to split key from RDD for compute UV

2015-01-27 Thread Gerard Maas
Hi, Did you try asking this on StackOverflow? http://stackoverflow.com/questions/tagged/apache-spark I'd also suggest adding some sample data to help others understanding your logic. -kr, Gerard. On Tue, Jan 27, 2015 at 1:14 PM, 老赵 laozh...@sina.cn wrote: Hello All, I am writing a simple

Spark (Streaming?) holding on to Mesos resources

2015-01-26 Thread Gerard Maas
Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the Mesos cluster. For example: This is a job that has spark.cores.max = 4 and spark.executor.memory=3g

Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
(looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources and not releasing them, resulting in resource starvation to all jobs running on the

Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-26 Thread Gerard Maas
Jan. 2015 22:28, Gerard Maas gerard.m...@gmail.com wrote: (looks like the list didn't like a HTML table on the previous email. My excuses for any duplicates) Hi, We are observing with certain regularity that our Spark jobs, as Mesos framework, are hoarding resources

Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Gerard Maas
+1 On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That sounds good to me. Shall I open a JIRA / PR about updating the site community page? On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell patr...@databricks.com wrote: Hey Nick, So I think what we can

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
and post the code (if possible). In a nutshell, your processing time > batch interval, resulting in an ever-increasing delay that will end up in a crash. 3 secs to process 14 messages looks like a lot. Curious what the job logic is. -kr, Gerard. On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Gerard Maas
I have been contributing to SO for a while now. Here are a few observations I'd like to contribute to the discussion: The level of questions on SO is often more entry-level. Harder questions (that require expertise in a certain area) remain unanswered for a while. Same questions here on the

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
://www.virdata.com/tuning-spark/#Partitions) -kr, Gerard. On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas gerard.m...@gmail.com wrote: So the system has gone from 7msg in 4.961 secs (median) to 106msgs in 4,761 seconds. I think there's evidence that setup costs are quite high in this case

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 secs. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
line, that is to fix the number of streams and change the input and output channels dynamically. But could not make it work (seems that the receiver is not allowing any change in the config after it started). thanks, On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com wrote

Re: dynamically change receiver for a spark stream

2015-01-21 Thread Gerard Maas
One possible workaround could be to orchestrate launch/stopping of Streaming jobs on demand as long as the number of jobs/streams stay within the boundaries of the resources (cores) you've available. e.g. if you're using Mesos, Marathon offers a REST interface to manage job lifecycle. You will

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2015-01-21 Thread Gerard Maas
Hi Mukesh, How are you creating your receivers? Could you post the (relevant) code? -kr, Gerard. On Wed, Jan 21, 2015 at 9:42 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hello Guys, I've re partitioned my kafkaStream so that it gets evenly distributed among the executors and the results

Re: Registering custom metrics

2015-01-08 Thread Gerard Maas
) bytes } .saveAsTextFile(text) Is there a way to achieve this with the MetricSystem? ᐧ On Mon, Jan 5, 2015 at 10:24 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, Yes, I managed to create a register custom metrics by creating an implementation

Re: Join RDDs with DStreams

2015-01-08 Thread Gerard Maas
You are looking for dstream.transform(rdd => rdd.op(otherRdd)) The docs contain an example on how to use transform. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams -kr, Gerard. On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis asimja...@gmail.com
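A small sketch of the transform-based join; 'keyedDStream' is assumed to be a DStream of key/value pairs and the static RDD is a placeholder lookup table:

  import org.apache.spark.rdd.RDD

  val staticRdd: RDD[(String, Int)] = ssc.sparkContext.parallelize(Seq("a" -> 1, "b" -> 2))

  val joined = keyedDStream.transform { rdd =>
    rdd.join(staticRdd)                  // the join runs once per batch interval
  }
  joined.print()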

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Gerard Maas
Hi, Could you add the code where you create the Kafka consumer? -kr, Gerard. On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote: Hi Mukesh, If my understanding is correct, each Stream only has a single Receiver. So, if you have each receiver consuming 9 partitions, you

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread Gerard Maas
), kafkaStreams); On Wed, Jan 7, 2015 at 8:21 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, Could you add the code where you create the Kafka consumer? -kr, Gerard. On Wed, Jan 7, 2015 at 3:43 PM, francois.garil...@typesafe.com wrote: Hi Mukesh, If my understanding is correct, each

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
Hi, I'm not sure what you are asking: Whether we can use spouts and bolts in Spark (= no) or whether we can do streaming in Spark: http://spark.apache.org/docs/latest/streaming-programming-guide.html -kr, Gerard. On Tue, Dec 23, 2014 at 9:03 AM, Ajay ajay.ga...@gmail.com wrote: Hi, Can

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
Streaming). The idea is to use Spark as a in-memory computation engine and static data coming from Cassandra/Hbase and streaming data from Storm. Thanks Ajay On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, I'm not sure what you are asking: Whether we

Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
Hi, After facing issues with the performance of some of our Spark Streaming jobs, we invested quite some effort figuring out the factors that affect the performance characteristics of a Streaming job. We defined an empirical model that helps us reason about Streaming jobs and applied it to tune

Re: Tuning Spark Streaming jobs

2014-12-22 Thread Gerard Maas
mode? I'm making changes to the spark mesos scheduler and I think we can propose a best way to achieve what you mentioned. Tim Sent from my iPhone On Dec 22, 2014, at 8:33 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi, After facing issues with the performance of some of our Spark

Re: Does Spark 1.2.0 support Scala 2.11?

2014-12-19 Thread Gerard Maas
Check out the 'compiling for Scala 2.11' instructions: http://spark.apache.org/docs/1.2.0/building-spark.html#building-for-scala-211 -kr, Gerard. On Fri, Dec 19, 2014 at 12:00 PM, Jonathan Chayat jonatha...@supersonic.com wrote: The following ticket:

Re: Scala Lazy values and partitions

2014-12-19 Thread Gerard Maas
It will be instantiated once per VM, which translates to once per executor. -kr, Gerard. On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote: Hi Guys, Are scala lazy values instantiated once per executor, or once per partition? For example, if I have: object Something =
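An illustration of the once-per-executor-JVM behaviour; the resource is a stand-in for something expensive and non-serializable:

  object Something {
    lazy val heavyResource = {              // evaluated on first use in each executor JVM
      println(s"init on ${java.net.InetAddress.getLocalHost.getHostName}")
      new java.util.Random()
    }
  }

  rdd.foreachPartition { _ =>
    Something.heavyResource                 // at most one initialization per executor, not per partition
  }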

Re: spark streaming kafa best practices ?

2014-12-17 Thread Gerard Maas
Patrick, I was wondering why one would choose rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. -kr, Gerard. On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote: The second choice is better. Once you call collect() you are pulling all of the

Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Gerard Maas
You can create a DStream that contains the count, transforming the grouped windowed RDD, like this: val errorCount = grouping.map{case (k,v) => v.size } If you need to preserve the key: val errorCount = grouping.map{case (k,v) => (k,v.size) } or if you don't care about the content of the

Re: Data Loss - Spark streaming

2014-12-16 Thread Gerard Maas
Hi Jeniba, The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the on-going work in Spark Streaming to fix HA. https://www.youtube.com/watch?v=jcJq3ZalXD8 -kr, Gerard. On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson
