Re: Communication between Driver and Executors

2014-11-17 Thread Tobias Pfeiffer
Hi, so I didn't manage to get the Broadcast variable with a new value distributed to my executors in YARN mode. In local mode it worked fine, but when running on YARN either nothing happened (when unpersist() was called on the driver) or I got a TimeoutException (when called on the executor). I

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Tobias Pfeiffer
Hi, On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, then we need another trick. let's have an *implicit lazy var connection/context* around our code. And setup() will trigger the eval and initialization. Due to lazy evaluation, I think having
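A minimal sketch of the lazy-initialization idea discussed in this thread, with a hypothetical Connection class standing in for the real connection/context (names and body are illustrative, not from the thread):

    import org.apache.spark.rdd.RDD

    // Hypothetical stand-in for the real connection/context.
    class Connection {
      def write(s: String): Unit = println(s)
    }

    def withConnection(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { iter =>
        // lazy: the "setup" runs on the executor, once per partition,
        // and only if the partition actually yields data
        lazy val conn = new Connection()
        iter.map { item =>
          conn.write(item)
          item
        }
      }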

Re: Communication between Driver and Executors

2014-11-16 Thread Tobias Pfeiffer
Hi, On Fri, Nov 14, 2014 at 3:20 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: I wonder if SparkConf is dynamically updated on all worker nodes or only during initialization. It can be used to piggyback information. Otherwise I guess you are stuck with Broadcast. Primarily I have had

Re: Communication between Driver and Executors

2014-11-16 Thread Tobias Pfeiffer
Hi again, On Mon, Nov 17, 2014 at 8:16 AM, Tobias Pfeiffer t...@preferred.jp wrote: I have been trying to mis-use broadcast as in - create a class with a boolean var, set to true - query this boolean on the executors as a prerequisite to process the next item - when I want to shutdown, I
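For reference, a rough sketch of the pattern described above; per the rest of the thread it behaves as expected in local mode but not reliably on YARN. Class and job names are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    class StopFlag extends Serializable {
      @volatile var keepRunning = true
    }

    object BroadcastFlagSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("flag-sketch").setMaster("local[2]"))
        val flag = sc.broadcast(new StopFlag)

        // executors check the flag before processing each item
        val processed = sc.parallelize(1 to 1000).filter(_ => flag.value.keepRunning)
        println(processed.count())

        // the driver flips the flag on shutdown; getting executors to actually
        // observe the new value (unpersist / re-broadcast) is the part that failed
        flag.value.keepRunning = false
        sc.stop()
      }
    }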

StreamingContext does not stop

2014-11-13 Thread Tobias Pfeiffer
Hi, I am processing a bunch of HDFS data using the StreamingContext (Spark 1.1.0) which means that all files that exist in the directory at start() time are processed in the first batch. Now when I try to stop this stream processing using `streamingContext.stop(false, false)` (that is, even with

Re: StreamingContext does not stop

2014-11-13 Thread Tobias Pfeiffer
Hi, I guess I found part of the issue: I said dstream.transform(rdd => { rdd.foreachPartition(...); rdd }) instead of dstream.transform(rdd => { rdd.mapPartitions(...) }), that's why stop() would not stop the processing. Now with the new version a non-graceful shutdown works in the sense that
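In other words, a sketch of the corrected structure (the per-partition work is a placeholder): the transform() body returns the result of mapPartitions() instead of running foreachPartition() as a side effect.

    import org.apache.spark.streaming.dstream.DStream

    def process(dstream: DStream[String]): DStream[String] =
      dstream.transform { rdd =>
        // mapPartitions is a lazy transformation that stays inside the streaming
        // lineage, so stop() can take effect; foreachPartition is an action and
        // runs eagerly for every batch when placed inside transform()
        rdd.mapPartitions { iter =>
          iter.map(_.trim)   // placeholder per-partition work
        }
      }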

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Tobias Pfeiffer
Bill, However, when I am currently using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the code. Do you see any suspicious messages in the log output? Tobias

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread Tobias Pfeiffer
Hi, On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy hmx...@gmail.com wrote: If I run bin/spark-shell without connecting a master, it can access a hdfs file on a remote cluster with kerberos authentication. [...] However, if I start the master and slave on the same host and using bin/spark-shell

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Tobias Pfeiffer
Hi, also there is Spindle https://github.com/adobe-research/spindle which was introduced on this list some time ago. I haven't looked into it deeply, but you might gain some valuable insights from their architecture, they are also using Spark to fulfill requests coming from the web. Tobias

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tobias Pfeiffer
Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at: http://twitter4j.org/javadoc/twitter4j/Status.html#getLang-- What version of twitter4j does Spark Streaming use?

Re: convert List[String] to DStream

2014-11-10 Thread Tobias Pfeiffer
Josh, On Tue, Nov 11, 2014 at 7:43 AM, Josh J joshjd...@gmail.com wrote: I have some data generated by some utilities that returns the results as a List[String]. I would like to join this with a DStream of strings. How can I do this? I tried the following but get Scala compiler errors: val

Re: Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Tobias Pfeiffer
Akshat, On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote: Does there exist a way to serialize Row objects to JSON? I can't think of any other way than the one you proposed. A Row is more or less an Array[Object], so you need to read JSON key and data type from the

Re: netty on classpath when using spark-submit

2014-11-09 Thread Tobias Pfeiffer
Hi, On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote: On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote: From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst Thank you very much, I didn't know

Re: netty on classpath when using spark-submit

2014-11-04 Thread Tobias Pfeiffer
Markus, thanks for your help! On Tue, Nov 4, 2014 at 8:33 PM, M. Dale medal...@yahoo.com.invalid wrote: Tobias, From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst Thank you very much, I didn't

Re: hadoop_conf_dir when running spark on yarn

2014-11-03 Thread Tobias Pfeiffer
Hi, On Mon, Nov 3, 2014 at 1:29 PM, Amey Chaugule ambr...@gmail.com wrote: I thought that only applied when you're trying to run a job using spark-submit or in the shell... And how are you starting your Yarn job, if not via spark-submit? Tobias

Re: different behaviour of the same code

2014-11-03 Thread Tobias Pfeiffer
Hi, On Fri, Oct 31, 2014 at 4:31 PM, lieyan lie...@yahoo.com wrote: The code are here: LogReg.scala http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/LogReg.scala Then I click the Run button of the IDEA, and I get the following error message errlog.txt

netty on classpath when using spark-submit

2014-11-03 Thread Tobias Pfeiffer
Hi, I tried hard to get a version of netty into my jar file created with sbt assembly that works with all my libraries. Now I managed that and was really happy, but it seems like spark-submit puts an older version of netty on the classpath when submitting to a cluster, such that my code ends up

Re: NonSerializable Exception in foreachRDD

2014-10-30 Thread Tobias Pfeiffer
Harold, just mentioning it in case you run into it: If you are in a separate thread, there are apparently stricter limits to what you can and cannot serialize: val someVal future { // be very careful with defining RDD operations using someVal here val myLocalVal = someVal // use myLocalVal
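A small sketch of that advice, assuming the surrounding code launches Spark jobs from a Scala Future; the RDD operation itself is a placeholder:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.rdd.RDD

    def countAboveAsync(rdd: RDD[Int], someVal: Int): Future[Long] = Future {
      // copy the outer value into a local val and reference only the local copy
      // inside the RDD operation, so the closure captures as little as possible
      val myLocalVal = someVal
      rdd.filter(_ > myLocalVal).count()
    }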

spark-submit results in NoClassDefFoundError

2014-10-29 Thread Tobias Pfeiffer
Hi, I am trying to get my Spark application to run on YARN and by now I have managed to build a fat jar as described on http://markmail.org/message/c6no2nyaqjdujnkq (which is the only really usable manual on how to get such a jar file). My code runs fine using sbt test and sbt run, but when

Re: spark-submit results in NoClassDefFoundError

2014-10-29 Thread Tobias Pfeiffer
Hi again, On Thu, Oct 30, 2014 at 11:50 AM, Tobias Pfeiffer t...@preferred.jp wrote: Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NoClassDefFoundError: com/typesafe/scalalogging/slf4j/Logger It turned out scalalogging

Re: Processing order in Spark

2014-10-13 Thread Tobias Pfeiffer
Sean, thanks, I didn't know about repartitionAndSortWithinPartitions, that seems very helpful! Tobias

Combined HDFS/Kafka Processing

2014-10-09 Thread Tobias Pfeiffer
Hi, I have a setting where data arrives in Kafka and is stored to HDFS from there (maybe using Camus or Flume). I want to write a Spark Streaming app where - first all files in that HDFS directory are processed, - and then the stream from Kafka is processed, starting with the first item

Re: Record-at-a-time model for Spark Streaming

2014-10-07 Thread Tobias Pfeiffer
Jianneng, On Wed, Oct 8, 2014 at 8:44 AM, Jianneng Li jiannen...@berkeley.edu wrote: I understand that Spark Streaming uses micro-batches to implement streaming, while traditional streaming systems use the record-at-a-time processing model. The performance benefit of the former is throughput,

Re: dynamic sliding window duration

2014-10-07 Thread Tobias Pfeiffer
Hi, On Wed, Oct 8, 2014 at 4:50 AM, Josh J joshjd...@gmail.com wrote: I have a source which fluctuates in the frequency of streaming tuples. I would like to process certain batch counts, rather than batch window durations. Is it possible to either 1) define batch window sizes Cf.

Re: Using GraphX with Spark Streaming?

2014-10-05 Thread Tobias Pfeiffer
Arko, On Sat, Oct 4, 2014 at 1:40 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Apologies if this is a stupid question but I am trying to understand why this can or cannot be done. As far as I understand that streaming algorithms need to be different from batch algorithms as

All-time stream re-processing

2014-09-24 Thread Tobias Pfeiffer
Hi, I have a setup (in mind) where data is written to Kafka and this data is persisted in HDFS (e.g., using camus) so that I have an all-time archive of all stream data ever received. Now I want to process that all-time archive and when I am done with that, continue with the live stream, using

Re: All-time stream re-processing

2014-09-24 Thread Tobias Pfeiffer
Hi, On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: So you have a single Kafka topic which has very high retention period ( that decides the storage capacity of a given Kafka topic) and you want to process all historical data first using Camus and

Re: rsync problem

2014-09-24 Thread Tobias Pfeiffer
. But, there is no job execution taking place. After some time, one by one, each goes down and the cluster shuts down. On Fri, Sep 19, 2014 at 2:15 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek kartheek.m...@gmail.com wrote: , * you have copied

Re: how long does it take executing ./sbt/sbt assembly

2014-09-23 Thread Tobias Pfeiffer
Hi, http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-building-project-with-sbt-assembly-is-extremely-slow-td13152.html -- Maybe related to this? Tobias

Re: rsync problem

2014-09-19 Thread Tobias Pfeiffer
Hi, On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: This worked perfectly. But, I wanted to simultaneously rsync all the slaves. So, added the other slaves as following: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname

spark-submit: fire-and-forget mode?

2014-09-18 Thread Tobias Pfeiffer
Hi, I am wondering: Is it possible to run spark-submit in a mode where it will start an application on a YARN cluster (i.e., driver and executors run on the cluster) and then forget about it in the sense that the Spark application is completely independent from the host that ran the spark-submit

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Tobias Pfeiffer
Hi, thanks for everyone's replies! On Thu, Sep 18, 2014 at 7:37 AM, Sandy Ryza sandy.r...@cloudera.com wrote: YARN cluster mode should have the behavior you're looking for. The client process will stick around to report on things, but should be able to be killed without affecting the

Re: diamond dependency tree

2014-09-18 Thread Tobias Pfeiffer
Hi, On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen v...@paxata.com wrote: Is it possible to express a diamond DAG and have the leaf dependency evaluate only once? Well, strictly speaking your graph is not a tree, and also the meaning of leaf is not totally clear, I'd say. So say data

Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-11 Thread Tobias Pfeiffer
Hi, by now I understood maybe a bit better how spark-submit and YARN play together and how Spark driver and slaves play together on YARN. Now for my use case, as described on https://spark.apache.org/docs/latest/submitting-applications.html, I would probably have an end-user-facing gateway that

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Tobias Pfeiffer
Hi, On Fri, Sep 12, 2014 at 9:12 AM, Patrick Wendell pwend...@gmail.com wrote: I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! Great,

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-08 Thread Tobias Pfeiffer
Hi, On Mon, Sep 8, 2014 at 4:39 PM, Sean Owen so...@cloudera.com wrote: if (rdd.take(1).size == 1) { rdd foreachPartition { iterator => I was wondering: Since take() is an output operation, isn't it computed twice (once for the take(1), once during the
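For context, the pattern under discussion looks roughly like this (a sketch; the per-record side effect is a placeholder):

    import org.apache.spark.streaming.dstream.DStream

    def writeNonEmptyBatches(dstream: DStream[String]): Unit =
      dstream.foreachRDD { rdd =>
        // take(1) is itself an action, hence the question whether the RDD
        // ends up being (partially) computed twice
        if (rdd.take(1).size == 1) {
          rdd.foreachPartition { iterator =>
            iterator.foreach(println)   // placeholder for the real output operation
          }
        }
      }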

Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-08 Thread Tobias Pfeiffer
Hi, On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das tathagata.das1...@gmail.com wrote: In the current state of Spark Streaming, creating separate Java processes each having a streaming context is probably the best approach to dynamically adding and removing of input sources. All of these

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Ron, On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid wrote: I’m trying to figure out how I can run Spark Streaming like an API. The goal is to have a synchronous REST API that runs the spark data flow on YARN. I guess I *may* develop something similar in the

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi, On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote: I want to create a synchronous REST API that will process some data that is passed in as some request. I would imagine that the Spark Streaming Job on YARN is a long running job that waits on requests from

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi, On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote: So I guess where I was coming from was the assumption that starting up a new job to be listening on a particular queue topic could be done asynchronously. No, with the current state of Spark Streaming, all data

Re: Recursion

2014-09-07 Thread Tobias Pfeiffer
Hi, On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Does Spark support recursive calls? Can you give an example of which kind of recursion you would like to use? Tobias

Re: How to list all registered tables in a sql context?

2014-09-07 Thread Tobias Pfeiffer
Hi, On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Err... there's no such feature? The problem is that the SQLContext's `catalog` member is protected, so you can't access it from outside. If you subclass SQLContext, and make sure that `catalog` is always a
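A hedged sketch of a related workaround that avoids catalyst internals altogether: route all registrations through one helper and remember the names yourself. This is not the subclass-the-catalog approach from the thread, and it assumes the Spark 1.x SchemaRDD.registerAsTable() API:

    import scala.collection.mutable
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{SQLContext, SchemaRDD}

    class TrackingSQLContext(sc: SparkContext) extends SQLContext(sc) {
      private val names = mutable.Set.empty[String]

      // register the table and remember its name for later listing
      def registerTracked(rdd: SchemaRDD, tableName: String): Unit = {
        names += tableName
        rdd.registerAsTable(tableName)
      }

      def registeredTables: Seq[String] = names.toSeq
    }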

Re: advice on spark input development - python or scala?

2014-09-04 Thread Tobias Pfeiffer
Hi, On Thu, Sep 4, 2014 at 11:49 PM, Johnny Kelsey jkkel...@semblent.com wrote: As a concrete example, we have a python class (part of a fairly large class library) which, as part of its constructor, also creates a record of itself in the cassandra key space. So we get an initialised class a

Re: RDDs

2014-09-03 Thread Tobias Pfeiffer
Hello, On Wed, Sep 3, 2014 at 6:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Can someone tell me what kind of operations can be performed on a replicated rdd?? What are the use-cases of a replicated rdd. I suggest you read

Multi-tenancy for Spark (Streaming) Applications

2014-09-03 Thread Tobias Pfeiffer
Hi, I am not sure if multi-tenancy is the right word, but I am thinking about a Spark application where multiple users can, say, log into some web interface and specify a data processing pipeline with streaming source, processing steps, and output. Now as far as I know, there can be only one

Re: Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread Tobias Pfeiffer
Hi, On Wed, Sep 3, 2014 at 6:54 AM, salemi alireza.sal...@udo.edu wrote: I was able to calculate the individual measures separately and now I have to merge them, and Spark Streaming doesn't support outer join yet. Can't you assign some dummy key (e.g., index) before your processing and then

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread Tobias Pfeiffer
Hi, On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 praveshjain1...@gmail.com wrote: If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call *within* the foreachRDD(...) block, or -- as you experienced -- the table name will be
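A minimal sketch of that structure, using the Spark 1.x SchemaRDD API and a made-up Record case class:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    case class Record(word: String)

    def queryEachBatch(lines: DStream[String], sqlContext: SQLContext): Unit = {
      import sqlContext._   // brings the createSchemaRDD implicit into scope
      lines.foreachRDD { rdd =>
        // both the registration and the query happen per batch, inside foreachRDD
        rdd.map(Record(_)).registerAsTable("records")
        sql("SELECT COUNT(*) FROM records").collect().foreach(println)
      }
    }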

Re: Trying to run SparkSQL over Spark Streaming

2014-08-25 Thread Tobias Pfeiffer
Hi again, On Tue, Aug 26, 2014 at 10:13 AM, Tobias Pfeiffer t...@preferred.jp wrote: On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 praveshjain1...@gmail.com wrote: If you want to issue an SQL statement on streaming data, you must have both the registerAsTable() and the sql() call

Re: Spark Stream + HDFS Append

2014-08-24 Thread Tobias Pfeiffer
Hi, On Mon, Aug 25, 2014 at 9:56 AM, Dean Chen deanch...@gmail.com wrote: We are using HDFS for log storage where logs are flushed to HDFS every minute, with a new file created for each hour. We would like to consume these logs using spark streaming. The docs state that new HDFS will be

Re: multiple windows from the same DStream ?

2014-08-24 Thread Tobias Pfeiffer
Hi, computations are triggered by an output operation. No output operation, no computation. Therefore in your code example, On Thu, Aug 21, 2014 at 11:58 PM, Josh J joshjd...@gmail.com wrote: JavaPairReceiverInputDStream<String, String> messages =

Re: Trying to run SparkSQL over Spark Streaming

2014-08-21 Thread Tobias Pfeiffer
Hi, On Thu, Aug 21, 2014 at 3:11 PM, praveshjain1991 praveshjain1...@gmail.com wrote: The part that you mentioned, the variable `result` is of type DStream[Row]. That is, the meta-information from the SchemaRDD is lost and, from what I understand, there is then no way to learn about the

Re: Trying to run SparkSQL over Spark Streaming

2014-08-20 Thread Tobias Pfeiffer
Hi, On Thu, Aug 21, 2014 at 2:19 PM, praveshjain1991 praveshjain1...@gmail.com wrote: Using Spark SQL with batch data works fine so I'm thinking it has to do with how I'm calling streamingcontext.start(). Any ideas what is the issue? Here is the code: Please have a look at

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Tobias Pfeiffer
Hi, On Tue, Aug 19, 2014 at 7:01 PM, Patrick McGloin mcgloin.patr...@gmail.com wrote: I think the type of the data contained in your RDD needs to be a known case class and not abstract for createSchemaRDD. This makes sense when you think it needs to know about the fields in the object to

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Tobias Pfeiffer
Hi Wei, On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com wrote: Since our application cannot tolerate losing customer data, I am wondering what is the best way for us to address this issue. 1) We are thinking writing application specific logic to address the data loss. To

Re: [Spar Streaming] How can we use consecutive data points as the features ?

2014-08-17 Thread Tobias Pfeiffer
Hi, On Sat, Aug 16, 2014 at 3:29 AM, Yan Fang yanfang...@gmail.com wrote: If all consecutive data points are in one batch, it's not complicated except that the order of data points is not guaranteed in the batch and so I have to use the timestamp in the data point to reach my goal. However,

Re: spark streaming : what is the best way to make a driver highly available

2014-08-13 Thread Tobias Pfeiffer
Hi, On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu wrote: what is the best way to make a spark streaming driver highly available. I would also be interested in that. In particular for Streaming applications where the Spark driver is running for a long time, this might be

Fwd: Task closures and synchronization

2014-08-12 Thread Tobias Pfeiffer
Uh, for some reason I don't seem to automatically reply to the list any more. Here is again my message to Tom. -- Forwarded message -- Tom, On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek minnesota...@gmail.com wrote: This is a back-to-basics question. How do we know when Spark

Re: Spark Streaming example on your mesos cluster

2014-08-12 Thread Tobias Pfeiffer
Hi, On Wed, Aug 13, 2014 at 4:24 AM, Zia Syed xia.s...@gmail.com wrote: I don't particularly see any errors in my logs, either on the console or on the slaves. I see the slave downloads the spark-1.0.2-bin-hadoop1.tgz file and unpacks it as well. Mesos Master shows quite a lot of Tasks created and

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Tobias Pfeiffer
Hi, On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers gwenhael.pasqui...@ericsson.com wrote: We intend to apply other operations on the data later in the same Spark context, but our first step is to archive it. Our goal is something like this: Step 1: consume Kafka; Step 2: archive to

Fwd: Trying to make sense of the actual executed code

2014-08-06 Thread Tobias Pfeiffer
(Forgot to include the mailing list in my reply. Here it is.) Hi, On Thu, Aug 7, 2014 at 7:55 AM, Tom thubregt...@gmail.com wrote: When I look at the output, I see that there are several stages, and several tasks per stage. The tasks have a TID, I do not see such a thing for a stage. They

Re: How true is this about spark streaming?

2014-07-29 Thread Tobias Pfeiffer
Hi, that quoted statement doesn't make too much sense for me, either. Maybe if you had a link for us that shows the context (Google doesn't reveal anything but this conversation), we could evaluate that statement better. Tobias On Tue, Jul 29, 2014 at 5:53 PM, Sean Owen so...@cloudera.com

Re: Spark as a application library vs infra

2014-07-27 Thread Tobias Pfeiffer
Mayur, I don't know if I exactly understand the context of what you are asking, but let me just mention issues I had with deploying. * As my application is a streaming application, it doesn't read any files from disk, therefore I have no Hadoop/HDFS in place and there is no need for it,

Re: Get Spark Streaming timestamp

2014-07-23 Thread Tobias Pfeiffer
Bill, Spark Streaming's DStream provides overloaded methods for transform() and foreachRDD() that allow you to access the timestamp of a batch: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream I think the timestamp is the end of the batch, not
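For example (a sketch), the foreachRDD() overload that receives the batch Time:

    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    def logBatchTimes(dstream: DStream[String]): Unit =
      dstream.foreachRDD { (rdd, time: Time) =>
        // `time` marks the batch boundary; see the caveat above about whether
        // it denotes the start or the end of the batch
        println("batch at " + time.milliseconds + " ms: " + rdd.count() + " records")
      }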

Re: How to do an interactive Spark SQL

2014-07-22 Thread Tobias Pfeiffer
Hi, as far as I know, after the Streaming Context has started, the processing pipeline (e.g., filter.map.join.filter) cannot be changed. As your SQL statement is transformed into RDD operations when the Streaming Context starts, I think there is no way to change the statement that is executed on

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Tobias Pfeiffer
On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, are you saying, after repartition(400), you have 400 partitions on one host and the other hosts receive nothing of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com wrote

Re: Distribute data from Kafka evenly on cluster

2014-07-18 Thread Tobias Pfeiffer
that? On Fri, Jul 4, 2014 at 11:11 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, unfortunately, when I go the above approach, I run into this problem: http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccabtfevyxvtaqvnmvwmh7yscfgxpw5kmrnw_gnq72cy4oa1b...@mail.gmail.com%3E

Re: Include permalinks in mail footer

2014-07-17 Thread Tobias Pfeiffer
On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I often find myself wanting to reference one thread from another, or from a JIRA issue. Right now I have to google the thread subject and find the link that way. +1

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far Kafka Dstream can read on topic-partition (via offset for example). By setting this to a small number

Re: can't print DStream after reduce

2014-07-15 Thread Tobias Pfeiffer
Hi, thanks for creating the issue. It feels like in the last week, more or less half of the questions about Spark Streaming were rooted in setting the master to local ;-) Tobias On Wed, Jul 16, 2014 at 11:03 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Aah, right, copied from the wrong

Re: Announcing Spark 1.0.1

2014-07-14 Thread Tobias Pfeiffer
Hi, congratulations on the release! I'm always pleased to see how features pop up in new Spark versions that I had added for myself in a very hackish way before (such as JSON support for Spark SQL). I am wondering if there is any good way to learn early about what is going to be in upcoming

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-13 Thread Tobias Pfeiffer
Hi, I experienced exactly the same problems when using SparkContext with a local[1] master specification: one thread is used for receiving data and the remaining ones for processing, so with only a single thread available, no processing will take place. Once you shut down the connection, the
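A minimal sketch for reference; the point is simply to give the local master at least two threads (host, port, and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SocketStreamSketch {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for the socket receiver, one for processing
        val conf = new SparkConf().setAppName("socket-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.socketTextStream("localhost", 9999).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }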

Re: Some question about SQL and streaming

2014-07-10 Thread Tobias Pfeiffer
Hi, I think it would be great if we could do the string parsing only once and then just apply the transformation for each interval (reducing the processing overhead for short intervals). Also, one issue with the approach above is that transform() has the following signature: def

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
:26 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N

Re: Some question about SQL and streaming

2014-07-09 Thread Tobias Pfeiffer
Siyuan, I do it like this: // get data from Kafka val ssc = new StreamingContext(...) val kvPairs = KafkaUtils.createStream(...) // we need to wrap the data in a case class for registerAsTable() to succeed val lines = kvPairs.map(_._2).map(s => StringWrapper(s)) val result = lines.transform((rdd,

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com

Re: Spark-streaming-kafka error

2014-07-08 Thread Tobias Pfeiffer
Bill, have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0" into your application jar? If I remember correctly, it's not bundled with the downloadable compiled version of Spark. Tobias On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Tobias Pfeiffer
. Tobias On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Tobias Pfeiffer
Juan, I am doing something similar, just not inserting into an SQL database but issuing some RPC call. I think mapPartitions() may be helpful to you. You could do something like dstream.mapPartitions(iter => { val db = new DbConnection() // maybe only do the above if !iter.isEmpty iter.map(item =>
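Filled out a little, still as a sketch (DbConnection is a hypothetical stand-in for the real client, as in the snippet above):

    import org.apache.spark.streaming.dstream.DStream

    // hypothetical connection type, not a real library class
    class DbConnection {
      def insert(item: String): Unit = ()
      def close(): Unit = ()
    }

    def insertPerPartition(dstream: DStream[String]): DStream[String] =
      dstream.mapPartitions { iter =>
        if (iter.isEmpty) iter
        else {
          val db = new DbConnection()   // one connection per partition
          iter.map { item =>
            db.insert(item)
            item
          }
          // note: the iterator is consumed lazily, so there is no obvious
          // place to call db.close() inside this block
        }
      }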

Re: Distribute data from Kafka evenly on cluster

2014-07-04 Thread Tobias Pfeiffer
use the approach of multiple kafkaStreams, I don't get this error, but also work is never distributed in my cluster... Thanks Tobias On Thu, Jul 3, 2014 at 11:58 AM, Tobias Pfeiffer t...@preferred.jp wrote: Thank you very much for the link, that was very helpful! So, apparently the `topics

Re: Kafka - streaming from multiple topics

2014-07-03 Thread Tobias Pfeiffer
Sergey, On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov sma...@collective.com wrote: On the other hand, under the hood KafkaInputDStream which is create with this KafkaUtils call, calls ConsumerConnector.createMessageStream which returns a Map[String, List[KafkaStream] keyed by topic. It is,

Visualize task distribution in cluster

2014-07-02 Thread Tobias Pfeiffer
Hi, I am using Mesos to run my Spark tasks. I would be interested to see how Spark distributes the tasks in the cluster (nodes, partitions) and which nodes are more or less active and do what kind of tasks, and how long the transfer of data and jobs takes. Is there any way to get this information

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far Kafka Dstream can read on topic-partition (via offset for example). By setting this to a small number, it will force DStream to read less data initially. Please see the post at

Re: Could not compute split, block not found

2014-06-30 Thread Tobias Pfeiffer
the processing time and solve the problem of data piling up. Thanks! Bill On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp wrote: ​​If your batch size is one minute and it takes more than one minute to process, then I guess that's what causes your problem

Distribute data from Kafka evenly on cluster

2014-06-27 Thread Tobias Pfeiffer
Hi, I have a number of questions about using the Kafka receiver of Spark Streaming. Maybe someone has some more experience with that and can help me out. I have set up an environment for getting to know Spark, consisting of - a Mesos cluster with 3 slave-only nodes and 3 combined master-and-slave nodes, - 2 Kafka nodes,

Re: Changing log level of spark

2014-06-25 Thread Tobias Pfeiffer
I have a log4j.xml in src/main/resources with <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd"> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/"> [...] <root> <priority value="warn" /> <appender-ref ref="Console" /> </root>

Re: Kafka client - specify offsets?

2014-06-24 Thread Tobias Pfeiffer
is different from what is documented, but then it's good for you (and me) because it allows you to specify "I want all that I can get" or "I want to start reading right now", even if there is an offset stored in Zookeeper. Tobias On Sun, Jun 15, 2014 at 11:27 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi

Re: Spark SQL: No function to evaluate expression

2014-06-17 Thread Tobias Pfeiffer
The error message *means* that there is no column called c_address. However, maybe it's a bug with Spark SQL not understanding the a.c_address syntax. Can you double-check the column name is correct? Thanks Tobias On Wed, Jun 18, 2014 at 5:02 AM, Zuhair Khayyat zuhair.khay...@gmail.com wrote:

Re: Kafka client - specify offsets?

2014-06-16 Thread Tobias Pfeiffer
Hi, there are apparently helpers to tell you the offsets https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example#id-0.8.0SimpleConsumerExample-FindingStartingOffsetforReads, but I have no idea how to pass that to the Kafka stream consumer. I am interested in that as well.

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread Tobias Pfeiffer
Hi, I remembered I saw this as well and found this ugly comment in my build.sbt file: On Mon, Jun 9, 2014 at 11:37 PM, Sean Owen so...@cloudera.com wrote: Looks like this crept in again from the shaded Akka dependency. I'll propose a PR to remove it. I believe that remains the way we have to

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-09 Thread Tobias Pfeiffer
/javax.servlet/jars/javax.servlet-3.0.0.v201112011016.jar * file manually. */ Tobias On Tue, Jun 10, 2014 at 10:47 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I remembered I saw this as well and found this ugly comment in my build.sbt file: On Mon, Jun 9, 2014 at 11:37 PM, Sean Owen so

Re: Spark Kafka streaming - ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-06-08 Thread Tobias Pfeiffer
Gaurav, I am not sure that the * expands to what you expect it to do. Normally, bash expands * to a space-separated string, not a colon-separated one. Try specifying all the jars manually, maybe? Tobias On Thu, Jun 5, 2014 at 6:45 PM, Gaurav Dasgupta gaurav.d...@gmail.com wrote: Hi, I have

Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Tobias Pfeiffer
Jeremy, On Mon, Jun 9, 2014 at 10:22 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: When you use match, the match must be exhaustive. That is, a match error is thrown if the match fails. Ahh, right. That makes sense. Scala is applying its strong typing rules here instead of no

Re: Classpath errors with Breeze

2014-06-08 Thread Tobias Pfeiffer
Hi, I had a similar problem; I was using `sbt assembly` to build a jar containing all my dependencies, but since my file system has a problem with long file names (due to disk encryption), some class files (which correspond to functions in Scala) were not included in the jar I uploaded.

How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Hi, I am trying to use Spark Streaming with Kafka, which works like a charm -- except for shutdown. When I run my program with sbt run-main, sbt will never exit, because there are two non-daemon threads left that don't die. I created a minimal example at

Re: How to shut down Spark Streaming with Kafka properly?

2014-06-05 Thread Tobias Pfeiffer
Sean, your patch fixes the issue, thank you so much! (This is the second time within one week I run into network libraries not shutting down threads properly, I'm really glad your code fixes the issue.) I saw your pull request is closed, but not merged yet. Can I do anything to get your fix into

Re: A single build.sbt file to start Spark REPL?

2014-06-03 Thread Tobias Pfeiffer
Hi, I guess it should be possible to dig through the scripts bin/spark-shell, bin/spark-submit etc. and convert them to a long sbt command that you can run. I just tried sbt run-main org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main but that fails with Failed

Spark Streaming on Mesos, various questions

2014-05-22 Thread Tobias Pfeiffer
Hi, with the hints from Gerard I was able to get my locally working Spark code running on Mesos. Thanks! Basically, on my local dev machine, I use sbt assembly to create a fat jar (which is actually not so fat since I use ... % 'provided' in my sbt file for the Spark dependencies), upload it to

ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Tobias Pfeiffer
Hi, I have set up a cluster with Mesos (backed by Zookeeper) with three master and three slave instances. I set up Spark (git HEAD) for use with Mesos according to this manual: http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html Using the spark-shell, I can connect to this

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tobias Pfeiffer
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das tathagata.das1...@gmail.com wrote: records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] } I think a Scala-ish way would be records.flatMap(_ match { case i: Int => Some(i) case _ => None })
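Both variants side by side, as a sketch with a made-up class hierarchy:

    import org.apache.spark.rdd.RDD

    sealed trait Fruit
    case class Orange(size: Int) extends Fruit
    case class Apple(size: Int) extends Fruit

    // filter, then cast, in two steps
    def orangesA(records: RDD[Fruit]): RDD[Orange] =
      records.filter(_.isInstanceOf[Orange]).map(_.asInstanceOf[Orange])

    // keep and downcast in one step via flatMap
    def orangesB(records: RDD[Fruit]): RDD[Orange] =
      records.flatMap {
        case o: Orange => Some(o)
        case _         => None
      }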

Re: RDD union of a window in Dstream

2014-05-21 Thread Tobias Pfeiffer
Hi, On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: I want to do union of all RDDs in each window of DStream. A window *is* a union of all RDDs in the respective time interval. The documentation says a DStream is represented as a sequence of RDDs. However, data from a
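A tiny sketch of what that means in code (window and slide durations are placeholders and must be multiples of the batch interval):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    def windowed(lines: DStream[String]): DStream[String] =
      // each RDD of the result is already the union of all RDDs of `lines`
      // falling into the last 30 seconds, emitted every 10 seconds
      lines.window(Seconds(30), Seconds(10))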
