Hi,
so I didn't manage to get the Broadcast variable with a new value
distributed to my executors in YARN mode. In local mode it worked fine, but
when running on YARN either nothing happened (when unpersist() was called
on the driver) or I got a TimeoutException (when called on the executor).
I
Hi,
On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Ok, then we need another trick.
Let's have an *implicit lazy var connection/context* around our code. And
setup() will trigger the eval and initialization.
Due to lazy evaluation, I think having
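For illustration, a minimal sketch of that kind of lazy initialization (the
DbConnection class and its argument are made-up placeholders for whatever
client/context you actually need):

// DbConnection is a stand-in for your real client or context class.
class DbConnection(hostPort: String) {
  def send(s: String): Unit = println(s"[$hostPort] $s")
}

object ConnectionHolder {
  // Evaluated on first access, i.e. once per executor JVM, not on the driver
  // and not once per record.
  lazy val connection: DbConnection = new DbConnection("host:port")
}

// usage inside an RDD operation:
// rdd.foreachPartition { items: Iterator[String] =>
//   items.foreach(item => ConnectionHolder.connection.send(item))
// }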
Hi,
On Fri, Nov 14, 2014 at 3:20 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
I wonder if SparkConf is dynamically updated on all worker nodes or only
during initialization. It can be used to piggyback information.
Otherwise I guess you are stuck with Broadcast.
Primarily I have had
Hi again,
On Mon, Nov 17, 2014 at 8:16 AM, Tobias Pfeiffer t...@preferred.jp wrote:
I have been trying to mis-use broadcast as in
- create a class with a boolean var, set to true
- query this boolean on the executors as a prerequisite to process the
next item
- when I want to shutdown, I
Hi,
I am processing a bunch of HDFS data using the StreamingContext (Spark
1.1.0) which means that all files that exist in the directory at start()
time are processed in the first batch. Now when I try to stop this stream
processing using `streamingContext.stop(false, false)` (that is, even with
Hi,
I guess I found part of the issue: I said
dstream.transform(rdd => { rdd.foreachPartition(...); rdd })
instead of
dstream.transform(rdd => { rdd.mapPartitions(...) }),
that's why stop() would not stop the processing.
Now with the new version a non-graceful shutdown works in the sense that
Bill,
However, now that I am using Spark 1.1.0, the Spark Streaming job
cannot receive any messages from Kafka. I have not made any change to the
code.
Do you see any suspicious messages in the log output?
Tobias
Hi,
On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy hmx...@gmail.com wrote:
If I run bin/spark-shell without connecting to a master, it can access an HDFS
file on a remote cluster with Kerberos authentication.
[...]
However, if I start the master and slave on the same host and using
bin/spark-shell
Hi,
also there is Spindle https://github.com/adobe-research/spindle which was
introduced on this list some time ago. I haven't looked into it deeply, but
you might gain some valuable insights from their architecture, they are
also using Spark to fulfill requests coming from the web.
Tobias
Hi,
On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:
But getLang() is one of the methods of twitter4j.Status since version 3.0.6
according to the doc at:
http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
What version of twitter4j does Spark Streaming use?
Josh,
On Tue, Nov 11, 2014 at 7:43 AM, Josh J joshjd...@gmail.com wrote:
I have some data generated by some utilities that returns the results as
a ListString. I would like to join this with a Dstream of strings. How
can I do this? I tried the following though get scala compiler errors
val
Akshat
On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote:
Does there exist a way to serialize Row objects to JSON.
I can't think of any other way than the one you proposed. A Row is more or
less an Array[Object], so you need to read JSON key and data type from the
Hi,
On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote:
On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote:
From http://spark.apache.org/docs/latest/configuration.html it seems
that there is an experimental property:
spark.files.userClassPathFirst
Thank you very much, I didn't know
Markus,
thanks for your help!
On Tue, Nov 4, 2014 at 8:33 PM, M. Dale medal...@yahoo.com.invalid wrote:
Tobias,
From http://spark.apache.org/docs/latest/configuration.html it seems
that there is an experimental property:
spark.files.userClassPathFirst
Thank you very much, I didn't
Hi,
On Mon, Nov 3, 2014 at 1:29 PM, Amey Chaugule ambr...@gmail.com wrote:
I thought that only applied when you're trying to run a job using
spark-submit or in the shell...
And how are you starting your YARN job, if not via spark-submit?
Tobias
Hi,
On Fri, Oct 31, 2014 at 4:31 PM, lieyan lie...@yahoo.com wrote:
The code is here: LogReg.scala
http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/LogReg.scala
Then I click the Run button in IDEA, and I get the following error
message
errlog.txt
Hi,
I tried hard to get a version of netty into my jar file created with sbt
assembly that works with all my libraries. Now I managed that and was
really happy, but it seems like spark-submit puts an older version of netty
on the classpath when submitting to a cluster, such that my code ends up
Harold,
just mentioning it in case you run into it: If you are in a separate
thread, there are apparently stricter limits to what you can and cannot
serialize:
val someVal = ...
future {
  // be very careful with defining RDD operations using someVal here:
  // copy it into a local val first and only use the local copy
  val myLocalVal = someVal
  // use myLocalVal
}
Hi,
I am trying to get my Spark application to run on YARN and by now I have
managed to build a fat jar as described on
http://markmail.org/message/c6no2nyaqjdujnkq (which is the only really
usable manual on how to get such a jar file). My code runs fine using sbt
test and sbt run, but when
Hi again,
On Thu, Oct 30, 2014 at 11:50 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Exception in thread "main" java.lang.NoClassDefFoundError:
com/typesafe/scalalogging/slf4j/Logger
It turned out scalalogging
Sean,
thanks, I didn't know about repartitionAndSortWithinPartitions, that seems
very helpful!
Tobias
Hi,
I have a setting where data arrives in Kafka and is stored to HDFS from
there (maybe using Camus or Flume). I want to write a Spark Streaming app
where
- first, all files in that HDFS directory are processed,
- and then the stream from Kafka is processed, starting
with the first item
Jianneng,
On Wed, Oct 8, 2014 at 8:44 AM, Jianneng Li jiannen...@berkeley.edu wrote:
I understand that Spark Streaming uses micro-batches to implement
streaming, while traditional streaming systems use the record-at-a-time
processing model. The performance benefit of the former is throughput,
Hi,
On Wed, Oct 8, 2014 at 4:50 AM, Josh J joshjd...@gmail.com wrote:
I have a source which fluctuates in the frequency of streaming tuples. I
would like to process certain batch counts, rather than batch window
durations. Is it possible to either
1) define batch window sizes
Cf.
Arko,
On Sat, Oct 4, 2014 at 1:40 AM, Arko Provo Mukherjee
arkoprovomukher...@gmail.com wrote:
Apologies if this is a stupid question but I am trying to understand
why this can or cannot be done. As far as I understand, streaming
algorithms need to be different from batch algorithms as
Hi,
I have a setup (in mind) where data is written to Kafka and this data is
persisted in HDFS (e.g., using camus) so that I have an all-time archive of
all stream data ever received. Now I want to process that all-time archive
and when I am done with that, continue with the live stream, using
Hi,
On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
So you have a single Kafka topic which has very high retention period (
that decides the storage capacity of a given Kafka topic) and you want to
process all historical data first using Camus and
. But there is no job execution taking place.
After some time, one by one, each goes down and the cluster shuts down.
On Fri, Sep 19, 2014 at 2:15 PM, Tobias Pfeiffer t...@preferred.jp
wrote:
Hi,
On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek
kartheek.m...@gmail.com wrote:
* you have copied
Hi,
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-building-project-with-sbt-assembly-is-extremely-slow-td13152.html
-- Maybe related to this?
Tobias
Hi,
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
This worked perfectly. But, I wanted to simultaneously rsync all the
slaves. So, added the other slaves as following:
rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname
Hi,
I am wondering: Is it possible to run spark-submit in a mode where it will
start an application on a YARN cluster (i.e., driver and executors run on
the cluster) and then forget about it in the sense that the Spark
application is completely independent from the host that ran the
spark-submit
Hi,
thanks for everyone's replies!
On Thu, Sep 18, 2014 at 7:37 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
YARN cluster mode should have the behavior you're looking for. The
client
process will stick around to report on things, but should be able to be
killed without affecting the
Hi,
On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen v...@paxata.com wrote:
Is it possible to express a diamond DAG and have the leaf dependency
evaluate only once?
Well, strictly speaking your graph is not a tree, and also the meaning of
leaf is not totally clear, I'd say.
So say data
Hi,
by now I understood maybe a bit better how spark-submit and YARN play
together and how Spark driver and slaves play together on YARN.
Now for my usecase, as described on
https://spark.apache.org/docs/latest/submitting-applications.html, I would
probably have a end-user-facing gateway that
Hi,
On Fri, Sep 12, 2014 at 9:12 AM, Patrick Wendell pwend...@gmail.com wrote:
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 171 developers!
Great,
Hi,
On Mon, Sep 8, 2014 at 4:39 PM, Sean Owen so...@cloudera.com wrote:
if (rdd.take(1).size == 1) {
rdd foreachPartition { iterator =>
I was wondering: Since take() is an output operation, isn't it computed
twice (once for the take(1), once during the
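Just to illustrate the concern, a minimal sketch (not from Sean's mail) that
caches the RDD so the take(1) emptiness check does not trigger a second full
computation; dstream is assumed to be an already defined DStream:

dstream.foreachRDD { rdd =>
  rdd.cache()                        // keep the partitions around after take(1)
  if (rdd.take(1).nonEmpty) {        // cheap emptiness check on the cached RDD
    rdd.foreachPartition { iterator =>
      iterator.foreach(println)      // process the partition here
    }
  }
  rdd.unpersist()                    // release the cached blocks again
}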
Hi,
On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
In the current state of Spark Streaming, creating separate Java processes
each having a streaming context is probably the best approach to
dynamically adding and removing of input sources. All of these
Ron,
On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid
wrote:
I’m trying to figure out how I can run Spark Streaming like an API.
The goal is to have a synchronous REST API that runs the spark data flow
on YARN.
I guess I *may* develop something similar in the
Hi,
On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
I want to create a synchronous REST API that will process some data that
is passed in as some request.
I would imagine that the Spark Streaming Job on YARN is a long
running job that waits on requests from
Hi,
On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
So I guess where I was coming from was the assumption that starting up a
new job to be listening on a particular queue topic could be done
asynchronously.
No, with the current state of Spark Streaming, all data
Hi,
On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Does Spark support recursive calls?
Can you give an example of which kind of recursion you would like to use?
Tobias
Hi,
On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Err... there's no such feature?
The problem is that the SQLContext's `catalog` member is protected, so you
can't access it from outside. If you subclass SQLContext, and make sure
that `catalog` is always a
Hi,
On Thu, Sep 4, 2014 at 11:49 PM, Johnny Kelsey jkkel...@semblent.com
wrote:
As a concrete example, we have a Python class (part of a fairly large
class library) which, as part of its constructor, also creates a record of
itself in the Cassandra keyspace. So we get an initialised class a
Hello,
On Wed, Sep 3, 2014 at 6:02 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Can someone tell me what kind of operations can be performed on a
replicated RDD? What are the use cases of a replicated RDD?
I suggest you read
Hi,
I am not sure if multi-tenancy is the right word, but I am thinking about
a Spark application where multiple users can, say, log into some web
interface and specify a data processing pipeline with streaming source,
processing steps, and output.
Now as far as I know, there can be only one
Hi,
On Wed, Sep 3, 2014 at 6:54 AM, salemi alireza.sal...@udo.edu wrote:
I was able to calculate the individual measures separately and now I have
to merge them and spark streaming doesn't support outer join yet.
Can't you assign some dummy key (e.g., index) before your processing and
then
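For illustration, a rough sketch of that dummy-key idea (the stream names are
invented):

// give both per-batch measure streams the same constant key, then join them
val withKeyA = measureA.map(v => ("merge-key", v))
val withKeyB = measureB.map(v => ("merge-key", v))
val merged   = withKeyA.join(withKeyB)   // DStream[(String, (A, B))]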
Hi,
On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991 praveshjain1...@gmail.com
wrote:
If you want to issue an SQL statement on streaming data, you must have
both
the registerAsTable() and the sql() call *within* the foreachRDD(...)
block,
or -- as you experienced -- the table name will be
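For reference, a minimal sketch of that pattern with the Spark 1.x API (the
Record case class and the query are made up; sqlContext and dstream are
assumed to exist):

case class Record(word: String, count: Int)

dstream.foreachRDD { rdd =>
  import sqlContext.createSchemaRDD           // implicit RDD -> SchemaRDD conversion
  val records = rdd.map(line => Record(line, 1))
  records.registerAsTable("records")          // register *inside* foreachRDD ...
  val top = sqlContext.sql("SELECT word, count FROM records LIMIT 10")
  top.collect().foreach(println)              // ... and run the query there, too
}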
Hi again,
On Tue, Aug 26, 2014 at 10:13 AM, Tobias Pfeiffer t...@preferred.jp wrote:
On Mon, Aug 25, 2014 at 7:11 PM, praveshjain1991
praveshjain1...@gmail.com wrote:
If you want to issue an SQL statement on streaming data, you must have
both
the registerAsTable() and the sql() call
Hi,
On Mon, Aug 25, 2014 at 9:56 AM, Dean Chen deanch...@gmail.com wrote:
We are using HDFS for log storage where logs are flushed to HDFS every
minute, with a new file created for each hour. We would like to consume
these logs using spark streaming.
The docs state that new HDFS files will be
Hi,
computations are triggered by an output operation. No output operation, no
computation. Therefore in your code example,
On Thu, Aug 21, 2014 at 11:58 PM, Josh J joshjd...@gmail.com wrote:
JavaPairReceiverInputDStream<String, String> messages =
Hi,
On Thu, Aug 21, 2014 at 3:11 PM, praveshjain1991 praveshjain1...@gmail.com
wrote:
The part that you mentioned, *the variable `result` is of type
DStream[Row]. That is, the meta-information from the SchemaRDD is lost and,
from what I understand, there is then no way to learn about the
Hi,
On Thu, Aug 21, 2014 at 2:19 PM, praveshjain1991 praveshjain1...@gmail.com
wrote:
Using Spark SQL with batch data works fine so I'm thinking it has to do
with
how I'm calling streamingcontext.start(). Any ideas what is the issue? Here
is the code:
Please have a look at
Hi,
On Tue, Aug 19, 2014 at 7:01 PM, Patrick McGloin mcgloin.patr...@gmail.com
wrote:
I think the type of the data contained in your RDD needs to be a known
case class and not abstract for createSchemaRDD. This makes sense when
you think it needs to know about the fields in the object to
Hi Wei,
On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com
wrote:
Since our application cannot tolerate losing customer data, I am wondering
what is the best way for us to address this issue.
1) We are thinking writing application specific logic to address the data
loss. To
Hi,
On Sat, Aug 16, 2014 at 3:29 AM, Yan Fang yanfang...@gmail.com wrote:
If all consecutive data points are in one batch, it's not complicated
except that the order of data points is not guaranteed in the batch and so
I have to use the timestamp in the data point to reach my goal. However,
Hi,
On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu wrote:
what is the best way to make a spark streaming driver highly available.
I would also be interested in that. In particular for Streaming
applications where the Spark driver is running for a long time, this might
be
Uh, for some reason I don't seem to automatically reply to the list any
more.
Here is again my message to Tom.
-- Forwarded message --
Tom,
On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek minnesota...@gmail.com wrote:
This is a back-to-basics question. How do we know when Spark
Hi,
On Wed, Aug 13, 2014 at 4:24 AM, Zia Syed xia.s...@gmail.com wrote:
I dont particularly see any errors on my logs, either on console, or on
slaves. I see slave downloads the spark-1.0.2-bin-hadoop1.tgz file and
unpacks them as well. Mesos Master shows quite a lot of Tasks created and
Hi,
On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers
gwenhael.pasqui...@ericsson.com wrote:
We intend to apply other operations on the data later in the same spark
context, but our first step is to archive it.
Our goal is something like this:
Step 1 : consume kafka
Step 2 : archive to
(Forgot to include the mailing list in my reply. Here it is.)
Hi,
On Thu, Aug 7, 2014 at 7:55 AM, Tom thubregt...@gmail.com wrote:
When I look at the output, I see that there are several stages, and several
tasks per stage. The tasks have a TID, I do not see such a thing for a
stage.
They
Hi,
that quoted statement doesn't make much sense to me, either. Maybe if
you had a link for us that shows the context (Google doesn't reveal
anything but this conversation), we could evaluate that statement better.
Tobias
On Tue, Jul 29, 2014 at 5:53 PM, Sean Owen so...@cloudera.com
Mayur,
I don't know if I exactly understand the context of what you are asking,
but let me just mention issues I had with deploying.
* As my application is a streaming application, it doesn't read any files
from disk, therefore I have no Hadoop/HDFS in place and there is no
need for it,
Bill,
Spark Streaming's DStream provides overloaded methods for transform() and
foreachRDD() that allow you to access the timestamp of a batch:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream
I think the timestamp is the end of the batch, not
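For illustration, a minimal sketch of that overload (dstream is assumed to be
an existing DStream):

import org.apache.spark.streaming.Time

dstream.foreachRDD { (rdd, time: Time) =>
  // time.milliseconds is the batch timestamp in epoch milliseconds
  println(s"batch at ${time.milliseconds} contains ${rdd.count()} items")
}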
Hi,
as far as I know, after the Streaming Context has started, the processing
pipeline (e.g., filter.map.join.filter) cannot be changed. As your SQL
statement is transformed into RDD operations when the Streaming Context
starts, I think there is no way to change the statement that is executed on
On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer t...@preferred.jp
wrote:
Bill,
are you saying that after repartition(400), you have 400 partitions on one
host and the other hosts receive none of the data?
Tobias
On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com
wrote
that?
On Fri, Jul 4, 2014 at 11:11 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
unfortunately, when I go the above approach, I run into this problem:
http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccabtfevyxvtaqvnmvwmh7yscfgxpw5kmrnw_gnq72cy4oa1b...@mail.gmail.com%3E
On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
I often find myself wanting to reference one thread from another, or from
a JIRA issue. Right now I have to google the thread subject and find the
link that way.
+1
1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp
wrote:
Hi,
On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com
wrote:
* Is there a way to control how far Kafka Dstream can read on
topic-partition (via offset for example). By setting this to a small
number
Hi,
thanks for creating the issue. It feels like in the last week, more or less
half of the questions about Spark Streaming were rooted in setting the master
to "local" ;-)
Tobias
On Wed, Jul 16, 2014 at 11:03 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Aah, right, copied from the wrong
Hi,
congratulations on the release! I'm always pleased to see how features pop
up in new Spark versions that I had added for myself in a very hackish way
before (such as JSON support for Spark SQL).
I am wondering if there is any good way to learn early about what is going
to be in upcoming
Hi,
I experienced exactly the same problems when using SparkContext with a
"local[1]" master specification, because one thread is always used for
receiving data and only the remaining threads do the processing. As there is
only one thread running, no processing will take place. Once you shut down the
connection,
the
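For illustration, a minimal sketch of the fix, assuming you simply need more
than one local thread:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// at least 2 threads: one for the receiver, one (or more) for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("ReceiverTest")
val ssc  = new StreamingContext(conf, Seconds(1))
// define the input stream and output operations here, then:
// ssc.start(); ssc.awaitTermination()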
Hi,
I think it would be great if we could do the string parsing only once and
then just apply the transformation for each interval (reducing the
processing overhead for short intervals).
Also, one issue with the approach above is that transform() has the
following signature:
def
:26 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
do the additional 100 nodes receive any tasks at all? (I don't know which
cluster you use, but with Mesos you could check client logs in the web
interface.) You might want to try something like repartition(N) or
repartition(N*2) (with N
Siyuan,
I do it like this:
// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)
// we need to wrap the data in a case class for registerAsTable() to succeed
val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
val result = lines.transform((rdd,
, 2014 at 5:51 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
I haven't worked with Yarn, but I would try adding a repartition() call
after you receive your data from Kafka. I would be surprised if that didn't
help.
On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
Bill,
have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" %
"1.0.0" into your application jar? If I remember correctly, it's not
bundled with the downloadable compiled version of Spark.
Tobias
On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi all,
I
.
Tobias
On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi Tobias,
Thanks for the suggestion. I have tried to add more nodes from 300 to 400.
It seems the running time did not get improved.
On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote
Juan,
I am doing something similar, just not inserting into an SQL database, but
issuing some RPC call. I think mapPartitions() may be helpful to you. You
could do something like
dstream.mapPartitions(iter => {
val db = new DbConnection()
// maybe only do the above if !iter.isEmpty
iter.map(item =>
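For completeness, a hedged end-to-end sketch of that pattern (DbConnection,
insert() and close() are made-up placeholders for whatever client you actually
use):

val processed = dstream.mapPartitions { iter =>
  if (iter.isEmpty) {
    iter
  } else {
    val db = new DbConnection()        // one connection per partition
    val results = iter.map { item =>
      db.insert(item)                  // or an RPC call, etc.
      item
    }.toList                           // force evaluation so we can close the connection
    db.close()
    results.iterator
  }
}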
use the approach of multiple kafkaStreams, I
don't get this error, but also work is never distributed in my cluster...
Thanks
Tobias
On Thu, Jul 3, 2014 at 11:58 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Thank you very much for the link, that was very helpful!
So, apparently the `topics
Sergey,
On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov sma...@collective.com wrote:
On the other hand, under the hood KafkaInputDStream which is create with
this KafkaUtils call, calls ConsumerConnector.createMessageStream which
returns a Map[String, List[KafkaStream]] keyed by topic. It is,
Hi,
I am using Mesos to run my Spark tasks. I would be interested to see how
Spark distributes the tasks in the cluster (nodes, partitions) and which
nodes are more or less active and do what kind of tasks, and how long the
transfer of data and jobs takes. Is there any way to get this information
Hi,
On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:
* Is there a way to control how far Kafka Dstream can read on
topic-partition (via offset for example). By setting this to a small
number, it will force DStream to read less data initially.
Please see the post at
the processing time and solve the problem of data piling up.
Thanks!
Bill
On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp wrote:
If your batch size is one minute and it takes more than one minute to
process, then I guess that's what causes your problem
Hi,
I have a number of questions about using the Kafka receiver of Spark
Streaming. Maybe someone has some more experience with that and can
help me out.
I have set up an environment for getting to know Spark, consisting of
- a Mesos cluster with 3 slave-only nodes and 3 master-and-slave nodes,
- 2 Kafka nodes,
I have a log4j.xml in src/main/resources with
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
[...]
<root>
<priority value="warn" />
<appender-ref ref="Console" />
</root>
is different from what is documented, but then it's good for you (and me)
because it allows you to specify "I want all that I can get" or "I want to
start reading right now", even if there is an offset stored in Zookeeper.
Tobias
On Sun, Jun 15, 2014 at 11:27 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi
The error message *means* that there is no column called c_address.
However, maybe it's a bug with Spark SQL not understanding the
a.c_address syntax. Can you double-check the column name is correct?
Thanks
Tobias
On Wed, Jun 18, 2014 at 5:02 AM, Zuhair Khayyat
zuhair.khay...@gmail.com wrote:
Hi,
there are apparently helpers to tell you the offsets
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example#id-0.8.0SimpleConsumerExample-FindingStartingOffsetforReads,
but I have no idea how to pass that to the Kafka stream consumer. I am
interested in that as well.
Hi,
I remembered I saw this as well and found this ugly comment in my
build.sbt file:
On Mon, Jun 9, 2014 at 11:37 PM, Sean Owen so...@cloudera.com wrote:
Looks like this crept in again from the shaded Akka dependency. I'll
propose a PR to remove it. I believe that remains the way we have to
/javax.servlet/jars/javax.servlet-3.0.0.v201112011016.jar
* file manually.
*/
Tobias
On Tue, Jun 10, 2014 at 10:47 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I remembered I saw this as well and found this ugly comment in my
build.sbt file:
On Mon, Jun 9, 2014 at 11:37 PM, Sean Owen so
Gaurav,
I am not sure that the * expands to what you expect it to.
Normally, bash expands * to a space-separated string, not a
colon-separated one. Try specifying all the jars manually, maybe?
Tobias
On Thu, Jun 5, 2014 at 6:45 PM, Gaurav Dasgupta gaurav.d...@gmail.com wrote:
Hi,
I have
Jeremy,
On Mon, Jun 9, 2014 at 10:22 AM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
When you use match, the match must be exhaustive. That is, a match error
is thrown if the match fails.
Ahh, right. That makes sense. Scala is applying its strong typing rules
here instead of no
Hi,
I had a similar problem; I was using `sbt assembly` to build a jar
containing all my dependencies, but since my file system has a problem
with long file names (due to disk encryption), some class files (which
correspond to functions in Scala) where not included in the jar I
uploaded.
Hi,
I am trying to use Spark Streaming with Kafka, which works like a
charm -- except for shutdown. When I run my program with sbt
run-main, sbt will never exit, because there are two non-daemon
threads left that don't die.
I created a minimal example at
Sean,
your patch fixes the issue, thank you so much! (This is the second
time within one week I run into network libraries not shutting down
threads properly, I'm really glad your code fixes the issue.)
I saw your pull request is closed, but not merged yet. Can I do
anything to get your fix into
Hi,
I guess it should be possible to dig through the scripts
bin/spark-shell, bin/spark-submit etc. and convert them to a long sbt
command that you can run. I just tried
sbt run-main org.apache.spark.deploy.SparkSubmit spark-shell
--class org.apache.spark.repl.Main
but that fails with
Failed
Hi,
with the hints from Gerard I was able to get my locally working Spark
code running on Mesos. Thanks!
Basically, on my local dev machine, I use sbt assembly to create a
fat jar (which is actually not so fat since I use ... % "provided"
in my sbt file for the Spark dependencies), upload it to
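For illustration, the relevant build.sbt lines might look like this (the
version numbers are placeholders):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.0.2" % "provided"
)
// `sbt assembly` then leaves the Spark classes out of the fat jar; the cluster
// provides them at runtime.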
Hi,
I have set up a cluster with Mesos (backed by Zookeeper) with three
master and three slave instances. I set up Spark (git HEAD) for use
with Mesos according to this manual:
http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html
Using the spark-shell, I can connect to this
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] }
I think a Scala-ish way would be
records.flatMap(_ match {
  case i: Int =>
    Some(i)
  case _ =>
    None
})
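Just as a side note (not from the thread above): RDD.collect also has an
overload that takes a partial function, which expresses the same
filter-plus-cast more compactly (records and Orange as in the example above):

val oranges = records.collect { case o: Orange => o }   // RDD[Orange]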
Hi,
On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
I want to do a union of all RDDs in each window of a DStream.
A window *is* a union of all RDDs in the respective time interval.
The documentation says a DStream is represented as a sequence of
RDDs. However, data from a
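For illustration, a minimal sketch (dstream and the durations are made up; the
window duration must be a multiple of the batch interval):

// a 30-second window sliding every 10 seconds; each windowed RDD already is
// the union of all RDDs that fall into that interval
val windowed = dstream.window(Seconds(30), Seconds(10))
windowed.foreachRDD(rdd => println(rdd.count()))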