It will add scheduling delay for the new batch. The new batch's data will be
processed only after the previous batch finishes. When that delay grows too
large, it sometimes throws fetch failures because the batch data may already
have been removed from memory.
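For reference, a minimal sketch of how the batch interval is chosen when creating the StreamingContext (application name, input source, and interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// If each batch takes longer than this interval to process, new batches queue up
// and the scheduling delay keeps growing.
val conf = new SparkConf().setAppName("BatchIntervalExample")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second batch interval (illustrative)

val lines = ssc.socketTextStream("localhost", 9999) // placeholder input source
lines.count().print()

ssc.start()
ssc.awaitTermination()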
Thanks
Best Regards
On Wed, Apr 1, 2015 at 11:35 AM,
Hi,
In Spark on YARN, there is a property, spark.yarn.max.executor.failures,
which controls the maximum number of executor failures an application will
survive.
If the number of executor failures (due to any reason, such as OOM or machine
failure) exceeds this value, then the application quits.
The expensive query can take all executor slots, but no task occupies an
executor permanently, i.e. the second job may still get some resources to
execute in between tasks of the expensive query.
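For reference, a minimal sketch of enabling fair scheduling (the allocation file path and pool name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Fair scheduling lets short jobs obtain task slots while a long job is still running.
val conf = new SparkConf()
  .setAppName("FairSchedulingExample")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional, illustrative path

val sc = new SparkContext(conf)

// Jobs submitted from this thread go into the named pool (pool name is illustrative).
sc.setLocalProperty("spark.scheduler.pool", "adhoc")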
Can the fair scheduler mode help in this case? Or is it possible to set up
Thrift such that
hi guys:
I got a question when reading
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval.
What will happen to the streaming data if the batch processing time is
bigger than the batch interval? Will the next batch data be
Hello Minnow,
It is possible. You can, for example, use the Jersey REST client to query a web
service and get its results in a Spark job. In fact, that's what we did
in a recent project (a Spark Streaming application).
Kind regards,
Emre Sevinç
http://www.bigindustries.be/
On Tue, Mar
Rerun person.count and you will see the effect of the cache.
person.cache does not cache the RDD right away; the RDD is actually cached only
after the first action (person.count here).
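A minimal sketch of this behavior (the path and case class are illustrative):

case class Person(age: Int, name: String)

val person = sc.textFile("hdfs:///user/person.txt")   // illustrative path
  .map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1)))

person.cache()   // only marks the RDD for caching; nothing is stored yet
person.count()   // first action: computes the RDD and populates the cache
person.count()   // second action: should read from the in-memory cache and run faster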
----- Original Message -----
From: fightf...@163.com
To: user user@spark.apache.org
Sent: Wednesday, April 1, 2015, 1:21:25 PM
Subject:
I am facing the same issue. FAIR and FIFO behaving in the same way.
On Wed, Apr 1, 2015 at 1:49 AM, asadrao as...@microsoft.com wrote:
Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the
first query is a very expensive query (ex: ‘select *’ on a really big data
set)
You can use zipWithIndex() to get an index for each record and then
increment each index by 1.
val tf = sc.textFile("test").zipWithIndex()
tf.map(s => (s._2 + 1, s._1))
Above should serve your purpose.
--
View this message in context:
Once you submit the job, do a ps aux | grep spark-submit and see how much
heap space is allocated to the process (the -Xmx param); if you are
seeing a lower value, you could try increasing it yourself with:
export _JAVA_OPTIONS=-Xmx5g
Thanks
Best Regards
On Wed, Apr 1, 2015 at 1:57 AM, Shuai
hummm, got it. Thank you Akhil.
Thanks & Best regards!
罗辉 San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: SparkStreaming batch processing time question
Date: April 1, 2015
No, cache() just changes the bookkeeping of the existing RDD. Although it
returns a reference, simply calling person.cache is enough.
I can't reproduce this. When I try to cache an RDD and then count it,
it is persisted in memory and I see it in the web UI. Something else
must be different about what's
This is tracked by these JIRAs..
https://issues.apache.org/jira/browse/SPARK-5947
https://issues.apache.org/jira/browse/SPARK-5948
From: denny.g@gmail.com
Date: Wed, 1 Apr 2015 04:35:08 +
Subject: Creating Partitioned Parquet Tables via SparkSQL
To: user@spark.apache.org
Creating
I have a simple setup/runtime of Kafka and Spark.
I have a command line consumer displaying arrivals to a Kafka topic, so I
know messages are being received.
But when I try to read from Kafka topic I get no messages, here are some
logs below.
I'm thinking there aren't enough threads. How do I
Hmm, I got the same error with the master. Here is another test example
that fails. Here, I explicitly create
a Row RDD which corresponds to the use case I am in :
object TestDataFrame { def main(args: Array[String]): Unit = {
val conf = new
I am using the Spark ‘fair’ scheduler mode.
What do you mean by this? Fair scheduling mode is not one thing in Spark,
but allows for multiple configurations and usages. Presumably, at a
minimum you are using SparkConf to set spark.scheduler.mode to FAIR, but
then how are you setting up
The cache() method returns a new RDD, so you have to use something like this:
val person =
  sc.textFile("hdfs://namenode_host:8020/user/person.txt")
    .map(_.split(","))
    .map(p => Person(p(0).trim.toInt, p(1)))
val cached = person.cache
cached.count
When you rerun count on cached you will see that cache
Hi
Still no good luck with your guide.
Best.
Sun.
fightf...@163.com
From: Yuri Makhno
Date: 2015-04-01 15:26
To: fightf...@163.com
CC: Taotao.Li; user
Subject: Re: Re: rdd.cache() not working ?
The cache() method returns a new RDD, so you have to use something like this:
val person =
Please make sure that you have allocated more cores than the number of receivers.
From: James King
Date: 2015-04-01 15:21
To: user
Subject: Spark + Kafka
I have a simple setup/runtime of Kafka and Spark.
I have a command line consumer displaying arrivals to a Kafka topic, so I know
messages are being
Thanks Saisai,
Sure will do.
But just a quick note that when I set the master to local[*], Spark started
showing Kafka messages as expected, so the problem in my view was not having
enough threads to process the incoming data.
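A minimal sketch of the cores-vs-receivers point (master URL, ZooKeeper address, group id, and topic map are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// local[1] would starve processing: the single thread gets taken by the receiver.
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaExample")
val ssc = new StreamingContext(conf, Seconds(5))

val messages = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("test" -> 1))
messages.count().print()

ssc.start()
ssc.awaitTermination()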
Thanks.
On Wed, Apr 1, 2015 at 10:53 AM, Saisai Shao
You can disable with spark.ui.showConsoleProgress=false but I also
wasn't sure why this writes straight to System.err, at first. I assume
it's because it's writing carriage returns to achieve the animation
and this won't work via a logging framework. stderr is where log-like
output goes, because
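For reference, a minimal sketch of turning the progress bar off programmatically (it can equally be passed as --conf on spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Disable the in-place stage progress bar that is written to stderr.
val conf = new SparkConf()
  .setAppName("NoProgressBar")
  .set("spark.ui.showConsoleProgress", "false")

val sc = new SparkContext(conf)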
Would you please share your code snippet, so we can identify whether there is
anything wrong in your code.
Besides, would you please grep your driver's debug log to see if there's any
debug log like Stream xxx received block xxx; this means that Spark
Streaming keeps receiving data from
Since switching to Spark 1.2.1 I'm seeing logging for the stage progress
(ex.):
[error] [Stage 2154: (14 + 8) / 48][Stage 2210:
(0 + 0) / 48]
Any reason why these are error level logs? Shouldn't they be info level?
In any case is there a way to disable them other than
OK, it seems there's nothing strange in your code. So maybe we need to narrow
down the cause. Would you please run the KafkaWordCount example in Spark to see if
it is OK; if it is, then we should focus on your implementation, otherwise
Kafka potentially has some problems.
Thanks
Jerry
From:
Question:
---
Is there a way to have JDBC DataFrames use quoted/escaped column names?
Right now, it looks like it sees the names correctly in the schema created
but does not escape them in the SQL it creates when they are not compliant:
org.apache.spark.sql.jdbc.JDBCRDD
private
Hi
That is just the issue. After running person.cache we then run person.count;
however, there is still no cache performance shown in the web UI storage tab.
Thanks,
Sun.
fightf...@163.com
From: Taotao.Li
Date: 2015-04-01 14:02
To: fightfate
CC: user
Subject: Re: rdd.cache() not working ?
Does the expensive query take all executor slots? Then there is nothing for
any other job to use regardless of scheduling policy.
On Mar 31, 2015 9:20 PM, asadrao as...@microsoft.com wrote:
Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the
first query is a very
Hi all
Thanks a lot for capturing this.
We are now using the 1.3.0 release. We tested with both the prebuilt version of
Spark and a version compiled from source targeting our CDH components,
and the cache result did not show as expected. However, if we create a dataframe
from the person RDD and use
Thank you bit1129,
From looking at the web UI I can see 2 cores.
I am also looking at http://spark.apache.org/docs/1.2.1/configuration.html,
but I can't see an obvious configuration for the number of receivers; can you
help please.
On Wed, Apr 1, 2015 at 9:39 AM, bit1...@163.com bit1...@163.com wrote:
Thanks Felix :)
On Wed, Apr 1, 2015 at 00:08 Felix Cheung felixcheun...@hotmail.com wrote:
This is tracked by these JIRAs..
https://issues.apache.org/jira/browse/SPARK-5947
https://issues.apache.org/jira/browse/SPARK-5948
--
From: denny.g@gmail.com
Date:
Hi Tathagata
do you know if a JMS Receiver was introduced during the last year as a standard
Spark component, or is somebody still developing it?
Regards
Danila
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p22337.html
Sent from the
This is the code. And I couldn't find anything like the log you suggested.
public KafkaLogConsumer(int duration, String master) {
JavaStreamingContext spark = createSparkContext(duration, master);
Map<String, Integer> topics = new HashMap<String, Integer>();
topics.put("test", 1);
+1 on escaping column names.
On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote:
Question:
---
Is there a way to have JDBC DataFrames use quoted/escaped column names?
Right now, it looks like it sees the names correctly in the schema created
but does not
Thank you Sean that does the trick.
On Wed, Apr 1, 2015 at 12:05 PM, Sean Owen so...@cloudera.com wrote:
You can disable with spark.ui.showConsoleProgress=false but I also
wasn't sure why this writes straight to System.err, at first. I assume
it's because it's writing carriage returns to
Hi,
I am trying to set up a minimal 2-node Spark cluster for testing purposes. When I
start the cluster with start-all.sh I get an rsync error message:
rsync: change_dir /usr/local/spark130/sbin//right failed: No such file or
directory (2)
rsync error: some files/attrs were not transferred (see
Hi Akhil,
Thanks a lot!
After setting export _JAVA_OPTIONS=-Xmx5g, the OutOfMemory exception disappeared.
But this confuses me: so the driver-memory option doesn't work for
spark-submit to YARN (I haven't checked other clusters)? Is it a bug?
Regards,
Shuai
From: Akhil Das
Please invoke dev/change-version-to-2.11.sh before running mvn.
Cheers
On Mon, Mar 30, 2015 at 1:02 AM, Night Wolf nightwolf...@gmail.com wrote:
Hey,
Trying to build Spark 1.3 with Scala 2.11 supporting yarn hive (with
thrift server).
Running;
*mvn -e -DskipTests -Pscala-2.11
Hi all,
As I understand from docs and talks, the streaming state is in memory as
RDD (optionally checkpointable to disk). SPARK-2629 hints that this in
memory structure is not indexed efficiently?
I am wondering how my performance would be if the streaming state does not
fit in memory (say 100GB
Response inline.
On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun sean.bigdata...@gmail.com
wrote:
(resending...)
I was thinking the same setup… But the more I think of this problem,
the more interesting it becomes.
If we allocate 50% total memory to Tachyon statically, then the
Hi,
we always get issues when inserting into or creating tables with the Amazon EMR
Spark version: when inserting about a 1GB result set, the Spark SQL query never
finishes, while inserting a small result set (like 500MB) works fine.
*spark.sql.shuffle.partitions* by default 200
or *set
I feel like I recognize that problem, and it's almost the inverse of
https://issues.apache.org/jira/browse/SPARK-3884 which I was looking
at today. The spark-class script didn't seem to handle all the ways
that driver memory can be set.
I think this is also something fixed by the new launcher
Hi.
In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement:
CREATE TABLE testTable STORED AS PARQUET AS
SELECT
field1
FROM table1
field1 is SMALLINT. If table1 is in text format everything is OK, but if table1
is in parquet format, Spark returns the following error:
Upon executing these two lines of code:
conf = new SparkConf().setAppName(appName).setMaster(master);
sc = new JavaSparkContext(conf);
I get the following error message:
ERROR Configuration: Failed to set setXIncludeAware(true) for parser
No, it means you have a conflicting version of Xerces somewhere in
your classpath. Maybe your app is bundling an old version?
On Wed, Apr 1, 2015 at 4:46 PM, ARose ashley.r...@telarix.com wrote:
Upon executing these two lines of code:
conf = new
With receivers, it was pretty obvious which code ran where - each receiver
occupied a core and ran on the workers. However, with the new kafka direct
input streams, it's hard for me to understand where the code that's reading
from kafka brokers runs. Does it run on the driver (I hope not), or does
Good Morning All,
I would like to use Spark in a special synthetic case that justifies the use
of Spark.
So, I am looking for a case based on data (which can be big) and, possibly, the
associated Java and/or Python code.
It would be fantastic if you could refer me to a link where I can load this
I am attempting to read an hbase table in pyspark with a range scan.
conf = {
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": table,
    "hbase.mapreduce.scan": scan
}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
How hard would it be to expose this in some way? I ask because the current
textFile and objectFile functions are obviously at some point calling out
to a FileInputFormat and configuring it.
Could we get a way to configure any arbitrary inputformat / outputformat?
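For reference, sc.newAPIHadoopFile already accepts an arbitrary InputFormat class plus a Hadoop Configuration; a minimal sketch (the path and the configuration key are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n\n")  // illustrative setting

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",          // illustrative path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)

val lines = rdd.map { case (_, text) => text.toString }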
Hi,
We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of
Tuple2. At the end of day, a daily job is launched, which works on the
outputs of the hourly jobs.
For data locality and speed, we wish that when the daily job launches, it
finds all instances of a given key at a single
I'm actually running this in a separate environment to our HDFS cluster.
I think I've been able to sort out the issue by copying
/opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just
using a one-worker setup at present) and adding the following to
spark-env.sh:
export
I'm curious - I'm not sure if I understand you correctly. With SparkR, the work
is distributed in Spark and computed in R; isn't that what you are looking for?
SparkR was on rJava for the R-JVM but moved away from it.
rJava has a component called JRI which allows JVM to call R.
You could call
I have 7 workers for Spark and set SPARK_WORKER_CORES=12, so 84 tasks in
one job can run simultaneously.
I call the set of tasks in a job that start almost simultaneously a wave.
While inserting, there is only one job on Spark; I am not inserting from multiple
programs concurrently.
—
Best Regards!
Hi,
Thanks Sandy.
Another way to look at this: would we want our long-running
application to die?
So let's say we create a window of around 10 batches, and we are using
incremental operations inside our application; a restart here is
relatively more costly, so
Can you read a snappy compressed file in HDFS? It looks like libsnappy.so is
not on the Hadoop native lib path.
On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote:
Has anyone else encountered the following error when trying to read a snappy
compressed sequence file from HDFS?
Great! Thank you!
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, April 02, 2015 8:11 AM
To: Haopu Wang
Cc: user; d...@spark.apache.org
Subject: Re: Can I call aggregate UDF in DataFrame?
You totally can.
You totally can.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792
There is also an attempt at adding stddev here already:
https://github.com/apache/spark/pull/5228
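For illustration, a minimal sketch of calling built-in aggregate functions on a DataFrame in 1.3 (df is assumed to exist; column names are illustrative):

import org.apache.spark.sql.functions._

// df is assumed to have columns "dept" and "salary" (illustrative names).
val result = df.groupBy("dept").agg(avg(df("salary")), max(df("salary")), count(df("salary")))
result.show()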
On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang hw...@qilinsoft.com
hi,
Yes, you're right.
The original VertexIds are used to join them in VertexRDD#xxxJoin.
On Thu, Apr 2, 2015 at 7:31 AM, hokiegeek2 soozandjohny...@gmail.com
wrote:
Hi Everyone,
Quick (hopefully) and silly (likely) question--the VertexId can be used to
join the VertexRDD generated from
Ravi, we just merged https://issues.apache.org/jira/browse/SPARK-6642
and used the same lambda scaling as in 1.2. The change will be
included in Spark 1.3.1, which will be released soon. Thanks for
reporting this issue! -Xiangrui
On Tue, Mar 31, 2015 at 8:53 PM, Xiangrui Meng men...@gmail.com
This is fixed in Spark 1.3.
https://issues.apache.org/jira/browse/SPARK-5195
On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Hi all,
Noticed a bug in my current version of Spark 1.2.1.
After a table is cached with “cache table table” command, query will
Surprised I haven't gotten any responses about this. Has anyone tried using
rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the
other way; what I'd like to do is use R for model calculation and Spark to
distribute the load across the cluster.
Also, has anyone used Scalation
Have you looked at http://happybase.readthedocs.org/en/latest/ ?
Cheers
On Apr 1, 2015, at 4:50 PM, Eric Kimbrel eric.kimb...@soteradefense.com
wrote:
I am attempting to read an hbase table in pyspark with a range scan.
conf = {
    "hbase.zookeeper.quorum": host,
Has anyone else encountered the following error when trying to read a snappy
compressed sequence file from HDFS?
*java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z*
The following works for me when the file is uncompressed:
import
Thanks for the super quick response!
I can read the file just fine in hadoop, it's just when I point Spark at
this file it can't seem to read it due to the missing snappy jars / so's.
I'm playing around with adding some things to the spark-env.sh file, but still
nothing.
On Wed, Apr 1, 2015 at 7:19
Error 23 is defined as a partial transfer and might be caused by
filesystem incompatibilities, such as different character sets or access
control lists. In this case it could be caused by the double slashes (// at
the end of sbin). You could try editing your sbin/spark-daemon.sh file;
look for
Hi,
Just to let you know, I have made some enhancement in Low Level Reliable
Receiver based Kafka Consumer (
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) .
The earlier version used as many receiver tasks as the number of partitions of your
Kafka topic. Now you can configure the desired
Try sbt assembly instead.
On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Why do I get
Failed to find Spark assembly JAR.
You need to build Spark before running this program. ?
I downloaded spark-1.2.1.tgz from the downloads page and extracted it.
When I do sbt
Can you open a JIRA please?
On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote:
Hi,
I find HiveContext.setConf does not work correctly. Here are some code
snippets showing the problem:
snippet 1:
What do you mean by permanently? If you start up the JDBC server and say
CACHE TABLE, it will stay cached as long as the server is running. CACHE
TABLE is idempotent, so you could even just have that command in your BI
tools setup queries.
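For illustration, a minimal sketch of issuing that setup statement from an application (hiveContext is assumed to exist and the table name is illustrative; a BI tool would send the same CACHE TABLE statement over JDBC):

// CACHE TABLE is idempotent, so re-running this in every session's setup is harmless.
hiveContext.sql("CACHE TABLE my_hive_table")      // table name is illustrative

// Subsequent queries are served from the in-memory columnar cache.
hiveContext.sql("SELECT count(*) FROM my_hive_table").show()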
On Wed, Apr 1, 2015 at 11:02 AM, Venkat, Ankam
Thanks Cody, that was really helpful. I have a much better understanding
now. One last question - Kafka topics are initialized once in the driver,
is there an easy way of adding/removing topics on the fly?
KafkaRDD#getPartitions() seems to be computed only once, and no way of
refreshing them.
Filed https://issues.apache.org/jira/browse/SPARK-6653
On Sun, Mar 29, 2015 at 8:18 PM, Shixiong Zhu zsxw...@gmail.com wrote:
LGTM. Could you open a JIRA and send a PR? Thanks.
Best Regards,
Shixiong Zhu
2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
I looked @ the 1.3.0
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
The kafka consumers run in the executors.
On Wed, Apr 1, 2015 at 11:18 AM, Neelesh neele...@gmail.com wrote:
With receivers, it was pretty obvious which code ran where - each receiver
occupied a core and ran on the
Can you try with Spark 1.3? Much of this code path has been rewritten /
improved in this version.
On Wed, Apr 1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote:
Hi.
In Spark SQL 1.2.0, with HiveContext, I'm executing the following
statement:
CREATE TABLE testTable STORED AS PARQUET AS
Maybe check the examples?
http://spark.apache.org/examples.html
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Wed, Apr 1, 2015 at 8:31 PM, Vila, Didier didier.v...@teradata.com
wrote:
Good Morning All,
I would like to use
If you want to change topics from batch to batch, you can always just
create a KafkaRDD repeatedly.
The streaming code as it stands assumes a consistent set of topics though.
The implementation is private so you can't subclass it without building your
own Spark.
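A minimal sketch of building a KafkaRDD on demand via KafkaUtils.createRDD, assuming you track offsets yourself (broker, topic, partition, and offsets are illustrative):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // illustrative broker

// Offsets must be tracked by the caller; topic, partition, and offsets here are illustrative.
val offsetRanges = Array(OffsetRange("topicA", 0, 0L, 100L))

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

rdd.map(_._2).take(10).foreach(println)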
On Wed, Apr 1, 2015 at 1:09 PM,
That's a good question, Twinkle.
One solution could be to allow a maximum number of failures within any
given time span. E.g. a max failures per hour property.
-Sandy
On Tue, Mar 31, 2015 at 11:52 PM, twinkle sachdeva
twinkle.sachd...@gmail.com wrote:
Hi,
In spark over YARN, there is a
Hi Dibyendu,
Thanks for your work on this project. Spark 1.3 now has direct kafka
streams, but still does not provide enough control over partitions and
topics. For example, the streams are fairly statically configured -
RDD.getPartitions() is computed only once, thus making it difficult to use
That is failing too, with
sbt.resolveexception: unresolved
dependency:org.apache.spark#spark-network-common_2.10;1.2.1
On Wed, Apr 1, 2015 at 1:24 PM, Marcelo Vanzin van...@cloudera.com wrote:
Try sbt assembly instead.
On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan kvi...@vt.edu
Hi,
I find HiveContext.setConf does not work correctly. Here are some code
snippets showing the problem:
snippet 1:
import org.apache.spark.sql.hive.HiveContext
import
Can you open a JIRA for this please?
On Wed, Apr 1, 2015 at 6:14 AM, Ted Yu yuzhih...@gmail.com wrote:
+1 on escaping column names.
On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote:
Question:
---
Is there a way to have JDBC DataFrames use quoted/escaped
Nice.
But in my case, even when I use yarn-client, I have the same issue. I have
verified it several times.
And I am running 1.3.0 on EMR (using the version dispatched by the installSpark
script from AWS).
I agree _JAVA_OPTIONS is not the right solution, but I will use it until 1.4.0
is out :)
Regards,
I am trying to integrate Spark SQL with a BI tool. My requirement is to query a
Hive table very frequently from the BI tool.
Is there a way to cache the Hive table permanently in Spark SQL? I don't want
to read the Hive table and cache it every time a query is submitted from the BI
tool.
Thanks!
You will need to create a hive parquet table that points to the data and
run ANALYZE TABLE tableName noscan so that we have statistics on the size.
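A minimal sketch of that setup (sc is assumed to exist; schema, path, and table name are illustrative; the full Hive syntax is ANALYZE TABLE ... COMPUTE STATISTICS noscan):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Point an external Hive table at the existing parquet files.
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (id INT, name STRING)
  STORED AS PARQUET
  LOCATION 'hdfs:///data/events'
""")

// Gather table-size statistics without scanning the data, so the planner
// can make decisions such as choosing broadcast joins.
hiveContext.sql("ANALYZE TABLE events COMPUTE STATISTICS noscan")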
On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra jitesh...@gmail.com
wrote:
Hi Michael,
Thanks for your response. I am running 1.2.1.
Is
Hi Experts,
I have a parquet dataset of 550 MB ( 9 Blocks) in HDFS. I want to run SQL
queries repetitively.
Few questions :
1. When I do the below (persist to memory after reading from disk), it takes a
lot of time to persist to memory; any suggestions on how to tune this?
val inputP
As I said in the original ticket, I think the implementation classes should
be exposed so that people can subclass and override compute() to suit their
needs.
Just adding a function from Time => Set[TopicAndPartition] wouldn't be
sufficient for some of my current production use cases.
compute()
Sonal,
Thanks for the link (I have not worked with them yet).
Do you have any other, more “exotic” examples where I can play with specific
data and algorithms?
Cheers,
Didier
From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: Wednesday, April 01, 2015 7:09 PM
To: Vila, Didier
Cc:
Thanks for confirming!
On Wed, Apr 1, 2015 at 12:33 PM, Tathagata Das t...@databricks.com wrote:
In the current state yes there will be performance issues. It can be done
much more efficiently and we are working on doing that.
TD
On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar
Hi all,
I just tried launching a Spark cluster on EC2 as described in
http://spark.apache.org/docs/1.3.0/ec2-scripts.html
I got the following response:
<Response><Errors><Error><Code>PendingVerification</Code><Message>Your
account is currently being verified. Verification normally takes less than
2
Thanks Cody!
On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger c...@koeninger.org wrote:
If you want to change topics from batch to batch, you can always just
create a KafkaRDD repeatedly.
The streaming code as it stands assumes a consistent set of topics
though. The implementation is
We should be able to support that use case in the direct API. It may be as
simple as allowing the users to pass on a function that returns the set of
topic+partitions to read from.
That is, a function (Time) => Set[TopicAndPartition]. This gets called every
batch interval before the offsets are
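A sketch of what such a hook could look like; this is the proposal being discussed, not an existing Spark 1.3 API, and the names in the body are hypothetical:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.Time

// Hypothetical user-supplied hook, evaluated before each batch's offsets are decided.
val topicsForBatch: Time => Set[TopicAndPartition] = { time =>
  // e.g. look up the current topic list from some external registry (illustrative)
  Set(TopicAndPartition("topicA", 0), TopicAndPartition("topicB", 0))
}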
It's not a built-in component of Spark. However, there is a spark-package for an
Apache Camel receiver which can integrate with JMS.
http://spark-packages.org/package/synsys/spark
I have not tried it but do check it out.
TD
On Wed, Apr 1, 2015 at 4:38 AM, danila danila.erma...@gmail.com wrote:
Hi
In the current state yes there will be performance issues. It can be done
much more efficiently and we are working on doing that.
TD
On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar vin...@uber.com wrote:
Hi all,
As I understand from docs and talks, the streaming state is in memory as
RDD
Why do I get
Failed to find Spark assembly JAR.
You need to build Spark before running this program. ?
I downloaded spark-1.2.1.tgz from the downloads page and extracted it.
When I do sbt package inside my application, it worked fine. But when I
try to run my application, I get the above
Hey Chris,
Apologies for the delayed reply. Your responses are always insightful and
appreciated :-)
However, I have a few more questions.
also, it looks like you're writing to S3 per RDD. you'll want to broaden
that out to write DStream batches
I assume you mean dstream.saveAsTextFiles()
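For illustration, a minimal sketch of the two styles (dstream is assumed to be an existing DStream; the S3 paths are illustrative):

// Per-RDD writes, roughly what the original code did:
dstream.foreachRDD { rdd =>
  rdd.saveAsTextFile("s3n://my-bucket/output/" + System.currentTimeMillis)   // illustrative path
}

// DStream-level writes: one directory per batch, named <prefix>-<batch time>:
dstream.saveAsTextFiles("s3n://my-bucket/output/batch")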
When few waves (1 or 2) are used in a job, LoadApp could finish after a
few failures and retries.
But when more waves (3) are involved in a job, the job would terminate
abnormally.
Can you clarify what you mean by waves? Are you inserting from multiple
programs concurrently?
Note: I am running Spark on Windows 7 in standalone mode.
In my app, I run the following:
DataFrame df = sqlContext.sql("SELECT * FROM tbBER");
System.out.println("Count: " + df.count());
tbBER is registered as a temp table in my SQLContext. When I try to print
the number of rows in
Ignore the question. There was a Hadoop setting that needed to be set to
get it working.
--
Kannan
On Wed, Apr 1, 2015 at 1:37 PM, Kannan Rajah kra...@maprtech.com wrote:
Running a simple word count job in standalone mode as a non root user from
spark-shell. The spark master, worker services
Is it possible tbBER is empty? If so, it shouldn't fail like this, of
course.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
Appears to be a problem with boto. Make sure you have boto 2.34 on your
system.
On Wed, Apr 1, 2015 at 11:19 AM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Hi all – I’m trying to bring up a spark ec2 cluster with the script below
and see the following error. Can anyone please advise as
I'm in the same boat. What are the equivalent commands to stop the master and
workers?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-start-master-and-workers-on-Windows-tp12669p22341.html
Sent from the Apache Spark User List mailing list archive at
The challenge of opening up these internal classes to public (even with
Developer API tag) is that it prevents us from making non-trivial changes
without breaking API compatibility for all those who had subclassed. It's a
tradeoff that is hard to optimize. That's why we favor exposing more
optional
Managed to make sbt assembly work.
I run into another issue now. When I do ./sbin/start-all.sh, the script
fails saying JAVA_HOME is not set although I have explicitly set that
variable to point to the correct Java location. Same happens with
./sbin/start-master.sh script. Any idea what I might