You might store your data in Tachyon.
From: Jahagirdar, Madhu [mailto:madhu.jahagir...@philips.com]
Sent: July 8, 2014 10:16
To: user@spark.apache.org
主题: Spark RDD Disk Persistance
Should I use disk-based persistence for RDDs, and if the machine goes down
during the program execution, next
Hello,
I am a novice. I want to classify text into two classes. For this
purpose I want to use a Naive Bayes model. I am using Python for it.
Here are the problems I am facing:
*Problem 1:* I wanted to use all words as features for the bag-of-words
model, which means my features will be count
The Java API requires a Java class to register as a table.
// Apply a schema to an RDD of JavaBeans and register it as a table.
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");
If instead of JavaRDD<Person> I had JavaRDD<List> (along with the
Hi All,
I am having a few issues with stability and scheduling. When I use spark-shell
to submit my application, I get the following error message and spark-shell
crashes. I have a small 4-node cluster for a PoC. I tried both manual and
script-based cluster setup. I tried using FQDNs as well for
Are you sure your master URL is spark://pzxnvm2018:7077?
You can look it up in the top-left corner of the WebUI (usually
http://pzxnvm2018:8080). Also make sure you are able to telnet to pzxnvm2018 on
port 7077 from the machines where you are running the spark shell.
Thanks
Best Regards
On Tue, Jul 8, 2014
On Tue, Jul 8, 2014 at 4:07 AM, Srikrishna S srikrishna...@gmail.com
wrote:
Hi All,
Does anyone know what the command-line arguments to mvn are to generate
the pre-built binary for Spark on Hadoop 2 / CDH5?
I would like to pull in a recent bug fix in spark-master and rebuild the
binaries in
On Tue, Jul 8, 2014 at 2:01 AM, DB Tsai dbt...@dbtsai.com wrote:
Actually, the mode that requires installing the jar on each individual node is
standalone mode, which works for both MR1 and MR2. Cloudera and
Hortonworks currently support Spark in this way as far as I know.
(CDH5 uses Spark on YARN.)
This is on the roadmap for the next release (1.1)
JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179
On Mon, Jul 7, 2014 at 11:48 PM, Ionized ioni...@gmail.com wrote:
The Java API requires a Java class to register as a table.
// Apply a schema to an RDD of JavaBeans and
Hi all,
I encountered a strange issue when using Spark 1.0 to access HDFS with Kerberos.
I have just one test node for Spark, and HADOOP_CONF_DIR is set to the
location containing the HDFS configuration files (hdfs-site.xml and
core-site.xml).
When I use spark-shell in local mode, the access
Hi Tobias, thanks for your help. I understand that with that code we obtain
a database connection per partition, but I also suspect that with that code
a new database connection is created on each execution of the function
passed to mapPartitions(). That would be very inefficient
I think you can maintain a connection pool or keep the connection as a
long-lived object on the executor side (for example, by lazily creating a
singleton object in Scala), so each task can reuse this connection instead of
creating a new one each time; that would be good for your
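For illustration, here is a minimal Scala sketch of the pattern Jerry describes: a connection that is lazily created once per executor JVM and reused across tasks. The JDBC URL and the dataset are placeholders, not anything from this thread.

import java.sql.{Connection, DriverManager}
import org.apache.spark.{SparkConf, SparkContext}

// Initialized at most once per executor JVM, the first time it is touched.
object ConnectionHolder {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb")
}

object LazyConnectionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-connection-example"))
    val data = sc.parallelize(1 to 1000)

    data.foreachPartition { iter =>
      val conn = ConnectionHolder.connection // reused by every task on this executor
      iter.foreach { i =>
        // ... write `i` using conn; no new connection per record or per task ...
      }
    }
    sc.stop()
  }
}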
Hi Akhil:
Thanks for your response.
Mans
On Thursday, July 3, 2014 9:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Hi Singh!
For this use case it's better to have a StreamingContext listening to the
directory in HDFS where the files are being dropped, and you can set the
Streaming
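A minimal Scala sketch of that setup (the directory path and batch interval are just examples):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dir-watcher")
val ssc = new StreamingContext(conf, Seconds(30))

// Every new file that lands in this directory becomes part of the next batch.
val lines = ssc.textFileStream("hdfs:///incoming/files")
lines.foreachRDD(rdd => println("Got " + rdd.count() + " new lines"))

ssc.start()
ssc.awaitTermination()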
Hi Piotr:
It would be great if we could have an API to support batch updates (counter +
non-counter).
Thanks
Mans
On Monday, July 7, 2014 11:36 AM, Piotr Kołaczkowski pkola...@datastax.com
wrote:
Hi, we're planning to add a basic Java API very soon, possibly this week.
There's a ticket
Hi Jerry, thanks for your answer. I'm using Spark Streaming from Java, and I
have only rudimentary knowledge of Scala. How could I recreate in Java
the lazy creation of a singleton object that you propose for Scala? Maybe a
static class member in Java holding the connection would be the solution?
What's the meaning of a Task's Scheduler Delay in the web ui?
And what could cause that delay? Thanks!
Hi Ankur,
I was trying out the PatternMatcher; it works for smaller paths, but I see that
for the longer ones it continues to run forever...
Here's what I am trying:
https://gist.github.com/hihellobolke/dd2dc0fcebba485975d1 (The example of 3
share traders transacting in appl shares)
The first
Please let me know if the following can be done in Spark:
In terms of MapReduce I need:
1) Map function:
1.1) Get Hive record.
1.2) Create a key from some fields of the record. Register my own key
comparison function with the framework. This function will make a decision
about key equality by
Digging a bit more, I see that there is yet another Jetty instance that
is causing the problem, namely the one the BroadcastManager has. I guess
this one isn't very wise to disable... It might very well be that the
WebUI is a problem as well, but I guess the code doesn't get far
enough. Any ideas on
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it
is only Scala (it doesn't wrap a Java framework). All three have fairly
similar APIs and aren't too different from Spark. For example, instead of
RDD you have DList (distributed list) or PCollection (parallel collection)
-
I don't have those numbers off-hand. Though the shuffle spill to disk was
coming to several gigabytes per node, if I recall correctly.
The MapReduce pipeline takes about 2-3 hours I think for the full 60 day
data set. Spark chugs along fine for a while and then hangs. We restructured
the flow a
Hi,
I wanted to use Naive Bayes for a text classification problem. I am using
Spark 0.9.1.
I was just curious to ask: is the Naive Bayes implementation in Spark
0.9.1 correct? Or are there any bugs in the Spark 0.9.1 implementation
which are taken care of in Spark 1.0? My question is specific
Hi,
I am using the MLlib Naive Bayes for a text classification problem. I have
a very small amount of training data. Then the data will be coming in
continuously and I need to classify it as either A or B. I am training the
MLlib Naive Bayes model using the training data, but the next time when data
Hi guys,
when I try to compile the latest source with sbt/sbt compile, I get an error.
Can anyone help me?
The details follow; it may be caused by TestSQLContext.scala
[error]
[error] while compiling:
/disk3/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
Do you control your cluster and Spark deployment? If so, you can try to
rebuild with Jetty 9.x.
On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter
martingammelsae...@gmail.com wrote:
Digging a bit more I see that there is yet another jetty instance that
is causing the problem, namely the
When you say "large data sets", how large?
Thanks
On 07/07/2014 01:39 PM, Daniel Siegmann
wrote:
From a development perspective, I vastly prefer Spark to
MapReduce. The MapReduce API is very constrained; Spark's
Hi,
I am a postgraduate student, new to Spark. I want to understand how the
Spark scheduler works. I just have a theoretical understanding of the DAG
scheduler and the underlying task scheduler.
I want to know, given a job submitted to the framework, after the DAG scheduler
phase, how does the scheduling happen?
Hi all,
I faced the following exception during the map step:
java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
exceeded)
java.lang.reflect.Array.newInstance(Array.java:70)
We're doing a similar thing to launch Spark jobs in Tomcat, and I opened a
JIRA for this. There are a couple of technical discussions there.
https://issues.apache.org/jira/browse/SPARK-2100
In the end, we realized that Spark uses Jetty not only for the Spark
WebUI, but also for distributing the jars and
I'll respond for Dan.
Our test dataset was a total of 10 GB of input data (full production
dataset for this particular dataflow would be 60 GB roughly).
I'm not sure what the size of the final output data was but I think it was
on the order of 20 GBs for the given 10 GB of input data. Also, I
We are looking to add a note about Talend Open Studio's support for Spark
components to the page at:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Name: Talend Open Studio
URL: http://www.talendforge.org/exchange/
Description:
Talend Labs are building open source tooling
To be honest I'm a Scala newbie too. I just copied it from createStream.
I assume it's the canonical way to convert a Java map (JMap) to a Scala
map (Map).
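For reference, a small Scala sketch of that conversion (the map contents are just an example):

import java.util.{HashMap => JHashMap, Map => JMap}
import scala.collection.JavaConverters._

val javaMap: JMap[String, String] = new JHashMap[String, String]()
javaMap.put("zookeeper.connect", "localhost:2181")

// asScala wraps the Java map; toMap copies it into an immutable Scala Map.
val scalaMap: Map[String, String] = javaMap.asScala.toMap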
On Mon, Jul 7, 2014 at 1:40 PM, mcampbell michael.campb...@gmail.com
wrote:
xtrahotsauce wrote
I had this same problem as well. I
Hi all,
sorry for the silly question, but how can I get a PairRDDFunctions RDD? I need
it to perform a leftOuterJoin afterwards.
Currently I do it this way (it seems incorrect):
val parRDD = new PairRDDFunctions( oldRdd.map(i => (i.key, i)) )
I guess this constructor is definitely wrong...
Thank you,
If your RDD contains pairs, like an RDD[(String,Integer)] or something,
then you get to use the functions in PairRDDFunctions as if they were
declared on RDD.
On Tue, Jul 8, 2014 at 6:25 PM, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com wrote:
Hi all,
sorry for the silly question, but
See Working with Key-Value Pairs
http://spark.apache.org/docs/latest/programming-guide.html. In
particular: In Scala, these operations are automatically available on RDDs
containing Tuple2 objects (the built-in tuples in the language, created by
simply writing (a, b)), as long as you import
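A minimal Scala sketch of what that looks like in practice (the sample data is made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // brings the PairRDDFunctions implicits into scope

val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-example"))
val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x")))

// No explicit PairRDDFunctions construction needed; the implicit conversion applies.
val joined = left.leftOuterJoin(right)  // RDD[(String, (Int, Option[String]))]
joined.collect().foreach(println)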
I believe our full 60 days of data contains over ten million unique
entities. Across 10 days I'm not sure, but it should be in the millions. I
haven't verified that myself though. So that's the scale of the RDD we're
writing to disk (each entry is entityId -> profile).
I think it's hard to know
There is a difference between actual GC overhead, which can be reduced by
reusing objects, and this error, which actually means you ran out of
memory. This error can probably be relieved by increasing your executor
heap size, unless your data is corrupt and it is allocating huge arrays, or
you are
Hi Konstantin,
I just ran into the same problem. I mitigated the issue by reducing the
number of cores when I executed the job, which otherwise wouldn't be able to
finish.
Unlike what many people believe, it might not mean that you were running out
of memory. A better answer can be found here:
It seems to me that you're not taking full advantage of the lazy
evaluation, especially persisting to disk only. While it might be
true that the cumulative size of the RDDs looks like it's 300GB,
only a small portion of that should be resident at any one time.
We've
This seems almost equivalent to a heap size error -- since GCs are
stop-the-world events, the fact that we were unable to release more than 2%
of the heap suggests that almost all the memory is *currently in use *(i.e.,
live).
Decreasing the number of cores is another solution which decreases
Hi all,
I am working on a pipeline that needs to join two Spark streams. The input
is a stream of integers, and the output is the number of times each integer
appears divided by the total number of unique integers. Suppose the
input is:
1
2
3
1
2
2
There are 3 unique integers and 1 appears twice.
To clarify, we are not persisting to disk. That was just one of the
experiments we did because of some issues we had along the way.
At this time, we are NOT using persist but cannot get the flow to complete
in Standalone Cluster mode. We do not have a YARN-capable cluster at this
time.
We agree
Hi All,
I used IP addresses in my scripts (spark-env.sh), and the slaves file contains
the IP addresses of the master and slave nodes respectively. However, I still
have no luck. Here is the relevant log file snippet:
Master node log:
14/07/08 10:56:19 ERROR EndpointWriter: AssociationError
Hi Tobias,
Thanks for the suggestion. I have tried to add more nodes from 300 to 400.
It seems the running time did not improve.
On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
can't you just add more nodes in order to speed up the processing?
Tobias
Nothing particularly custom. We've tested with small (4 node)
development clusters, single-node pseudoclusters, and bigger, using
plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark
master, Spark local, Spark Yarn (client and cluster) modes, with
total
How wide are the rows of data, either the raw input data or any generated
intermediate data?
We are at a loss as to why our flow doesn't complete. We banged our heads
against it for a few weeks.
-Suren
On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey kevin.mar...@oracle.com
wrote:
Nothing
Also, our exact same flow but with 1 GB of input data completed fine.
-Suren
On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
How wide are the rows of data, either the raw input data or any generated
intermediate data?
We are at a loss as to why our flow
Dear All,
When I look inside the following directory on my worker node:
$SPARK_HOME/work/app-20140708110707-0001/3
I see the following error message:
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system
Hmm, looks like the Executor is trying to connect to the driver on
localhost, from this line:
14/07/08 11:07:13 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@localhost:39701/user/CoarseGrainedScheduler
What is your setup? Standalone mode with 4 separate machines? Are
Hi All,
I tried the make distribution script and it worked well. I was able to
compile the spark binary on our CDH5 cluster. Once I compiled Spark, I
copied over the binaries in the dist folder to all the other machines
in the cluster.
However, I ran into an issue while submitting a job in
Someone might be able to correct me if I'm wrong, but I don't believe
standalone mode supports kerberos. You'd have to use Yarn for that.
On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote:
Hi all,
I encounter a strange issue when using spark 1.0 to access hdfs with
Kerberos
I
Hi,
I am getting this error. Can anyone help explain why this error is
occurring?
Exception in thread delete Spark temp dir
C:\Users\shawn\AppData\Local\Temp\spark-27f60467-36d4-4081-aaf5-d0ad42dda560
java.io.IOException: Failed to delete:
This is generally a side effect of your executor being killed. For
example, Yarn will do that if you're going over the requested memory
limits.
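If that is the cause, the usual remedy is to ask for more memory up front. A Scala sketch of doing so via the configuration (the value is just an example; this is the same property that spark-submit's --executor-memory flag sets):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-example")
  .set("spark.executor.memory", "4g")  // request 4 GB per executor

val sc = new SparkContext(conf)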
On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
HI,
I am getting this error. Can anyone help out to explain why is
Hi Marcelo.
Thanks for the quick reply. Can you suggest how to increase the memory
limits or how to tackle this problem? I am a novice. If you want, I can post
my code here.
Thanks
On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin van...@cloudera.com wrote:
This is generally a side effect of
Hi Aaron,
I have 4 nodes - 1 master and 3 workers. I am not setting the driver public DNS
name anywhere. I didn't see that step in the documentation -- maybe I missed
it. Can you please point me in the right direction?
Sent via the Samsung GALAXY S®4, an ATT 4G LTE smartphone
Note I didn't say that was your problem - it would be if (i) you're
running your job on Yarn and (ii) you look at the Yarn NodeManager
logs and see that it's actually killing your process.
I just said that the exception shows up in those kinds of situations.
You haven't provided enough
We kind of hijacked Santos' original thread, so apologies for that and let
me try to get back to Santos' original question on Map/Reduce versus Spark.
I would say it's worth migrating from M/R, with the following thoughts.
Just my opinion but I would summarize the latest emails in this thread as
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Libraries like Scoobi, Scrunch and Scalding (and their associated Java
versions) provide a Spark-like wrapper around Map/Reduce but my guess is
that, since they are limited to Map/Reduce under the covers, they
Here I am adding my code, in case you can have a look and help me out.
Thanks
###
import tokenizer
import gettingWordLists as gl
from pyspark.mllib.classification import NaiveBayes
from numpy import array
from pyspark import SparkContext, SparkConf
conf =
I have pasted the logs below:
PS F:\spark-0.9.1\codes\sentiment analysis pyspark
.\naive_bayes_analyser.py
Running python with PYTHONPATH=F:\spark-0.9.1\spark-0.9.1\bin\..\python;
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
This is a good start:
http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html
On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
I am a post graduate student, new to spark. I want to understand how
Spark scheduler works. I just have theoretical
Hi there!
1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet
file) back to an RDD of a given case class?
2/ Even better, is there a way to get the schema information from a
SchemaRDD ? I am trying to figure out how to properly get the various fields
of the Rows of a
Not sure exactly what is happening but perhaps there are ways to
restructure your program for it to work better. Spark is definitely able to
handle much, much larger workloads.
I've personally run a workload that shuffled 300 TB of data. I've also run
something that shuffled 5 TB/node and stuffed
Here's the most updated version of the same page:
http://spark.apache.org/docs/latest/job-scheduling
2014-07-08 12:44 GMT-07:00 Sujeet Varakhedi svarakh...@gopivotal.com:
This is a good start:
http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html
On Tue, Jul 8, 2014 at 9:11
I added you to the list. Cheers.
On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio adgau...@gmail.com wrote:
Hi,
Sailthru is also using Spark. Could you please add us to the Powered By
Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
page
when you have a chance?
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open a JIRA for Naive
Bayes. For Naive Bayes, we need to update the priors and conditional
probabilities, which means we should also remember the number of
observations for
Thanks for the heads-up.
In the meantime, we'd like to test this out ASAP - are there any open PR's
we could take to try it out? (or do you have an estimate on when some will
be available?)
On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust mich...@databricks.com
wrote:
This is on the roadmap
Well, I believe this is a correct implementation but please let us
know if you run into problems. The NaiveBayes implementation in MLlib
v1.0 supports sparse data, which is usually the case for text
classification. I would recommend upgrading to v1.0. -Xiangrui
On Tue, Jul 8, 2014 at 7:20 AM,
try sbt/sbt clean first
On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 smallmonkey...@hotmail.com wrote:
Hi guys,
when i try to compile the latest source by sbt/sbt compile, I got an error.
Can any one help me?
The following is the detail: it may cause by TestSQLContext.scala
[error]
[error]
Yin (cc-ed) is working on it as we speak. We'll post to the JIRA as soon
as a PR is up.
On Tue, Jul 8, 2014 at 1:04 PM, Ionized ioni...@gmail.com wrote:
Thanks for the heads-up.
In the meantime, we'd like to test this out ASAP - are there any open PR's
we could take to try it out? (or do
I think we're missing the point a bit. Everything was actually flowing
through smoothly and in a reasonable time. Until it reached the last two
tasks (out of over a thousand in the final stage alone), at which point it
just fell into a coma. Not so much as a cranky message in the logs.
I don't
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
1/ Is there a way to convert a SchemaRDD (for instance loaded from a
parquet
file) back to a RDD of a given case class?
There may be someday, but doing so will either require a lot of reflection
or a
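In the meantime, a manual Scala sketch of going from a SchemaRDD's Rows back to a case class works, assuming you know the column order and types (the file name, fields, and the existing SparkContext sc are assumptions for illustration):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
val schemaRDD = sqlContext.parquetFile("people.parquet")

// Pull each field out of the Row by position and rebuild the case class.
val people = schemaRDD.map(row => Person(row.getString(0), row.getInt(1)))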
Not sure exactly what is happening but perhaps there are ways to
restructure your program for it to work better. Spark is definitely able to
handle much, much larger workloads.
+1 @Reynold
Spark can handle big big data. There are known issues with informing the
user about what went wrong
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to handle the
word-to-index dictionary by yourself. You can either ignore the words
that never appear in training (because they have no effect in
prediction), or use hashing to randomly
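A minimal Scala sketch of the hashing option with MLlib (the feature dimension and the labeledTexts RDD of (label, document) pairs are assumptions for illustration):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val numFeatures = 10000

// Hash each word into a fixed-size count vector so the dimension never changes.
def featurize(text: String): Vector = {
  val counts = new Array[Double](numFeatures)
  text.toLowerCase.split("\\s+").foreach { word =>
    val idx = ((word.hashCode % numFeatures) + numFeatures) % numFeatures
    counts(idx) += 1.0
  }
  Vectors.dense(counts)
}

val training = labeledTexts.map { case (label, text) => LabeledPoint(label, featurize(text)) }
val model = NaiveBayes.train(training)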
You can either use sc.wholeTextFiles and then a flatMap to reduce the
number of partitions, or give more memory to the driver process by
using --driver-memory 20g and then call RDD.repartition(small number)
after you load the data in. -Xiangrui
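Both options in a short Scala sketch (paths and partition counts are placeholders):

// Option 1: one record per file, then flatMap the contents; wholeTextFiles
// packs many small files into relatively few partitions.
val files = sc.wholeTextFiles("hdfs:///data/many-small-files")
val lines = files.flatMap { case (path, content) => content.split("\n") }

// Option 2: load normally, then immediately shrink the partition count.
val raw = sc.textFile("hdfs:///data/many-small-files/*")
val compact = raw.repartition(100)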
On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun
Cool Thanks Michael!
Message sent from a mobile device - excuse typos and abbreviations
On 8 Jul 2014, at 22:17, Michael Armbrust [via Apache Spark User List]
ml-node+s1001560n9084...@n3.nabble.com wrote:
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B [hidden email] wrote:
1/ Is there a way
Thanks a lot Xiangrui. This will help.
On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com wrote:
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open a JIRA for Naive
Bayes. For Naive Bayes, we need to
Hi Rahul,
Can you try calling sc.close() at the end of your program, so Spark
can clean up after itself?
On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
Here I am adding my code. If you can have a look to help me out.
Thanks
###
import
Thanks a lot Xiangrui for the help.
On Wed, Jul 9, 2014 at 1:39 AM, Xiangrui Meng men...@gmail.com wrote:
Well, I believe this is a correct implementation but please let us
know if you run into problems. The NaiveBayes implementation in MLlib
v1.0 supports sparse data, which is usually the
Thanks Xiangrui. You have solved almost all my problems :)
On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng men...@gmail.com wrote:
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to handle the
word-to-index dictionary by yourself.
Sorry, that would be sc.stop() (not close).
On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Rahul,
Can you try calling sc.close() at the end of your program, so Spark
can clean up after itself?
On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani
Aaron,
I don't think anyone was saying Spark can't handle this data size, given
testimony from the Spark team, Bizo, etc., on large datasets. This has kept
us trying different things to get our flow to work over the course of
several weeks.
Agreed that the first instinct should be what did I do
Thanks Marcelo.
I was having another problem. My code was running properly and then it
suddenly stopped with the error:
java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(Unknown Source)
at
Have you tried the obvious (increase the heap size of your JVM)?
On Tue, Jul 8, 2014 at 2:02 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
Thanks Marcelo.
I was having another problem. My code was running properly and then it
suddenly stopped with the error:
Hi Aaron, I would really appreciate your help if you can point me to the
documentation. Is this something that I need to do with /etc/hosts on each of
the worker machines? Or do I set SPARK_PUBLIC_DNS (if yes, what is the
format?) or something else?
I have the following set up:
master node:
As a new user, I can definitely say that my experience with Spark has
been rather raw. The appeal of interactive, batch, and in between all
using more or less straight Scala is unarguable. But the experience
of deploying Spark has been quite painful, mainly because of gaps between
compile time and
Build: Spark 1.0.0 rc11 (git commit tag:
2f1dc868e5714882cf40d2633fb66772baf34789)
Hi All,
When I enabled the event log in spark-defaults.conf, spark-shell broke
while spark-submit works.
I'm trying to create a separate directory per user to keep track of their own
Spark job event
It seems that your driver (which I'm assuming you launched on the master
node) can now connect to the Master, but your executors cannot. Did you
make sure that all nodes have the same conf/spark-defaults.conf,
conf/spark-env.sh, and conf/slaves? It would be good if you can post the
stderr of the
Hello Mayur,
How can I implement the methods mentioned below? If you have any clue on
this, please let me know.
public void onJobStart(SparkListenerJobStart arg0) {
}
@Override
public void onStageCompleted(SparkListenerStageCompleted arg0) {
}
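A minimal Scala sketch of a listener with those callbacks filled in (the log messages are just examples; the Java version overrides the same methods):

import org.apache.spark.scheduler._

class MyListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart) {
    println("Job " + jobStart.jobId + " started")
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    println("Stage " + stageCompleted.stageInfo.name + " completed")
  }
}

// Register it on the SparkContext so the callbacks actually fire.
sc.addSparkListener(new MyListener)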
Hi all,
I used sbt to package code that uses spark-streaming-kafka. The packaging
succeeded. However, when I submitted to yarn, the job ran for 10 seconds
and there was an error in the log file as follows:
Caused by: java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
Hi! I've been using Spark compiled from the 1.0 branch at some point (~2 months
ago). The setup is a standalone cluster with 4 worker machines and 1 master
machine. I used to run spark shell like this:
./bin/spark-shell -c 30 -em 20g -dm 10g
Today I've finally updated to Spark 1.0 release. Now I
Hi Mikhail,
It looks like the documentation is a little outdated. Neither is true
anymore. In general, we try to shift away from short options (-em, -dm
etc.) in favor of more explicit ones (--executor-memory,
--driver-memory). These options, and --cores, refer to the arguments
passed in to
More updates:
It seems that in TachyonBlockManager.scala (line 118) of Spark 1.1.0, the
TachyonFS.mkdir() method is called, which creates a directory in Tachyon.
Right after that, TachyonFS.getFile() method is called. In all the versions
of Tachyon I tried (0.4.1, 0.4.0), the second method will return a
Thanks Andrew,
./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30
--executor-memory 20g --driver-memory 10g
works well, just wanted to make sure that I'm not missing anything
Bill,
have you packaged org.apache.spark % spark-streaming-kafka_2.10 %
1.0.0 into your application jar? If I remember correctly, it's not
bundled with the downloadable compiled version of Spark.
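If you're using sbt, a sketch of what the dependency declaration might look like in build.sbt (versions are examples; a fat jar built with something like sbt-assembly then bundles the Kafka connector while leaving Spark itself to the cluster):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"
)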
Tobias
On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi all,
I
Bill,
do the additional 100 nodes receive any tasks at all? (I don't know which
cluster you use, but with Mesos you could check client logs in the web
interface.) You might want to try something like repartition(N) or
repartition(N*2) (with N the number of your nodes) after you receive your
data.
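As a Scala sketch (numNodes is a placeholder for your actual node count, and kafkaStream is whatever DStream you created from Kafka):

val numNodes = 400
val repartitioned = kafkaStream.repartition(numNodes * 2)  // spread work across the cluster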
Hi Tobias,
Currently, I do not bundle any dependencies into my application jar. I
will try that. Thanks a lot!
Bill
On Tue, Jul 8, 2014 at 5:22 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
have you packaged org.apache.spark % spark-streaming-kafka_2.10 %
1.0.0 into your
Hi all,
I am trying to run the NetworkWordCount.java file in the streaming examples.
The example shows how to read from a network socket. But my use case is that
I have a local log file which is a stream and is continuously updated (say
/Users/.../Desktop/mylog.log).
I would like to write the same
Santosh,
To add a bit more to what Nabeel said, Spark and Impala are very different
tools. Impala is *not* built on map/reduce, though it was built to replace
Hive, which is map/reduce based. It has its own distributed query engine,
though it does load data from HDFS, and is part of the hadoop
Yes, the Java equivalent would be to use a static class member, but you
should program carefully to prevent resource leakage. A good choice is to use a
third-party DB connection library which supports connection pooling; that will
reduce your programming effort.
Thanks
Jerry
From: Juan
Hi all,
I am a newbie to Spark Streaming, and used Storm before. Have you tested the
performance of both of them, and which one is better?
xichen_tju@126