Thanks for your answer. Yes, I cached the data; I can observe from the
WebUI that all the data is cached in memory.
What worries me is the dimension, not the total size.
Sean Owen once answered me that the maximum array size a broadcast
supports is 2GB, so is 10^7 a little too large?
On
Hi, thanks for the response.
I discovered that my problem was that some of the executors got OOM; tracing
through the executors' logs helped uncover the problem. Usually the log
from the driver does not reflect the OOM error, which causes
confusion among users.
These are just the discoveries on
Thanks Imran. That's exactly what I needed to know.
Darin.
From: Imran Rashid iras...@cloudera.com
To: Darin McBeath ddmcbe...@yahoo.com
Cc: User user@spark.apache.org
Sent: Tuesday, February 17, 2015 8:35 PM
Subject: Re: How do you get the partitioner for
Hi, Sparkers:
I just happened to be searching Google for something related to Spark's
RangePartitioner, and found an old thread on this mailing list here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
I followed the code example mentioned in that email thread
Hi,
I am running a Spark batch processing job using the spark-submit command, and
below is my code snippet. Basically it converts a JsonRDD to Parquet and
stores it in an HDFS location.
The problem I am facing is that if multiple jobs are triggered in parallel,
even though each job executes properly (as I can see
Hi,
there is a small Spark Meetup group in Berlin, Germany :-)
http://www.meetup.com/Berlin-Apache-Spark-Meetup/
Please add this group to the Meetups list at
https://spark.apache.org/community.html
Ralph
Thanks! I've added you.
Matei
On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu
ra...@the4thfloor.eu wrote:
Hi,
there is a small Spark Meetup group in Berlin, Germany :-)
http://www.meetup.com/Berlin-Apache-Spark-Meetup/
Please add this group to the Meetups list at
Hi Darrin,
You are asking for something near and dear to me:
https://issues.apache.org/jira/browse/SPARK-1061
There is a PR attached there as well. Note that you could do everything in
that PR in your own user code, you don't need to wait for it to get merged,
*except* for the change to HadoopRDD
a JavaRDD is just a wrapper around a normal RDD defined in Scala, which is
stored in the rdd field. You can access everything that way. The
JavaRDD wrappers just provide some interfaces that are a bit easier to work
with in Java.
If this is at all convincing, here's me demonstrating it inside
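Along those lines, a minimal Scala sketch of unwrapping a JavaRDD (variable and method names here are illustrative, not the original demo):

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

// Unwrap the Java-friendly wrapper to reach the full Scala RDD API.
def underlying[T](javaRdd: JavaRDD[T]): RDD[T] = javaRdd.rdd

// e.g. underlying(myJavaRdd).partitioner gives the Option[Partitioner]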
Hello folks,
Our intended use case is:
- Spark Streaming app #1 reads from RabbitMQ and output to HDFS
- Spark Streaming app #2 reads #1's output and stores the data into
Elasticsearch
The idea behind this architecture is that if Elasticsearch is down due to an
upgrade or
Hi,
I want to use spark-core inside an HttpServlet. I use Maven for the
build task, but I have a dependency problem :-(
I get this error message:
ClassCastException:
com.sun.jersey.server.impl.container.servlet.JerseyServletContainerInitializer
cannot be cast to
I am not sure if this is the easiest way to solve your problem, but you can
connect to the Hive metastore (through Derby) and find the HDFS path from
there.
On Wed, Feb 18, 2015 at 9:31 AM, Vasu C vasuc.bigd...@gmail.com wrote:
Hi,
I am running spark batch processing job using spark-submit
Thanks for the reply, I'll try your suggestions.
Apologies, in my previous post I was mistaken. rdd is actually a PairRDD
of (Int, Int). I'm doing the self-join so I can count two things. First, I
can count the number of times a value appears in the data set. Second, I can
count the number of times
I'm getting the error below when running spark-submit on my class. This class
has a transitive dependency on HttpClient v4.3.1, since I'm calling SolrJ
4.10.3 from within the class.
This conflicts with the older version, HttpClient 3.1, that's a
dependency of Hadoop 2.4 (I'm running Spark
Hi All,
I'm a new Spark (and Hadoop) user and I want to find out whether the cluster
resources I am using are feasible for my use case. The following is a
snippet of code that is causing an OOM exception in the executor after about
125/1000 tasks during the map stage.
val rdd2 = rdd.join(rdd,
Thanks Imran for the very detailed explanations and options. I think for now
T-Digest is what I want.
From: iras...@cloudera.com
Date: Tue, 17 Feb 2015 08:39:48 -0600
Subject: Re: Percentile example
To: myl...@hotmail.com
CC: user@spark.apache.org
(trying to repost to the list w/out URLs --
Why are you joining the rdd with itself?
You can try these things:
- Change the StorageLevel of both rdds to MEMORY_AND_DISK_2 or
MEMORY_AND_DISK_SER, so that it doesn't need to keep everything in memory.
- Set your default Serializer to Kryo (.set("spark.serializer",
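A minimal sketch of both suggestions (app name and sample data are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("self-join-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq((1, 1), (1, 2), (2, 3)))
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk rather than recompute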
I am not sure if this could be causing the issue, but Spark is compatible
with Scala 2.10.
Instead of spark-core_2.11 you might want to try spark-core_2.10.
On Wed, Feb 18, 2015 at 5:44 AM, Ralph Bergmann | the4thFloor.eu
ra...@the4thfloor.eu wrote:
Hi,
I want to use spark-core inside of a
Thanks Kohler, that's a very interesting approach. I have never used Spark SQL
and I'm not sure whether my cluster is configured well for it, but I will
definitely give it a try.
From: c.koh...@elsevier.com
To: myl...@hotmail.com; user@spark.apache.org
Subject: Re: Percentile example
Date: Tue, 17 Feb 2015
Hi Sasi,
To pass parameters to spark-jobserver, use curl -d "input.string = a b c
a b see" and in the job server class use config.getString("input.string").
You can pass multiple parameters like starttime, endtime, etc. and use
config.getString() to get the values.
The examples are shown here
RangePartitioner does not actually guarantee that all partitions
will be equally sized (that is hard); instead it uses sampling to
approximate equal buckets. Thus, it is possible that a bucket is left empty.
If you want the specified behavior, you should define your own partitioner.
It
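For reference, a bare-bones custom partitioner might look like this (a sketch assuming non-negative Int keys with a known upper bound):

import org.apache.spark.Partitioner

class FixedRangePartitioner(parts: Int, maxKey: Int) extends Partitioner {
  private val width = math.max(1, (maxKey + parts - 1) / parts) // keys per partition
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    math.min(key.asInstanceOf[Int] / width, parts - 1)
}

// usage: pairRdd.partitionBy(new FixedRangePartitioner(8, 1000000))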
Hi
Did you try to make Maven pick the latest version?
http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management
That way SolrJ won't cause any issues. You can try this and check whether the
part of your code where you access HDFS works fine.
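For example, a pom.xml fragment pinning the newer client (coordinates assumed for HttpClient 4.3.1):

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>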
On Wed,
It would be good if you could share the piece of code you are using, so
people can suggest how to optimize it further and things like that.
Also, since you have 20GB of memory and ~30GB of data, you can try
doing rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
or
Emre,
As you are keeping the properties file external to the JAR, you need to make
sure to submit the properties file via an additional --files argument (or
whatever the necessary CLI switch is) so all the executors get a copy of the
file along with the JAR, as in the sketch below.
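For example (a hedged sketch; jar and class names are made up, --files is the actual spark-submit switch):

spark-submit --class com.example.MyApp \
  --files /home/emre/data/myModule.properties \
  myModule.jar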
If you know you are going to just put the
(trying to repost to the list w/out URLs -- rejected as spam earlier)
Hi,
Using take() is not a good idea; as you have noted, it will pull a lot of
data down to the driver, so it's not scalable. Here are some more scalable
alternatives:
1. Approximate solutions
1a. Sample the data. Just sample
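For instance (data is the input RDD; fraction and seed arbitrary):

// Work on a 1% sample instead of the full data set.
val sample = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)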
Hi,
I am working on a Spark application that processes graphs and I am trying
to do the following.
- group the vertices (key - vertex, value - set of its outgoing edges)
- distribute each key to separate processes and process them (like mapper)
- reduce the results back at the main process
Does
More tasks means a little *more* total CPU time is required, not less,
because of the overhead of handling tasks. However, more tasks can
actually mean less wall-clock time.
This is because tasks vary in how long they take. If you have 1 task
per core, the job takes as long as the slowest task
Hi Todd,
When I see /data/json appear in your log, I have a feeling that it comes from
the default hive.metastore.warehouse.dir in hive-site.xml, where the value is
/data/. Could you check that property and see if you can point it to the
correct Hive table HDFS directory?
Another thing to look
Hi All,
Just want to give everyone an update on what worked for me. Thanks for Cheng's
comment and other people's help.
What I misunderstood was --driver-class-path and how it relates to
--files. I put /etc/hive/hive-site.xml in both --files and
--driver-class-path when I
+1 for TypeSafe config
Our practice is to include all spark properties under a 'spark' entry in
the config file alongside job-specific configuration:
A config file would look like:
spark {
  master =
  cleaner.ttl = 123456
  ...
}
job {
  context {
    src = foo
    action =
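Reading that file with Typesafe Config is then straightforward (a sketch; key names follow the example above):

import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load() // picks up application.conf from the classpath
val master = config.getString("spark.master")
val ttl    = config.getLong("spark.cleaner.ttl")
val src    = config.getString("job.context.src")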
I ended up with the following:
def firstValue(items: Iterable[String]) = for { i <- items
} yield (i, items.head)
data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect
More details:
http://dmtolpeko.com/2015/02/17/first_value-last_value-lead-and-lag-in-spark/
I would appreciate any
Hi,
I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M
sparse matrix with 800M entries. With 200M entries the job runs fine, but
with 800M I am getting exceptions like "too many files open" and "no space
left on device"...
Seems like I need more nodes, or should I use DIMSUM sampling?
Hi Emre,
there shouldn't be any difference in which files get processed w/ print()
vs. foreachRDD(). In fact, if you look at the definition of print(), it is
just calling foreachRDD() underneath. So there is something else going on
here.
We need a little more information to figure out exactly
In the following code, I read in a large sequence file from S3 (1TB) spread
across 1024 partitions. When I look at the job/stage summary, I see about
400GB of shuffle writes which seems to make sense as I'm doing a hash partition
on this file.
// Get the baseline input file
Thanks for the pointer. It led me to
http://spark.apache.org/docs/1.2.0/tuning.html; increasing parallelism
resolved the issue.
On Mon, Feb 16, 2015 at 11:57 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Increase your executor memory. Also, you can play around with increasing
the number of
Hi Darin,
When you say you see 400GB of shuffle writes from the first code snippet,
what do you mean? There is no action in that first set, so it won't do
anything. By itself, it won't do any shuffle writing, or anything else for
that matter.
Most likely, the .count() on your second code
The best approach I've found to calculate percentiles in Spark is to leverage
Spark SQL. If you use the Hive Query Language support, you can use the UDAFs
for percentiles (as of Spark 1.2).
Something like this (note: syntax not guaranteed to run, but it should give you
the gist of what you need
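In that spirit, a hedged sketch (table and column names are made up; percentile_approx is a standard Hive UDAF available through HiveContext):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// 95th percentile of a numeric column, computed by Hive's UDAF.
val p95 = hiveContext.sql("SELECT percentile_approx(latency, 0.95) FROM events")
p95.collect().foreach(println)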
Hi,
I am running a streaming word count program on a Spark Standalone mode cluster,
having four machines in the cluster.
public final class JavaKafkaStreamingWordCount {
private static final Pattern SPACE = Pattern.compile(" ");
static transient Configuration conf;
private
The raw data is ~30 GB. It consists of 250 million sentences. The total
length of the documents (i.e., the sum of the lengths of all sentences) is 11
billion. I also ran a simple algorithm to roughly count the maximum number
of word pairs by summing d * (d - 1) over all sentences, where d is
Hi, I've seen a few articles where they use CqlStorageHandler to create hive tables
referencing Cassandra data using the thriftserver. Is there a secret to getting
this to work? I've basically got Spark built with Hive, and a Cassandra
cluster. Is there a way to get the hive server to talk to
Thanks Imran.
I think you are probably correct. I was a bit surprised that there was no
shuffle read in the initial hash partition step. I will adjust the code as you
suggest to prove that is the case.
I have a slightly different question. If I save an RDD to S3 (or some
equivalent) and this
The best step size depends on the condition number of the problem. You
can try some conditioning heuristics first, e.g., normalizing the
columns, and then try a common step size like 0.01. We should
implement line search for linear regression in the future, as in
LogisticRegressionWithLBFGS. Line
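For example, with MLlib's SGD-based linear regression (trainingData assumed to be an RDD[LabeledPoint], columns already normalized):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// numIterations = 100, stepSize = 0.01 as a starting point
val model = LinearRegressionWithSGD.train(trainingData, 100, 0.01)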
Did you cache the data? Was it fully cached? The k-means
implementation doesn't create many temporary objects. I guess you need
more RAM to avoid GC being triggered frequently. Please monitor the memory
usage using YourKit or VisualVM. -Xiangrui
On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com
Hey Sandy,
The work should be done by a VectorAssembler, which combines multiple
columns (double/int/vector) into a vector column, which becomes the
features column for regression. We are going to create JIRAs for each
of these standard feature transformers. It would be great if you can
help
JavaDStream.foreachRDD
(https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function))
and Statistics.corr
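Put together, that might look like this in Scala (a sketch; pairStream is assumed to be a DStream of (Double, Double)):

import org.apache.spark.mllib.stat.Statistics

pairStream.foreachRDD { rdd =>
  val xs = rdd.map(_._1)
  val ys = rdd.map(_._2)
  if (rdd.count() > 1) println(Statistics.corr(xs, ys, "pearson"))
}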
The complexity of DIMSUM is independent of the number of rows but
still has a quadratic dependency on the number of columns. 1.5M columns
may be too large to use DIMSUM. Try increasing the threshold and see
whether it helps. -Xiangrui
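Raising the threshold looks like this (threshold value arbitrary; rows assumed to be an RDD[Vector]):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
// Higher threshold means more aggressive sampling and less shuffle data.
val similarities = mat.columnSimilarities(0.5)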
On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das
Thanks! I added Radius to
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
-Xiangrui
On Tue, Feb 10, 2015 at 12:02 AM, Alexis Roos alexis.r...@gmail.com wrote:
Also long overdue, given our usage of Spark ..
Radius Intelligence:
URL: radius.com
Description:
Spark, MLLib
Using
If there exists a sample that doesn't belong to A/B/C, it means
that there exists another class D or Unknown besides A/B/C. You should
have some of these samples in the training set in order to let naive
Bayes learn the priors. -Xiangrui
On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet
Could you share the error log? What do you mean by 500 instead of
200? If this is the number of files, try to use `repartition` before
calling naive Bayes, which works best when the number of
partitions matches the number of cores, or is even smaller. -Xiangrui
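For example (a sketch; training assumed to be an RDD[LabeledPoint] and numCores to match your cluster):

import org.apache.spark.mllib.classification.NaiveBayes

val model = NaiveBayes.train(training.repartition(numCores), lambda = 1.0)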
On Tue, Feb 10, 2015 at 10:34 PM,
In an 'early release' of the Learning Spark book, there is the following
reference:
In Scala and Java, you can determine how an RDD is partitioned using its
partitioner property (or partitioner() method in Java)
However, I don't see the mentioned 'partitioner()' method in Spark 1.2 or a way
It may be caused by a GC pause. Did you check the GC time in the Spark
UI? -Xiangrui
On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote:
Hi,
I am sometimes getting WARN from running Similarity calculation:
15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing
Hi Kannan,
I am not sure I have understood what your question is exactly, but maybe the
reduceByKey or reduceByKeyLocally functionality better suits your needs.
Best,
Yifan LI
On 17 Feb 2015, at 17:37, Vijayasarathy Kannan kvi...@vt.edu wrote:
Hi,
I am working on a Spark
I can use nscala-time with Scala, but my issue is that I can't use it within
the spark-shell console! It gives me the error below.
Thanks
From: kevin...@apache.org
Date: Tue, 17 Feb 2015 08:50:04 +
Subject: Re: Use of nscala-time within spark-shell
To: hscha...@hotmail.com; kevin...@apache.org;
Thanks Kevin for your reply,
I downloaded the pre-built version and, as you said, the default Spark Scala
version is 2.10. I'm now building Spark 1.2.1 with Scala 2.11. I'll share the
results here.
Regards,
From: kevin...@apache.org
Date: Tue, 17 Feb 2015 01:10:09 +
Subject: Re: Use of
Great, or you can just use nscala-time with scala 2.10!
On Tue Feb 17 2015 at 5:41:53 PM Hammam CHAMSI hscha...@hotmail.com wrote:
Thanks Kevin for your reply,
I downloaded the pre-built version and, as you said, the default Spark Scala
version is 2.10. I'm now building Spark 1.2.1 with Scala
What application are you running? Here are a few things:
- You will hit a bottleneck on CPU if you are doing some complex computation
(like parsing JSON, etc.)
- You will hit a bottleneck on memory if the data/objects used in the
program are large (like playing with HashMaps etc. inside your
Then why don't you use nscala-time_2.10-1.8.0.jar instead of
nscala-time_2.11-1.8.0.jar?
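With sbt, for instance, the cross-versioned dependency resolves to the _2.10 artifact automatically (a sketch of build.sbt):

scalaVersion := "2.10.4"

// %% appends the Scala binary version, yielding nscala-time_2.10
libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "1.8.0"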
On Tue Feb 17 2015 at 5:55:50 PM Hammam CHAMSI hscha...@hotmail.com wrote:
I can use nscala-time with Scala, but my issue is that I can't use it
within the spark-shell console! It gives me the error below.
Where did you look?
BTW, it is defined in the RDD class as a val:
val partitioner: Option[Partitioner]
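So from Scala you can read it directly, for example:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
pairs.partitioner match {
  case Some(p) => println("partitioner with " + p.numPartitions + " partitions")
  case None    => println("no partitioner set")
}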
Mohammed
-Original Message-
From: Darin McBeath [mailto:ddmcbe...@yahoo.com.INVALID]
Sent: Tuesday, February 17, 2015 1:45 PM
To: User
Subject: How do you get the partitioner for
My fault, I didn't notice the 2.11 in the jar name. It is working now with
nscala-time_2.10-1.8.0.jar.
Thanks Kevin
From: kevin...@apache.org
Date: Tue, 17 Feb 2015 08:58:13 +
Subject: Re: Use of nscala-time within spark-shell
To: hscha...@hotmail.com; kevin...@apache.org;
Hi
It could be due to a connectivity issue between the master and the slaves.
Are the slaves visible in the Spark UI? And how much memory is allocated to
the executors?
I have seen this issue occur for the following reasons:
1. Syncing of configuration between Spark master and slaves.
2.
Thank you very much for your reply!
My task is to count the number of word pairs in a document. If w1 and w2
occur together in one sentence, the count of the word pair (w1,
w2) increases by 1. So the computational part of this algorithm is simply a
two-level for-loop, as sketched below.
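A sketch of that loop in Spark (sentences assumed to be an RDD[String]; each unordered pair counted once per sentence):

val pairCounts = sentences
  .flatMap { sentence =>
    val words = sentence.split(" ").distinct
    for (w1 <- words; w2 <- words if w1 < w2) yield ((w1, w2), 1)
  }
  .reduceByKey(_ + _)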
Since the cluster is
Tip: to debug where the unserializable reference comes from, run with
-Dsun.io.serialization.extendeddebuginfo=true
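For a Spark job, one way to pass the flag (spark.driver/executor.extraJavaOptions are standard Spark settings; the rest of the command is elided):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dsun.io.serialization.extendeddebuginfo=true" \
  --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendeddebuginfo=true" \
  ...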
On Tue, Feb 17, 2015 at 10:20 AM, Jadhav Shweta jadhav.shw...@tcs.com wrote:
15/02/17 12:57:10 ERROR actor.OneForOneStrategy:
org.apache.hadoop.conf.Configuration
Hi
How big is your dataset?
Thanks
Arush
On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate jalaf...@eng.ucsd.edu
wrote:
Thank you very much for your reply!
My task is to count the number of word pairs in a document. If w1 and w2
occur together in one sentence, the number of occurrence of
Two questions regarding worker cleanup:
1) Is the best place to enable worker cleanup to set
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30"
in conf/spark-env.sh for each worker? Or is there a better place?
2) I see this has a default TTL of 7
I've decided to try
spark-submit ... --conf
"spark.driver.extraJavaOptions=-DpropertiesFile=/home/emre/data/myModule.properties"
But when I try to retrieve the value of propertiesFile via
System.err.println("propertiesFile : " + System.getProperty("propertiesFile"));
I get NULL: