Can anybody help me?
Thanks.
Chieh-Yen
On Wed, Apr 16, 2014 at 5:18 PM, Chieh-Yen r01944...@csie.ntu.edu.tw wrote:
Dear all,
I developed an application whose communication messages are sometimes
larger than 10 MB.
For smaller datasets it works fine, but it fails for larger datasets.
Hi Chieh,
You can increase the heap size by exporting the Java options (see below;
this will increase the heap size to 10 GB):
export _JAVA_OPTIONS=-Xmx10g
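If you'd rather scope the setting to Spark itself, the executor heap can
also be set through SparkConf in application code; a minimal sketch (the
app name and the 10g value are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// set the executor heap via Spark's own configuration
// instead of a global JVM flag
val conf = new SparkConf()
  .setAppName("big-messages")
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(conf)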
You should add hub, GitHub's command-line wrapper for git, to that wiki
page: https://github.com/github/hub -- it doesn't look like I have edit
access to the wiki, or I've forgotten a password, or something.
Once you've got hub installed and aliased, you get some nice additional
options,
1.) What if the data is in S3 and we cache it in memory, instead of HDFS?
2.) How is the number of reducers determined in each case?
Even if I specify set mapred.reduce.tasks=50, somehow only 2 reducers are
allocated instead of 50, although the query/tasks complete.
Regards,
Arpit
On Sun, Apr 20, 2014 at 6:27 PM, Qi Song songqi1...@gmail.com wrote:
I wonder if there is any
documentation about how to choose a partition method, based on the graph's
structure or some other properties?
The best option is to try all the partition strategies (as well as the
default,
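A minimal sketch of trying the built-in strategies in GraphX (assuming an
existing graph):

import org.apache.spark.graphx._

// re-partition the edges under each built-in strategy and compare
val byRandom = graph.partitionBy(PartitionStrategy.RandomVertexCut)
val by1D = graph.partitionBy(PartitionStrategy.EdgePartition1D)
val by2D = graph.partitionBy(PartitionStrategy.EdgePartition2D)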
On Fri, Apr 11, 2014 at 4:42 AM, Pierre-Alexandre Fonta
pierre.alexandre.fonta+sp...@gmail.com wrote:
Testing in mapTriplets whether a vertex attribute, which is defined as
Integer in the first VertexRDD but has since been changed to Double by
mapVertices, is greater than a number throws
Hi,
I think you can do just fine with your Java knowledge. There is a Java API
that you can use [1]. I am also new to Spark and I have gotten along with
just my Java knowledge. And Scala is easy to learn if you are good with
Java.
[1] http://spark.apache.org/docs/latest/java-programming-guide.html
Hello,
Is it possible to use a custom class as my Spark KryoSerializer when
running under Mesos?
I've tried adding the jar containing the class to my Spark context (via
SparkConf.addJars), but I always get:
java.lang.ClassNotFoundException: flambo.kryo.FlamboKryoSerializer
at
Hi Christophe,
Adding the jars to both SPARK_CLASSPATH and ADD_JARS is required. The
former makes them available to the spark-shell driver process, and the
latter tells Spark to make them available to the executor processes running
on the cluster.
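For example, when launching the shell (a sketch; the jar path is
illustrative):

export SPARK_CLASSPATH=/path/to/flambo-kryo.jar
export ADD_JARS=/path/to/flambo-kryo.jar
./bin/spark-shell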
-Sandy
On Wed, Apr 16, 2014 at 9:27 AM,
Hi Sung,
On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
The goal is to keep an intermediate value per row in memory, which would
allow faster subsequent computations. I.e., computeSomething would depend
on the value from the previous computation.
I
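One common pattern for keeping a per-row intermediate value is to carry it
alongside the row in the RDD itself and cache the pairs; a minimal sketch
(the update rule here is purely illustrative):

// pair each row with an initial per-row value and cache the pairs
val rows = sc.parallelize(Seq(1.0, 2.0, 3.0))
var state = rows.map(r => (r, r * 0.5)).cache()
// each subsequent computation maps over (row, previousValue) pairs
state = state.map { case (r, prev) => (r, prev + r) }.cache()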
Hi all,
I'm writing a Spark application to load S3 data into HDFS. The HDFS
version is 2.3.0, so I have to compile Spark with Hadoop 2.3.0.
After I execute
val allfiles = sc.textFile("s3n://abc/*.txt")
val output = allfiles.saveAsTextFile("hdfs://x.x.x.x:9000/dataset")
Spark throws an exception:
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.
Every reference I can find to checkpoint relates to Spark Streaming. But
the method is defined in the core Spark library, not Streaming.
Does it exist solely for streaming, or are there
Checkpoint clears dependencies. You might need checkpoint to cut a
long lineage in iterative algorithms. -Xiangrui
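A minimal sketch of that pattern (the checkpoint directory, growth factor,
and interval are illustrative):

// checkpointing needs a reliable directory, e.g. on HDFS
sc.setCheckpointDir("hdfs://namenode:9000/checkpoints")
var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 100) {
  rdd = rdd.map(_ * 1.001)
  if (i % 10 == 0) {
    rdd.checkpoint()  // marks the RDD; lineage is cut once it materializes
    rdd.count()       // force an action so the checkpoint actually happens
  }
}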
Hi Joe,
On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote:
And, I haven't gotten any answers to my questions.
One thing that might explain that is that, at least for me, all (and I
mean *all*) of your messages are ending up in my GMail spam folder,
complaining that GMail can't
I would probably agree that it's typically not a good idea to add state to
distributed systems. Additionally, from a purist's perspective, this would
be a bit of a hack on the paradigm.
However, from a practical point of view, I think that it's a reasonable
trade-off between efficiency and
Yahoo made some changes that drive mailing list posts into spam
folders: http://www.virusbtn.com/blog/2014/04_15.xml
I'm trying to get my feet wet with Spark. I've done some simple stuff in the
shell in standalone mode, and now I'm trying to connect to HDFS resources, but
I'm running into a problem.
I synced to git's master branch (c399baa - SPARK-1456 Remove view bounds on
Ordered in favor of a context
When might that be necessary or useful? Presumably I can persist and
replicate my RDD to avoid re-computation, if that's my goal. What
advantage does checkpointing provide over disk persistence with
replication?
On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng men...@gmail.com wrote:
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my
Spam folder. :(
With regard to your questions, I would suggest in general adding some more
technical detail to them. It will be difficult for people to give you
suggestions if all they are told is "Spark is slow." How does
Why don't you start by explaining what kind of operation you're running on
Spark that's slower than Hadoop MapReduce? Maybe we could start there. And
yes, this mailing list is very busy since many people are getting into
Spark; it's hard to answer everyone.
On 21 Apr 2014 20:23, Joe L selme...@yahoo.com
I'm looking to start experimenting with Spark Streaming, and I'd like to
use Amazon Kinesis (https://aws.amazon.com/kinesis/) as my data source.
Looking at the list of supported Spark Streaming sources
(http://spark.apache.org/docs/latest/streaming-programming-guide.html#linking),
I don't see any
It would be good if you can contribute this as an example. BFS is a common
enough algorithm.
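A minimal sketch of BFS (hop counts from a source vertex) with GraphX's
Pregel API, assuming an existing graph and an illustrative source id:

import org.apache.spark.graphx._

val sourceId: VertexId = 1L  // illustrative root vertex
// start the source at distance 0 and everything else at infinity
val init = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val bfs = init.pregel(Double.PositiveInfinity)(
  (id, dist, msg) => math.min(dist, msg),  // keep the smallest hop count
  t => if (t.srcAttr + 1 < t.dstAttr) Iterator((t.dstId, t.srcAttr + 1))
       else Iterator.empty,                // only send improving messages
  (a, b) => math.min(a, b))                // merge incoming messages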
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi (https://twitter.com/mayur_rustagi)
On Sat, Apr 19, 2014 at 4:16 AM, Ghufran Malik gooksma...@gmail.com wrote:
Ahh nvm I found
There was a patch posted a few weeks ago
(https://github.com/apache/spark/pull/223), but it needs a few changes in
packaging because it uses a license that isn’t fully compatible with Apache.
I’d like to get this merged when the changes are made though — it would be a
good input source to
Are you by any chance starting two StreamingContexts in the same JVM? That
could explain a lot of the weird mixing of data that you are seeing. It's
not a supported usage scenario to start multiple StreamingContexts
simultaneously in the same JVM.
TD
On Thu, Apr 17, 2014 at 10:58 PM, gaganbm
As long as the socket server sends data through the same connection, the
existing code is going to work. socket.getInputStream returns an input
stream that continuously allows you to pull data sent over the
connection. The bytesToObject function continuously reads data from the
input
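A minimal sketch of the receiving side using the built-in socket stream
(host, port, and batch interval are illustrative):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// one batch per second; data is pulled continuously from the connection
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()
ssc.start()
ssc.awaitTermination()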
I figured it out - I should be using textFile(...), not hadoopFile(...). And
my HDFS URL should include the host:
hdfs://host/user/kwilliams/corTable2/part-m-0
I haven't figured out how to let the hostname default to the host mentioned in
our /etc/hadoop/conf/hdfs-site.xml like the
Hi Ken,
On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken
ken.willi...@windlogics.com wrote:
I haven't figured out how to let the hostname default to the host mentioned
in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do,
but that's not so important.
Try adding
Diana, that is a good question.
When you persist an RDD, the system still remembers the whole lineage of
parent RDDs that created that RDD. If one of the executors fails and the
persisted data is lost (both local disk and in-memory data get lost), then
the lineage is used to recreate the RDD. The
It is possible, Nick. Please take a look here:
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
the source code is here as a pull request:
https://github.com/apache/spark/pull/223
let me know if you have any questions.
On Mon, Apr 21, 2014 at 1:00 PM, Nicholas Chammas
Sorry Matei. Will definitely start working on making the changes soon :)
Hi, I am trying to set up my own standalone Spark cluster; I started the
master node and worker nodes. Then I ran ./bin/spark-shell, and I get this
message:
14/04/21 16:31:51 ERROR TaskSchedulerImpl: Lost an executor 1 (already
removed): remote Akka client disassociated
14/04/21 16:31:51 ERROR
I tried removing the CLUSTERED directive and got the same results :( I
also removed SORTED, same deal.
I'm going to try removing partitioning altogether for now.
On Mon, Apr 21, 2014 at 4:58 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Clustering is not supported. Can you remove that
So after a little more investigation, it turns out this issue happens
specifically when I interact with the Shark server. If I log in to the
master and start a Shark session (./bin/shark), everything works as
expected.
I'm starting the Shark server with the following upstart script; am I
doing something
Sounds great, François.
On 21 Apr 2014 22:31, François Le Lay f...@spotify.com wrote:
Hi everyone,
This is a quick email to announce the creation of a Spark-NYC Meetup.
We have 2 upcoming events, one at PlaceIQ, another at Spotify where
Reynold Xin (Databricks) and Christopher Johnson
I ran into the same issue. The problem seems to be with the jets3t library
that Spark uses in project/SparkBuild.scala.
change this:
"net.java.dev.jets3t" % "jets3t" % "0.7.1"
to
"net.java.dev.jets3t" % "jets3t" % "0.9.0"
0.7.1 is not the right version of jets3t for Hadoop
Yes, I fixed it in the same way, but didn't get a chance to get back here.
I also made a PR: https://github.com/apache/spark/pull/468
Best,
--
Nan Zhu
Feels like a silly question,
but what if I wanted to apply a map to each element in an RDD and, instead
of replacing the element, add new columns holding the manipulated values?
I.e.
res0: Array[String] = Array(1 2, 1 3, 1 4, 2 1, 3 1, 4 1)
Becomes
res0: Array[String] = Array(1 2 2 4, 1 3 1 6,
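A minimal sketch of a map that appends rather than replaces (the doubling
used for the derived columns is purely illustrative):

val rdd = sc.parallelize(Array("1 2", "1 3", "1 4", "2 1", "3 1", "4 1"))
// keep the original columns and append derived ones instead of replacing
val widened = rdd.map { line =>
  val derived = line.split(" ").map(_.toInt * 2).mkString(" ")
  line + " " + derived
}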
g1 = pairs1.groupByKey().count()
pairs1 = pairs1.groupByKey(g1).cache()
g2 = triples.groupByKey().count()
pairs2 = pairs2.groupByKey(g2)
pairs = pairs2.join(pairs1)
Hi, I want to implement a hash-partitioned join as shown above, but
somehow it is taking very long to perform. As I
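A sketch of one way to make the join cheap (the partition count is
illustrative): partition both sides with the same partitioner up front,
cache, and then join, so the join itself needs no extra shuffle.

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(50)
val p1 = pairs1.partitionBy(part).cache()
val p2 = pairs2.partitionBy(part).cache()
val joined = p2.join(p1)  // co-partitioned inputs join without re-shuffling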
I just call saveAsTextFile() twice. 'doc_topic_dist' is of type RDD[(Long,
Array[Int])]; each element is a pair of (doc, topic_arr). For the same
doc, the two files contain different topic_arr values.
...
doc_topic_dist.coalesce(1, true).saveAsTextFile(save_path)
14/04/22 10:43:45 WARN scheduler.TaskSetManager: Loss was due to
java.util.NoSuchElementException
java.util.NoSuchElementException: End of stream
at org.apache.spark.util.NextIterator.next(NextIterator.scala:83)
at
It's OK when I call doc_topic_dist.cache() first.
As long as the function that you are mapping over the RDD is pure,
preserving referential transparency so that anytime you map the same
function over the same initial RDD elements you get the same result
elements, then there is no problem in doing what you suggest. In fact,
it's common practice.
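A minimal sketch of the distinction:

val rdd = sc.parallelize(1 to 4)
// pure: the same elements map to the same results on every evaluation
val doubled = rdd.map(_ * 2)
// impure: results can differ each time the RDD is recomputed
val noisy = rdd.map(_ * scala.util.Random.nextInt(10))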
Yes. I am running this in local mode and the SSCs run in the same JVM.
So, if I deploy this on a cluster, would this behavior go away? Also, is
there any way I can start the SSCs on a local machine but in different
JVMs? I couldn't find anything about this in the documentation.
The
I wanted some clarification on the behavior of joined streams.
As I understand it, the join works per batch. I am reading data from two
Kafka streams and then joining them based on some keys. But what happens
if one stream hasn't produced any data in that batch duration, and the
other has some? Or let's
Please check my comment on the shark-users thread:
https://groups.google.com/forum/#!searchin/shark-users/Failure$20recovery$20in$20Shark$20when$20cluster$20/shark-users/vUUGLZANxr8/MMCtKhqjhLMJ
On Tue, Apr 22, 2014 at 8:06 AM, rama0120 lakshminaarayana...@gmail.com wrote:
Hi,
I couldn't find