Is this the serialization throughput per task or the serialization
throughput for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of code
This class was made to be Java-friendly so that we wouldn't have to
use two versions. The class itself is simple, but I agree adding Java
setters would be nice.
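As a minimal sketch (the app name and settings are hypothetical), this is the chained-setter style that makes the one SparkConf class usable from Java as well as Scala, since each setter returns the conf itself:
import org.apache.spark.SparkConf
// Each setter returns the SparkConf, so the same chained style works from Java too.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ConfExample")
  .set("spark.executor.memory", "1g")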
On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote:
There is a JavaSparkContext, but no JavaSparkConf object. I
You are right: once you sort() the RDD, it has a well-defined ordering.
But that ordering is lost as soon as you transform the RDD, including
if you union it with another RDD.
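A small sketch of this point (the data is hypothetical): sortByKey gives a well-defined ordering, but a later union does not preserve it:
val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()
val other  = sc.parallelize(Seq((4, "d")))
// `merged` is a perfectly valid RDD, but no ordering is guaranteed for it.
val merged = sorted.union(other)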
On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote:
Hi Patrick,
I'm a little confused
the error first before the reader knows what is
going on.
Anyway, maybe if you have a simpler solution you could sketch it out in the
JIRA and we could talk over there. The current proposal in the JIRA is
somewhat complicated...
- Patrick
On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl
What about if you run ./bin/spark-shell
--driver-class-path=/path/to/your/jar.jar
I think either this or the --jars flag should work, but it's possible there
is a bug with the --jars flag when calling the REPL.
On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover roger.hoo...@gmail.com wrote:
A
You can also accomplish this by just having a separate service that submits
multiple jobs to a cluster where those jobs e.g. use different jars.
- Patrick
On Mon, Apr 28, 2014 at 4:44 PM, Andrew Ash and...@andrewash.com wrote:
For the second question, you can submit multiple jobs through
Try running sbt/sbt clean and re-compiling. Any luck?
On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote:
An exception occurs when compiling Spark 0.9.1 using sbt; env: Hadoop 2.3
1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
2. Found exception:
I put some notes in this doc:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan
sinchronized.a...@gmail.com wrote:
I would like to run some of the tests selectively. I am in branch-1.0
Tried the following two
For a HadoopRDD, first the spark scheduler calculates the number of tasks
based on input splits. Usually people use this with HDFS data so in that
case it's based on HDFS blocks. If the HDFS datanodes are co-located with
the Spark cluster then it will try to run the tasks on the data node that
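As an illustration (the HDFS path is hypothetical), the split-derived partition count is visible through the RDD API, and a larger minimum partition count can be requested:
// One partition per input split -- for HDFS input, roughly one per block.
val lines = sc.textFile("hdfs:///data/input/wordcount.txt")
println(lines.partitions.size)
// A minimum partition count can also be requested, which may split blocks further.
val finer = sc.textFile("hdfs:///data/input/wordcount.txt", 64)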
I've actually done it using PySpark and python libraries which call cuda code,
though I've never done it from scala directly. The only major challenge I've
hit is assigning tasks to gpus on multiple gpu machines.
Sent from my iPhone
On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa
To reiterate what Tom was saying - the code that runs inside of Spark on
YARN is exactly the same code that runs in any deployment mode. There
shouldn't be any performance difference once your application starts
(assuming you are comparing apples-to-apples in terms of hardware).
The differences:
Hey Patrick,
I've created SPARK-1458 (https://issues.apache.org/jira/browse/SPARK-1458) to
track this request, in case the team/community wants to implement it in the
future.
Nick
On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
No use case at the moment
Pierre - I'm not sure that would work. I just opened a Spark shell and did
this:
scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
res4: String = 1.7.0_25
It looks like this is the JVM version.
- Patrick
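For what it's worth, a hedged sketch of what was probably intended: dropping the extra .getClass asks for the package of SparkContext itself rather than of java.lang.Class, so the manifest's Implementation-Version would come back instead of the JVM version, assuming the Spark jar's manifest sets that attribute:
scala> classOf[org.apache.spark.SparkContext].getPackage.getImplementationVersion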
On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans
pierre.borckm
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
I am running the latest version of PySpark branch-0.9 and having some
trouble with join.
One RDD is about 100G (25GB compressed and serialized in memory) with
130K records, the other RDD is about 10G (2.5G
and on jobs that crunch hundreds of
terabytes (uncompressed) of data.
- Patrick
On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim pdey...@gmail.com wrote:
Spark community,
What's the size of the largest Spark cluster ever deployed? I've heard
Yahoo is running Spark on several hundred nodes
in
the community has feedback from trying this.
- Patrick
On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal rahul.sing...@guavus.com wrote:
Hi Christophe,
Thanks for your reply and the spec file. I have solved my issue for now.
I didn't want to rely on building Spark using the spec file (%build
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies:
No assembly descriptors found. - [Help 1]
upon running
mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly
On Apr 1, 2014, at 4:13 PM, Patrick Wendell pwend...@gmail.com wrote:
Do you get the same
For textFile I believe we overload it and let you set a codec directly:
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
For saveAsSequenceFile yep, I think Mark is right, you need an option.
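A short sketch (paths and data are hypothetical) of both save calls with an explicit codec, using the standard Hadoop GzipCodec:
import org.apache.hadoop.io.compress.GzipCodec
val rdd = sc.parallelize(1 to 100).map(_.toString)
val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
// saveAsTextFile has an overload that takes the codec class directly.
rdd.saveAsTextFile("hdfs:///tmp/out-text", classOf[GzipCodec])
// saveAsSequenceFile takes the codec as an Option.
pairRdd.saveAsSequenceFile("hdfs:///tmp/out-seq", Some(classOf[GzipCodec]))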
On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
The driver stores the meta-data associated with the partition, but the
re-computation will occur on an executor. So if several partitions are
lost, e.g. due to a few machines failing, the re-computation can be striped
across the cluster making it fast.
On Wed, Apr 2, 2014 at 11:27 AM, David
of functionality and something we might, e.g.
want to change the API of over time.
- Patrick
On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote:
What I'd like is a way to capture the information provided on the stages
page (i.e. cluster:4040/stages via IndexPage). Looking
Do you get the same problem if you build with maven?
On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey vipan...@gmail.com wrote:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
That's all I do.
On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote:
Vidal - could you show
Also in NYC, definitely interested in a spark meetup!
Sent from my iPhone
On Mar 31, 2014, at 3:07 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Happy to help with an NYC meet up (just emailed Andy). I recently moved to
VA, but am back in NYC quite often, and have been turning several
dependencies including the exact Spark version and other libraries.
- Patrick
On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.com wrote:
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same
issue. any word on this one?
On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark
applications can persist their state so that the UI can be reloaded after
they have completed.
- Patrick
On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote:
Is there a way to see 'Application
to
the respective cassandra columns. I think all of this would be fairly easy
to implement on SchemaRDD and likely will make it into Spark 1.1
- Patrick
On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai ro...@tuplejump.com wrote:
Great work guys! Have been looking forward to this . . .
In the blog it mentions
I'm not sure exactly how your cluster is configured. But as far as I can
tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find
the exact CDH version you have and link against the `mr1` version of their
published dependencies in that version.
So I think you want
Starting with Spark 0.9 the protobuf dependency we use is shaded and
cannot interfere with other protobuf libraries including those in
Hadoop. Not sure what's going on in this case. Would someone who is
having this problem post exactly how they are building spark?
- Patrick
On Fri, Mar 21, 2014
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.
- Patrick
On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
There is no direct way to get this in pyspark, but you can get it from the
underlying java
... but that's not quite
released yet :)
- Patrick
On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers ko...@tresata.com wrote:
i currently typically do something like this:
scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min
if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.
- Patrick
On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra m...@clearstorydata.com wrote:
It's much simpler: rdd.partitions.size
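For context, a small sketch with hypothetical numbers: after a highly selective filter the RDD keeps its original partition count, which rdd.partitions.size makes visible, and coalesce is one way to shrink it:
val year = sc.parallelize(1 to 365, 365)   // e.g. one partition per day
val oneDay = year.filter(_ == 100)         // highly selective filter
println(oneDay.partitions.size)            // still 365, most of them empty
val compact = oneDay.coalesce(1)           // collapse to fewer partitions
println(compact.partitions.size)           // 1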
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
This is not released yet but we're planning to cut a 0.9.1 release
very soon (e.g. most likely this week). In the meantime you'll have to
check out branch-0.9 of Spark and publish it locally, then depend on the
snapshot version. Or just wait it out...
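A hedged sketch of the publish-locally route; the exact snapshot version string is an assumption and depends on what branch-0.9's build files declare:
// In your project's build.sbt, after running `git checkout branch-0.9`
// and `sbt/sbt publish-local` inside a Spark checkout:
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1-SNAPSHOT"  // hypothetical version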
On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu
itself and override getPreferredLocations.
Keep in mind this is tricky because the set of executors might change
during the lifetime of a Spark job.
- Patrick
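As a hedged sketch (all names are hypothetical), a custom RDD that overrides getPreferredLocations might look roughly like this; the hosts returned are only hints, precisely because the set of executors can change:
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper RDD that pins each partition to a preferred host.
class PinnedRDD[T: scala.reflect.ClassTag](
    parent: RDD[T],
    hosts: Seq[String]) extends RDD[T](parent) {

  // Reuse the parent's partitioning and computation unchanged.
  override def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)

  // Round-robin the configured hosts across partitions; purely a scheduling hint.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(hosts(split.index % hosts.size))
}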
On Thu, Mar 13, 2014 at 11:50 AM, David Thomas dt5434...@gmail.com wrote:
Is it possible to partition the RDD elements in a round robin
Hey Sen,
Suarav is right, and I think all of your print statements are inside of the
driver program rather than inside of a closure. How are you running your
program (i.e. what do you run that starts this job)? Where you run the
driver you should expect to see the output.
- Patrick
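A small sketch (messages are hypothetical) of the distinction: a println in the driver shows up where the job is launched, while a println inside a closure runs on an executor and lands in that worker's stdout log:
val nums = sc.parallelize(1 to 4)
println("runs in the driver: " + nums.count())            // visible where you launched the job
nums.foreach(n => println("runs on an executor: " + n))   // check the worker stdout logs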
On Mon, Mar
change so it won't help the ulimit problem.
This means you'll have to use fewer reducers (e.g. pass reduceByKey a
number of reducers) or use fewer cores on each machine.
- Patrick
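A minimal sketch (data and reducer count are hypothetical) of passing an explicit number of reducers to reduceByKey, as suggested above:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// The second argument sets the number of reduce partitions ("reducers").
val counts = pairs.reduceByKey(_ + _, 16)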
On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
matthew.c.ch...@gmail.com wrote:
Hi everyone,
My team (cc'ed
on the worker machines. If you see stderr but not stdout
that's a bit of a puzzler since they both go through the same
mechanism.
- Patrick
On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:
Hi
I have some System.out.println in my Java code that is working ok in a local
environment
The difference between your two jobs is that take() is optimized and
only runs on the machine where you are using the shell, whereas
sortByKey requires using many machines. It seems like maybe python
didn't get upgraded correctly on one of the slaves. I would look in
the /root/spark/work/ folder
- Patrick
On Wed, Mar 5, 2014 at 1:52 PM, Sergey Parhomenko sparhome...@gmail.com wrote:
Hi Patrick,
Thanks for the patch. I tried building a patched version of
spark-core_2.10-0.9.0-incubating.jar but the Maven build fails:
[ERROR]
/home/das/Work/thx/incubator-spark/core/src/main/scala/org