That source repo is at https://github.com/palantir/spark/ with artifacts
published to Palantir's bintray at
https://palantir.bintray.com/releases/org/apache/spark/. If you're seeing
any of them in Maven Central, please flag it, as that's a mistake!
Andrew
On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen
Rishabh,
Have you come across any ANSI SQL queries that Spark SQL didn't support?
I'd be interested to hear if you have.
Andrew
On Tue, Jan 17, 2017 at 8:14 PM, Deepak Sharma
wrote:
> From spark documentation page:
> Spark SQL can now run all 99 TPC-DS queries.
>
> On
You sometimes have to hard refresh to get the page to update.
On Wed, Jul 27, 2016 at 5:12 PM, Jim O'Flaherty
wrote:
> Nevermind, it literally just appeared right after I posted this.
It's unlikely you need to increase the amount of memory on your master node
since it does simple bookkeeping. The majority of the memory pressure
across a cluster is on executor nodes.
See the conf/spark-env.sh file for configuring heap sizes, and this section
in the docs for more information on
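As a sketch, the relevant lines in conf/spark-env.sh might look like this
(the sizes are hypothetical -- tune them to your hardware):

  SPARK_DAEMON_MEMORY=1g    # heap for the master/worker daemons; bookkeeping needs little
  SPARK_WORKER_MEMORY=60g   # memory the worker can hand out to executors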
Josh is that class something you guys would consider open sourcing, or
would you rather the community step up and create an OutputCommitter
implementation optimized for S3?
On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen rosenvi...@gmail.com wrote:
We (Databricks) use our own DirectOutputCommitter
People can continue using the stack exchange sites as is with no additional
work from the Spark team. I would not support migrating our mailing lists
yet again to another system like Discourse because I fear fragmentation of
the community between the many sites.
On Sat, Jan 17, 2015 at 6:24 AM,
To confirm, lihu, are you using Spark version 1.2.0?
On Tue, Jan 13, 2015 at 9:26 PM, lihu lihu...@gmail.com wrote:
Hi,
I just tested the groupByKey method on 100GB of data; the cluster has 20
machines, each with 125GB RAM.
At first I set conf.set("spark.shuffle.use.netty", "false") and run
Hi Jem,
Time scaling linearly with the size of the big table doesn't seem that
surprising to me. What were you expecting?
I assume you're doing normalRDD.join(indexedRDD). If you were to replace
the indexedRDD with a normal RDD, what times do you get?
On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker
Note also for short circuit reads that early versions are actually
net-negative in performance. Only after a second Hadoop release of the
feature did it become a net-positive change. See earlier threads
on this mailing list where short circuit reads are discussed.
On Fri, Jan 9, 2015 at
That's a worker setting which cleans up the files left behind by executors,
so spark.cleaner.ttl doesn't operate at the RDD level. After
https://issues.apache.org/jira/browse/SPARK-1860 the cleaner won't clean up
directories left by running executors.
On Fri, Jan 9, 2015 at 7:38 AM,
Hi Michael,
I think you need to explicitly call sc.stop() on the spark context for it
to close down properly (this doesn't happen automatically). See
https://issues.apache.org/jira/browse/SPARK-2972 for more details
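As a minimal sketch (assuming an already-built SparkConf named conf):

  val sc = new SparkContext(conf)
  try {
    // ... run your job ...
  } finally {
    sc.stop()  // explicitly shut the context down so the app exits cleanly
  }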
Andrew
On Wed, Jan 7, 2015 at 3:38 AM, michael.engl...@nomura.com wrote:
You can also read about locality here in the docs:
http://spark.apache.org/docs/latest/tuning.html#data-locality
On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger c...@koeninger.org wrote:
No, not all RDDs have location information, and in any case tasks may be
scheduled on non-local nodes if
Hi Manoj,
I've noticed that the storage tab only shows RDDs that have been cached.
Did you call .cache() or .persist() on any of the RDDs?
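For example (a sketch; the path is hypothetical):

  val rdd = sc.textFile("hdfs:///some/data").cache()
  rdd.count()  // an action materializes the cache, so the RDD then appears in the Storage tab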
Andrew
On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
I create a bunch of RDDs, including schema RDDs. When I run the
Hi Phil,
This sounds a lot like a deadlock in Hadoop's Configuration object that I
ran into a while back. If you jstack the JVM and see a thread that looks
like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546
Executor task launch worker-6 daemon prio=10
Patrick is working on the release as we speak -- I expect it'll be out
later tonight (US west coast) or tomorrow at the latest.
On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com wrote:
Interesting, the maven artifacts were dated Dec 10th.
However vote for RC2 closed recently:
Releases are roughly every 3 months, so you should expect around March if the
pace stays steady.
2014-12-16 22:56 GMT-05:00 Marco Shaw marco.s...@gmail.com:
When it is ready.
On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote:
Hi!
When will Spark 1.3.0 be released?
I
Gerard,
Are you familiar with spark.deploy.spreadOut
(http://spark.apache.org/docs/latest/spark-standalone.html) in Standalone
mode? It sounds like you want the same thing in Mesos mode.
On Thu, Dec 11, 2014 at 6:48 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Not that I am aware of.
Hi Steve,
You changed the first value in a Tuple2, which is the one that Spark uses
to hash and determine where in the cluster to place the value. By changing
the first part of the PairRDD, you've implicitly asked Spark to reshuffle
the data according to the new keys. I'd guess that you would
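A minimal sketch of the pattern I mean (hypothetical data, assuming an
existing SparkContext sc):

  val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
  val rekeyed = pairs.map { case (k, v) => (k + 1, v) }  // new keys, old partitioner discarded
  rekeyed.reduceByKey(_ + _)  // the next key-based op reshuffles by the new keys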
Hi Nathan,
It sounds like what you're asking for has already been filed as
https://issues.apache.org/jira/browse/SPARK-664 -- does that ticket match
what you're proposing?
Andrew
On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
We've done this with reduce -
I can see this being valuable for users wanting to live on the cutting edge
without building CI infrastructure themselves, myself included. I think
Patrick's recent work on the build scripts for 1.2.0 will make delivering
nightly builds to a public maven repo easier.
On Tue, Nov 18, 2014 at
Deep,
toLocalIterator is a method on the RDD class. So try this instead:
rdd.toLocalIterator()
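For example (a sketch; depending on your Spark version the method may be
written without parentheses):

  val iter = rdd.toLocalIterator  // streams one partition at a time to the driver
  iter.foreach(println)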
On Fri, Nov 14, 2014 at 12:21 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
val iter = toLocalIterator (rdd)
This is what I am doing and it says error: not found
On Fri, Nov 14, 2014 at
TJ, what was your expansion factor between image size on disk and in memory
in PySpark? I'd expect the in-memory size to be larger due to Java object overhead,
but don't know the exact amounts you should expect.
On Fri, Nov 14, 2014 at 12:50 AM, TJ Klein tjkl...@gmail.com wrote:
Hi,
I am using
Jeremy,
Did you complete this benchmark in a way that's shareable with those
interested here?
Andrew
On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I'd also be interested in seeing such a benchmark.
On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira
Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of the Spark worker
running on a single box. Changing the number changes how the
hardware you have is split up (useful for breaking large servers into
multiple 32GB heaps, which perform better), but it doesn't change the amount
of hardware you
that Spark
could recognize Phi as one worker and run workloads on it?
Thanks
On Oct 10, 2014, at 4:54 AM, Andrew Ash and...@andrewash.com wrote:
Hi Lang,
What special features of the Xeon Phi do you want Spark to take advantage
of?
On Thu, Oct 9, 2014 at 4:50 PM, Lang Yu lysubscr
I'm not sure what the design is, but I think the current behavior if the
driver can't reach the master is to attempt to connect once and fail if
that attempt fails. Is that what you're observing? (What version of Spark
also?)
On Fri, Oct 17, 2014 at 3:51 AM, preeze etan...@gmail.com wrote:
Hi
Hi Lang,
What special features of the Xeon Phi do you want Spark to take advantage
of?
On Thu, Oct 9, 2014 at 4:50 PM, Lang Yu lysubscr...@gmail.com wrote:
Hi,
I have set up Spark 1.0.2 on the cluster using standalone mode and the
input is managed by HDFS. One node of the cluster has Intel
You will need to restart your Mesos workers to pick up the new limits as
well.
On Tue, Oct 7, 2014 at 4:02 PM, Sunny Khatri sunny.k...@gmail.com wrote:
@SK:
Make sure ulimit has taken effect as Todd mentioned. You can verify via
ulimit -a. Also make sure you have proper kernel parameters set
Hi Meethu,
I believe you may be hitting a regression in
https://issues.apache.org/jira/browse/SPARK-3633
If you are able, could you please try running a patched version of Spark
1.1.0 that has commit 4fde28c reverted and see if the errors go away?
Posting your results on that bug would be
You're putting those into spark-env.sh? Try setting LD_LIBRARY_PATH as
well, that might help.
Also where is the exception coming from? You have to set this properly for
both the cluster and the driver, which are independently set.
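As a sketch, in conf/spark-env.sh on both sides (the path is hypothetical):

  export LD_LIBRARY_PATH=/opt/native/lib:$LD_LIBRARY_PATH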
Cheers!
Andrew
On Sun, Oct 5, 2014 at 1:06 PM, Tom
Hi Michael,
I couldn't find anything in Jira for it --
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming
Could you or Adrian please file a Jira ticket explaining the functionality
and maybe a proposed API? This
Hi Mingyu,
Maybe we should be limiting our heaps to 32GB max and running multiple
workers per machine to avoid large GC issues.
For a machine with 128GB of memory and 32 cores, this could look like:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=32g
SPARK_WORKER_CORES=8
Are people running with large (32GB+)
You may also be writing your algorithm in a way that it requires high peak
memory usage. An example of this could be using .groupByKey() where
.reduceByKey() might suffice instead. Maybe you can express the algorithm
in a different way that's more efficient?
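For example, summing counts per key (a sketch, assuming an RDD of
(String, Int) pairs named pairs):

  val sums1 = pairs.groupByKey().mapValues(_.sum)  // materializes every value for a key at once
  val sums2 = pairs.reduceByKey(_ + _)             // combines map-side; much lower peak memory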
On Thu, Oct 2, 2014 at 4:30 AM, Sean
Hi Gary,
I gave this a shot on a test cluster of CDH4.7 and actually saw a
regression in performance when running the numbers. Have you done any
benchmarking? Below are my numbers:
Experimental method:
1. Write 14GB of data to HDFS via [1]
2. Read data multiple times via [2]
Experiment 1:
Hi Maddenpj,
Right now the best estimate I've heard for the open file limit is that
you'll need the square of the largest partition count in your dataset.
I filed a ticket to log the ulimit value when it's too low at
https://issues.apache.org/jira/browse/SPARK-3750
On Mon, Sep 29, 2014 at 6:20
verified it.
-Kay
-- Forwarded message --
From: Andrew Ash and...@andrewash.com
Date: Tue, Sep 30, 2014 at 1:33 PM
Subject: Re: Short Circuit Local Reads
To: Matei Zaharia matei.zaha...@gmail.com
Cc: user@spark.apache.org user@spark.apache.org, Gary Malouf
malouf.g
Hi Romi,
I've observed this many times as well. So much so that on some clusters I
restart the workers every night in order to maintain these worker-master
connections.
I couldn't find an open SPARK ticket on it so filed
https://issues.apache.org/jira/browse/SPARK-3736 with you and Piotr
26, 2014 at 10:35 AM, Andrew Ash and...@andrewash.com wrote:
Hi Alexey,
You should see in the logs a locality measure like NODE_LOCAL,
PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node
on them and you're reading out of HDFS, then you should be seeing almost
all
Hi SK,
For the problem with lots of shuffle files and the too many open files
exception there are a couple options:
1. The linux kernel has a limit on the number of open files at once. This
is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or
/etc/sysctl.d/. Try increasing
Matt, you should be able to set an HDFS path so you'll get logs written to a
unified place instead of to local disk on a random box on the cluster.
On Thu, Sep 25, 2014 at 1:38 PM, Matt Narrell matt.narr...@gmail.com
wrote:
How does this work with a cluster manager like YARN?
mn
On Sep 25,
Hi Vinay,
What I'm guessing is happening is that Spark is taking the locality of
files into account and you don't have node-local data on all your
machines. This might be the case if you're reading out of HDFS and your
600 files are somehow skewed to only be on about 200 of your 400 machines.
A
Hi Harsha,
I use LZOP files extensively on my Spark cluster -- see my writeup for how
to do this on this mailing list post:
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAOoZ679ehwvT1g8=qHd2n11Z4EXOBJkP+q=Aj0qE_=shhyl...@mail.gmail.com%3E
Maybe we should better document how
Hi Christy,
I'm more of a Gradle fan but I know SBT fits better into the Scala
ecosystem as a build tool. If you'd like to give Gradle a shot try this
skeleton Gradle+Spark repo from my coworker Punya.
https://github.com/punya/spark-gradle-test-example
Good luck!
Andrew
On Thu, Sep 25, 2014
Hi Alexey,
You should see in the logs a locality measure like NODE_LOCAL,
PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node
on them and you're reading out of HDFS, then you should be seeing almost
all NODE_LOCAL accesses. One cause I've seen for mismatches is if Spark
Hi Paul,
There are several ports you need to configure in order to run in a tight
network environment. It sounds like the DMZ that contains the Spark
cluster is wide open internally, but you have to poke holes between that
and the driver.
You should take a look at the port configuration
Hi Theodore,
What do you mean by module diagram? A high level architecture diagram of
how the classes are organized into packages?
Andrew
On Tue, Sep 23, 2014 at 12:46 AM, Theodore Si sjyz...@gmail.com wrote:
Hi,
Please help me with that.
BR,
Theodore Si
Also you'd rather have 2-3 tasks per core than 1 task per core, because if
1 task per core turns out to actually be 1.01 tasks per core, you get one
full wave of tasks followed by a second wave containing very few tasks.
You get better utilization when you're higher than 1.
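As a sketch with hypothetical numbers:

  val totalCores = 20 * 8                        // e.g. 20 executors with 8 cores each
  val reshaped = rdd.repartition(totalCores * 3) // aim for ~2-3 tasks per core per wave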
Aaron Davidson goes
the parquet library and as far as I know can be
safely ignored.
On Mon, Sep 22, 2014 at 3:27 AM, Andrew Ash and...@andrewash.com wrote:
Hi All,
I'm seeing the below WARNINGs in stdout using Spark SQL in Spark 1.1.0 --
is this warning a known issue? I don't see any open Jira tickets for it.
Sep 22
in Spark Streaming, and some MLlib
algorithms. If you can help with the guide, I think it would be a nice
feature to have!
Burak
- Original Message -
From: Andrew Ash and...@andrewash.com
To: Burak Yavuz bya...@stanford.edu
Cc: Макар Красноперов connector@gmail.com, user
user
Hi Nicolas,
I've had suspicions about speculation causing problems on my cluster but
don't have any hard evidence of it yet.
I'm also interested in why it's turned off by default.
On Tue, Sep 16, 2014 at 3:01 PM, Nicolas Mai nicolas@gmail.com wrote:
Hi, guys
My current project is using
Hi Harsha,
You could look through the GraphX source to see the approach taken there
for ideas in your own. I'd recommend starting at
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385
to see the storage technique.
Why do you want to avoid
Hi Burak,
Most discussion of checkpointing in the docs relates to Spark
Streaming. Are you talking about sparkContext.setCheckpointDir()?
What effect does that have?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
On Wed, Sep 17, 2014 at 7:44 AM,
Thanks for the info!
Are there performance impacts with writing to HDFS instead of local disk?
I'm assuming that's why ALS checkpoints every third iteration instead of
every iteration.
Also I can imagine that checkpointing should be done every N shuffles
instead of every N operations (counting
nicholas.cham...@gmail.com wrote:
Andrew,
This email was pretty helpful. I feel like this stuff should be
summarized
in the docs somewhere, or perhaps in a blog post.
Do you know if it is?
Nick
On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
The locality
Hi Dhimant,
We also cleaned up these needless warnings on port failover in Spark 1.1 --
see https://issues.apache.org/jira/browse/SPARK-1902
Andrew
On Thu, Sep 4, 2014 at 7:38 AM, Dhimant dhimant84.jays...@gmail.com wrote:
Thanks Yana,
I am able to execute application and command via
Hi Grega,
Did you ever get this figured out? I'm observing the same issue in Spark
1.0.2.
For me it was after 1.5hr of a large .distinct call, followed by a
.saveAsTextFile()
14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned
task 18500
14/08/26 20:57:43 INFO
Hi Patrick,
For the spilling-within-one-key work you mention might land in Spark 1.2, is
that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or
is there another ticket I should be following?
Thanks!
Andrew
On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com
I'm actually not sure the Spark+Mesos integration supports dynamically
allocating memory (it does support dynamically allocating cores though).
Has anyone here actually used Spark+Mesos on heterogeneous hardware and
done dynamic memory allocation?
My understanding is that each Spark executor
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114
- where Spark accepts sc.executorMemory of a resource offer, regardless of
how much more memory was available
On Thu, Aug 21, 2014 at 2:12 PM, Andrew Ash and...@andrewash.com
What happens when a run of numbers is spread across a partition boundary?
I think you might end up with two adjacent groups of the same value in
that situation.
On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu dav...@databricks.com wrote:
import itertools
l = [1,1,1,2,2,3,4,4,5,1]
gs =
Hi Deb,
If you don't have long-running Spark applications (those running longer than
spark.worker.cleanup.appDataTtl) then the TTL-based cleaner is a good
solution. If however you have a mix of long-running and short-running
applications, then the TTL-based solution will fail. It will clean up
Hi Chen,
Please see the bug I filed at
https://issues.apache.org/jira/browse/SPARK-2984 with the
FileNotFoundException on _temporary directory issue.
Andrew
On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash and...@andrewash.com wrote:
Not sure which stalled HDFS client issue you're referring
// assuming Spark 1.0
Hi Baoqiang,
In my experience, on a standalone cluster you need to set
SPARK_WORKER_DIR, not SPARK_LOCAL_DIRS, to control where shuffle files are
written. I think this is a documentation issue that could be improved, as
I've also been seeing similar stacktraces on Spark core (not streaming) and
have a theory it's related to spark.speculation being turned on. Do you
have that enabled by chance?
On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com wrote:
Bill
Did you get this resolved somehow?
Hi Jikai,
It looks like you're trying to run a Spark job on data that's stored in
HDFS in .lzo format. Spark can handle this (I do it all the time), but you
need to configure your Spark installation to know about the .lzo format.
There are two parts to the hadoop-lzo library -- the first is the
Yes, I've done it before.
On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote:
Hello
Is it possible to use spark-cassandra-connector in spark-shell?
Thanks
Gary
Hi Oleg,
Did you ever figure this out? I'm observing the same exception also in
0.9.1 and think it might be related to setting spark.speculation=true. My
theory is that multiple attempts at the same task start, the first finishes
and cleans up the _temporary directory, and then the second fails
Hi Martin,
In standalone mode, each SparkContext you initialize gets its own set of
executors across the cluster. So for example if you have two shells open,
each worker machine in the cluster will run two executor JVMs, one per shell.
As far as the other docs, you can configure the total number of cores
I'm not sure if you guys ever picked a preferred method for doing this, but
I just encountered it and came up with this method that's working
reasonably well on a small dataset. It should be quite easily
generalizable to non-String RDDs.
def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
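The body is cut off above, but a minimal sketch of the general technique
(count each partition, then offset each element by its partition's starting
row number) could look like this:

  import org.apache.spark.rdd.RDD

  def addRowNumber(r: RDD[String]): RDD[(Long, String)] = {
    // element count of each partition, in partition order
    val counts = r.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
      .collect().sortBy(_._1).map(_._2)
    // starting row number for each partition, broadcast to the executors
    val offsets = r.sparkContext.broadcast(counts.scanLeft(0L)(_ + _))
    r.mapPartitionsWithIndex { (i, it) =>
      it.zipWithIndex.map { case (line, j) => (offsets.value(i) + j, line) }
    }
  }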
In general it would be nice to be able to configure replication on a
per-job basis. Is there a way to do that without changing the config
values in the Hadoop conf/ directory between jobs? Maybe by modifying
OutputFormats or the JobConf?
On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia
Hi Nan,
Great digging in -- that makes sense to me for when a job is producing some
output handled by Spark like a .count or .distinct or similar.
For the other part of the question, I'm also interested in side effects
like an HDFS disk write. If one task is writing to an HDFS path and
another
Ni Nick,
The cluster I was working on in those linked messages was a private data
center cluster, not on EC2. I'd imagine that the setup would be pretty
similar, but I'm not familiar with the EC2 init scripts that Spark uses.
Also I upgraded that cluster to 1.0 recently and am continuing to use
Hi Sameer,
If you set those two IDs to be a Tuple2 in the key of the RDD, then you can
join on that tuple.
Example:
val rdd1: RDD[Tuple3[Int, Int, String]] = ...
val rdd2: RDD[Tuple3[Int, Int, String]] = ...
val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join(
rdd2.map(k => ((k._1, k._2), k._3)))
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the
below issues without running a patched version of Spark:
https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version
conflicts for client applications
https://issues.apache.org/jira/browse/SPARK-2043 --
What's the advantage of Apache maintaining the brew installer vs users?
Apache handling it means more work on this dev team, but probably a better
experience for brew users. Just wanted to weigh pros/cons before
committing to support this installation method.
Andrew
On Wed, Jun 18, 2014 at
Wait, so the file only has four lines and the job is running out of heap
space? Can you share the code you're running that does the processing?
I'd guess that you're doing some intense processing on every line, because
just writing parsed case classes back to disk sounds very lightweight.
I
On Wed,
Gerard,
Strings in particular are very inefficient because they're stored in a
two-byte format by the JVM. If you use the Kryo serializer and
StorageLevel.MEMORY_ONLY_SER, then Kryo stores Strings in UTF8, which for
ASCII-like strings will take half the space.
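A sketch of that combination (assuming an existing SparkConf conf and RDD rdd):

  import org.apache.spark.storage.StorageLevel

  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  rdd.persist(StorageLevel.MEMORY_ONLY_SER)
  rdd.count()  // materializes the Kryo-serialized cache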
Andrew
On Tue, Jun 17,
In Spark you can use the normal globs supported by Hadoop's FileSystem,
which are documented here:
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
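For example (the path is hypothetical):

  val logs = sc.textFile("hdfs:///logs/2014-06-*/part-*")  // glob expanded by Hadoop's FileSystem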
On Wed, Jun 18, 2014 at 12:09 AM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:
Andrew,
This is a standalone cluster. And, yes, if my understanding of Spark
terminology is correct, you are correct about the port ownerships.
Jacob
Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075
Hi Oleg,
I set the size of my executors on a standalone cluster when using the shell
like this:
./bin/spark-shell --master $MASTER --total-executor-cores
$CORES_ACROSS_CLUSTER --driver-java-options
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR
It doesn't seem particularly clean, but it works.
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR
I get
bad option: '--driver-java-options'
There must be something different in my setup. Any ideas?
Thank you again,
Oleg
On 5 June 2014 22:28, Andrew Ash and...@andrewash.com wrote:
Hi Oleg,
I set the size of my executors on a standalone cluster when
Hi Ajay,
Can you please try running the same code with spark.shuffle.spill=false and
see if the numbers turn out correctly? That parameter controls whether or
not the buggy code that Matei fixed in ExternalAppendOnlyMap is used.
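That is, just for this test (assuming an existing SparkConf named conf):

  conf.set("spark.shuffle.spill", "false")  // bypasses the ExternalAppendOnlyMap spill path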
FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I
Hi Roger,
You should be able to sort within partitions using the rdd.mapPartitions()
method, and that shouldn't require holding all data in memory at once. It
does require holding the entire partition in memory though. Do you need
the partition to never be held in memory all at once?
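A minimal sketch for an RDD[String]:

  // each partition is sorted independently; only one partition per task is held in memory
  val sortedWithin = rdd.mapPartitions(it => it.toArray.sorted.iterator)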
As far as
Just curious, what do you want your custom RDD to do that the normal ones
don't?
On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
hi, folks,
is there any easier way to define a custom RDD in Java?
I am wondering if I have to define a new java class which
nilmish,
To confirm your code is using Kryo, go to the web UI of your application
(defaults to port 4040) and look at the Environment tab. If your serializer
settings are there then things should be working properly.
I'm not sure how to confirm that it works against typos in the setting, but
you
You can change storage level on an individual RDD with
.persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change
what the default persistency level is for RDDs.
Andrew
On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote:
Hi
I'm using Spark 0.9.1 and Shark
When you group by IP address in step 1 to this:
(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lon4))
How many lat/lon locations do you expect for each IP address? avg and max
are interesting.
Andrew
On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov
Your applications are probably not connecting to your existing cluster and
are instead running in local mode. Are you passing the master URL to the
SparkPi application?
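For example (the master URL here is a placeholder for your cluster's):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("SparkPi").setMaster("spark://your-master-host:7077")
  val sc = new SparkContext(conf)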
Andrew
On Tue, Jun 3, 2014 at 12:30 AM, MrAsanjar . afsan...@gmail.com wrote:
- HI all,
- Application running and
current conclusion is that the best option would be to roll our own
saveHdfsFile(...)
Would you agree?
-greetz, Gerard.
[1]
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
On Mon, Jun 2, 2014 at 11:44 PM, Andrew Ash and...@andrewash.com wrote
Hi Mayur, is that closure cleaning a JVM issue or a Spark issue? I'm used
to thinking of the closure cleaner as something Spark built. Do you have
somewhere I can read more about this?
On Tue, Jun 3, 2014 at 12:47 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
So are you using Java 7 or 8?
7
Also see this context from February. We started working with Chill to get
Avro records automatically registered with Kryo. I'm not sure the final
status, but from the Chill PR #172 it looks like this might be much less
friction than before.
Issue we filed:
Hi Carter,
In Spark 1.0 there will be an implementation of k-means available as part
of MLlib. You can see the documentation for that below (until 1.0 is fully
released).
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html
Maybe diving into the source here will help
.
Martin
On 13.05.2014 08:48, Andrew Ash wrote:
Are you setting a core limit with spark.cores.max? If you don't, in
coarse mode each Spark job uses all available cores on Mesos and doesn't
let them go until the job is terminated. At which point the other job can
access the cores.
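For example (the cap is hypothetical):

  val conf = new SparkConf().set("spark.cores.max", "8")  // leaves the remaining cores for other jobs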
https
Hi Randy,
In Spark 1.0 there was a lot of work done to allow unpersisting data that's
no longer needed. See the below pull request.
Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the
next variable to see if you can cut the dependency there.
Hi Andrea,
What version of Spark are you using? There were some improvements in how
Spark uses Kryo in 0.9.1 and to-be 1.0 that I would expect to improve this.
Also, can you share your registrator's code?
Another possibility is that Kryo can have some difficulty serializing very
large objects.
it aligns!
Jacob
Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075
Hi everyone,
I've also been interested in better understanding what ports are used where
and the direction the network connections go. I've observed a running
cluster and read through code, and came up with the below documentation
addition.
https://github.com/apache/spark/pull/856
Scott and
Hi Jamal,
I don't believe there are pre-written algorithms for cosine similarity or
Pearson correlation in PySpark that you can re-use. If you end up writing
your own implementation of the algorithm though, the project would
definitely appreciate if you shared that code back with the project for