think something is not linked to something properly (not a
Java expert unfortunately).
Thanks!
Ognen
On 7/13/14, 10:35 AM, Nicholas Chammas wrote:
If you’re still seeing gibberish, it’s because Spark is not using the
LZO libraries properly. In your case, I believe you should be calling
On Sun, Jul 13, 2014 at 9:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
I can easily fix this by changing this to YarnConfiguration instead of
MRJobConfig but was wondering what the steps are for submitting a fix.
Relevant links:
-
For example, are LIKE 'string%' queries supported? Trying one on 1.0.1
yields java.lang.ExceptionInInitializerError.
Nick
On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
Is there a place where we can find an up-to-date list of supported SQL
syntax in Spark
Actually, this looks like it's some kind of regression in 1.0.1, perhaps
related to assembly and packaging with spark-ec2. I don’t see this issue
with the same data on a 1.0.0 EC2 cluster.
How can I trace this down for a bug report?
Nick
On Sun, Jul 13, 2014 at 11:18 PM, Nicholas Chammas
Are you sure the code running on the cluster has been updated?
I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m
assuming that’s taken care of, at least in theory.
I just spun down the clusters I had up, but I will revisit this tomorrow
and provide the information you
, Jul 12, 2014 at 3:21 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
To add a potentially relevant piece of information, around when I stop
the StreamingContext, I get the following warning:
14/07/12 22:16:18 WARN ReceiverTracker: All of the receivers have not
deregistered, Map(0
Okie doke. Thanks for the confirmation, Burak and Tathagata.
On Thu, Jul 10, 2014 at 2:23 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I confirm that is indeed the case. It is designed to be so because it
keeps things simpler - fewer chances of issues related to cleanup when
stop()
In short, Spark SQL is the future, built from the ground up. Shark was
built as a drop-in replacement for Hive, will be retired, and will perhaps
be replaced by a future initiative to run Hive on Spark
https://issues.apache.org/jira/browse/HIVE-7292.
More info:
-
Awww ye. That worked! Thank you Sameer.
Is this documented somewhere? I feel there's a slight doc deficiency
here.
Nick
On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote:
Hi Nicholas,
I am using Spark 1.0 and I use this method to specify the additional jars.
changes in
its API for us.
Nick
On Wed, Jul 9, 2014 at 3:34 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Awww ye. That worked! Thank you Sameer.
Is this documented somewhere? I feel there's a slight doc deficiency
here.
Nick
On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak
I found it quite painful to figure out all the steps required and have
filed SPARK-2394 https://issues.apache.org/jira/browse/SPARK-2394 to
track improving this. Perhaps I have been going about it the wrong way, but
it seems way more painful than it should be to set up a Spark cluster built
using
, Nicholas Chammas wrote:
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
gurvinder.si...@uninett.no wrote:
csv =
sc.newAPIHadoopFile(opts.input,com.hadoop.mapreduce.LzoTextInputFormat,org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text).count
On Sun, Jul 6, 2014 at 10:10 AM, Robert James srobertja...@gmail.com
wrote:
If I've created a Spark EC2 cluster, how can I add or take away workers?
There is a manual process by which this is possible, but I’m not sure of
the procedure. There is also SPARK-2008
On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh gurvinder.si...@uninett.no
wrote:
csv =
sc.newAPIHadoopFile(opts.input, com.hadoop.mapreduce.LzoTextInputFormat, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text).count()
Does anyone know what the rough equivalent of this would be in
On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui free...@adatao.com
wrote:
The overhead of bringing up AWS Spark spot instances is NOT an
inherent problem of Spark.
That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding cluster
not
sure.
On Thu, Jun 26, 2014 at 4:48 PM, Aureliano Buendia buendia...@gmail.com
wrote:
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding
of 2171 S3 files, with an average size of about 18MB.
On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
What do you get for rdd1._jrdd.splits().size()? You might think you’re
getting 100 partitions, but it may not be happening.
On Tue, Jun 24, 2014 at 3
The main thing that will affect the concurrency of any saveAs...()
operations is a) the number of partitions of your RDD, and b) how many
cores your cluster has.
How big is the RDD in question? How many partitions does it have?
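For reference, a minimal PySpark sketch of what I mean (the paths and partition count here are made up):
# Hypothetical example: check how many partitions (and thus concurrent save tasks) you actually get.
rdd = sc.textFile('s3n://bucket/input/')
print(rdd._jrdd.splits().size())
# Repartition before saving if that number is too low for your cluster's cores.
rdd.repartition(100).saveAsTextFile('s3n://bucket/output/')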
On Thu, Jun 19, 2014 at 3:38 PM, Sandeep Parikh
Is that month= syntax something special, or do your files actually have
that string as part of their name?
On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi all,
Thanks for the reply. I'm using parquetFile as input, is that a problem?
In hadoop fs -ls, the
the partitions. It's part of the url
of my files.
Jianshi
On Wed, Jun 18, 2014 at 11:52 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Is that month= syntax something special, or do your files actually have
that string as part of their name?
On Wed, Jun 18, 2014 at 2:25 AM
Agreed, it would be better if Apache controlled or managed this directly.
I think making such a change is just a matter of opening a new issue
https://github.com/Homebrew/homebrew/issues/new on the Homebrew issue
tracker. I believe that's how Spark made it in there in the first place--it
was just
Matei,
You might want to comment on that issue Sherl linked to, or perhaps this one
https://github.com/Homebrew/homebrew/issues/30228, to ask about how
Apache can manage this going forward. I know that mikemcquaid
https://github.com/mikemcquaid is very active on the Homebrew repo and is
one of
Ah, this looks like exactly what I need! It looks like this was recently added
into PySpark https://github.com/apache/spark/pull/705/files#diff-6 (and
Spark Core), but it's not in the 1.0.0 release.
Thank you.
Nick
On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin doris.s@gmail.com wrote:
Hi
That’s pretty neat! So I guess if you start with an RDD of objects, you’d
first do something like RDD.map(lambda x: Record(x['field_1'],
x['field_2'], ...)) in order to register it as a table, and from there run
your aggregates. Very nice.
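A rough sketch of that flow in 1.0-era PySpark, assuming SQLContext.inferSchema over an RDD of dicts (the field names and data here are made up):
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# Hypothetical RDD of parsed records.
records = sc.parallelize([{'field_1': 'a', 'field_2': 1}, {'field_1': 'b', 'field_2': 2}])
# Infer a schema and register the result as a table, then run aggregates over it.
sqlContext.inferSchema(records).registerAsTable('records')
print(sqlContext.sql('SELECT field_1, COUNT(*) FROM records GROUP BY field_1').collect())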
On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks
:
If your input data is JSON, you can also try out the recently merged
in initial JSON support:
https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That’s pretty neat! So I guess
This appears to be missing from PySpark.
Reported in SPARK-2141 https://issues.apache.org/jira/browse/SPARK-2141.
On Fri, Jun 13, 2014 at 10:43 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
val myRdds = sc.getPersistentRDDs
assert(myRdds.size === 1)
It'll return a map. Its
Yeah, unfortunately PySpark still lags behind the Scala API a bit, but it's
being patched up at a good pace.
On Fri, Jun 13, 2014 at 1:43 PM, mrm ma...@skimlinks.com wrote:
Hi Nick,
Thank you for the reply, I forgot to mention I was using pyspark in my
first
message.
Maria
On Fri, Jun 13, 2014 at 1:55 PM, Albert Chu ch...@llnl.gov wrote:
1) How is this data process-local? I *just* copied it into HDFS. No
spark worker or executor should have loaded it.
Yeah, I thought that PROCESS_LOCAL meant the data was already in the JVM on
the worker node, but I do see the
Yeah, we badly need new AMIs that include at a minimum package/security
updates and Python 2.7. There is an open issue to track the 2.7 AMI update
https://issues.apache.org/jira/browse/SPARK-922, at least.
On Thu, Jun 12, 2014 at 3:34 PM, unorthodox.engine...@gmail.com wrote:
Creating AMIs
FYI: Here is a related discussion
http://apache-spark-user-list.1001560.n3.nabble.com/Persist-and-unpersist-td6437.html
about this.
On Thu, Jun 12, 2014 at 8:10 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Maybe it would be nice if unpersist() 'triggered' the computations
Yes, I mean the RDD would just have elements to define partitions or
ranges within the search space, not have actual hashes. It's really just
using the RDD as a control structure, rather than a real data set.
As you noted, we don't need to store any hashes. We just need to check them
as they
In PySpark you can also do help(my_rdd) and get a nice help page of methods
available.
On Sunday, June 8, 2014, Carter gyz...@hotmail.com wrote:
Thank you very much Gerard.
To add another note on the benefits of using Scala to build Spark, here is
a very interesting and well-written post
http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
on
the Databricks blog about how Scala 2.10's runtime reflection enables
I think by default a thread can die up to 4 times before Spark considers it
a failure. Are you seeing that happen? I believe that is a configurable
thing, but don't know off the top of my head how to change it.
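If I recall correctly, the relevant setting is spark.task.maxFailures (default 4). A sketch of bumping it, assuming you build your own SparkConf (PySpark shown):
from pyspark import SparkConf, SparkContext

# Assumed example: tolerate up to 8 failures per task instead of the default 4.
conf = SparkConf().setAppName('flaky-s3-job').set('spark.task.maxFailures', '8')
sc = SparkContext(conf=conf)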
I've seen this error before when reading data from a large amount of files
on S3, and
On Wed, Jun 4, 2014 at 9:35 AM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:
Oh, I went back to m1.large while those issues get sorted out.
Random side note, Amazon is deprecating the m1 instances in favor of m3
instances, which have SSDs and more ECUs than their m1 counterparts.
On Tue, Jun 3, 2014 at 6:52 AM, sirisha_devineni
sirisha_devin...@persistent.co.in wrote:
Did you open a JIRA ticket for this feature to be implemented in spark-ec2?
If so can you please point me to the ticket?
Just created it: https://issues.apache.org/jira/browse/SPARK-2008
Nick
with the _success folder.
In any case this change of behavior should be documented IMO.
Cheers
Pierre
Message sent from a mobile device - excuse typos and abbreviations
On Jun 2, 2014, at 17:42, Nicholas Chammas nicholas.cham...@gmail.com wrote:
What I've found using saveAsTextFile
OK, thanks for confirming. Is there something we can do about that leftover
part- files problem in Spark, or is that for the Hadoop team?
On Monday, June 2, 2014, Aaron Davidson ilike...@gmail.com wrote:
Yes.
On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote
method
if you like), but I can see it both ways. Caller beware.
On Mon, Jun 2, 2014 at 10:08 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
OK, thanks for confirming. Is there something we can do about that
leftover
part- files problem in Spark, or is that for the Hadoop
On Mon, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote:
(B) Semantics in Spark 1.0 and earlier:
Do you mean 1.0 and later?
Option (B) with the exception-on-clobber sounds fine to me, btw. My use
pattern is probably common but not universal, and deleting user files is
indeed
Could you post how exactly you are invoking spark-ec2? And are you having
trouble just with r3 instances, or with any instance type?
On Sunday, June 1, 2014, Jeremy Lee unorthodox.engine...@gmail.com wrote:
It's been another day of spinning up dead clusters...
I thought I'd finally worked out what
If you are explicitly specifying the AMI in your invocation of spark-ec2,
may I suggest simply removing any explicit mention of AMI from your
invocation? spark-ec2 automatically selects an appropriate AMI based on the
specified instance type.
On Sunday, June 1, 2014, Nicholas
PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
If you are explicitly specifying the AMI in your invocation of
spark-ec2,
may I suggest simply removing any explicit mention of AMI from your
invocation? spark-ec2 automatically selects an appropriate AMI based on
the
specified
No, you don't have to set up your own AMI. Actually it's probably simpler
and less error prone if you let spark-ec2 manage that for you as you first
start to get comfortable with Spark. Just spin up a cluster without any
explicit mention of AMI and it will do the right thing.
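For example, a bare-bones launch along these lines (the key pair, identity file, and cluster name are placeholders) lets spark-ec2 pick the AMI for you:
./spark-ec2 -k my-key-pair -i ~/my-key.pem -s 2 -t m1.large launch my-spark-cluster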
On Sunday, June 1, 2014,
(,)
sc.textFile(fileStr)...
- Patrick
On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
YES, your hunches were correct. I’ve identified at least one file among
the hundreds I’m processing that is indeed not a valid gzip file.
Does anyone know of an easy way
You guys were up late, eh? :) I'm looking forward to using this latest
version.
Is there any place we can get a list of the new functions in the Python
API? The release notes don't enumerate them.
Nick
On Fri, May 30, 2014 at 10:15 AM, Ian Ferreira ianferre...@hotmail.com
wrote:
Congrats
Daniel,
Is SPARK-1103 https://issues.apache.org/jira/browse/SPARK-1103 related to
your example? Automatic unpersist()-ing of unreferenced RDDs would be nice.
Nick
On Tue, May 27, 2014 at 12:28 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
I keep bumping into a problem with
On 22 May 2014 04:43, Nicholas Chammas nicholas.cham...@gmail.com wrote:
That's a good idea. So you're saying create a SchemaRDD
Looking forward to that update!
Given a table of JSON objects like this one:
{
  "name": "Nick",
  "location": {
    "x": 241.6,
    "y": -22.5
  },
  "likes": ["ice cream", "dogs", "Vanilla Ice"]
}
It would be SUPER COOL if we could query that table in a way that is as
natural as follows:
SELECT
, May 22, 2014 at 5:32 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Looking forward to that update!
Given a table of JSON objects like this one:
{
  "name": "Nick",
  "location": {
    "x": 241.6,
    "y": -22.5
  },
  "likes": ["ice cream", "dogs", "Vanilla Ice
Thanks for the suggestions, people. I will try to hone in on which specific
gzipped files, if any, are actually corrupt.
Michael,
I’m using Hadoop 1.0.4, which I believe is the default version that gets
deployed by spark-ec2. The JIRA issue I linked to earlier,
Any tips on how to troubleshoot this?
On Thu, May 15, 2014 at 4:15 PM, Nick Chammas nicholas.cham...@gmail.com wrote:
I’m trying to do a simple count() on a large number of GZipped files in
S3. My job is failing with the following message:
14/05/15 19:12:37 WARN scheduler.TaskSetManager:
Yes, it does work with fewer GZipped files. I am reading the files in using
sc.textFile() and a pattern string.
For example:
a = sc.textFile('s3n://bucket/2014-??-??/*.gz')
a.count()
Nick
On Tue, May 20, 2014 at 10:09 PM, Madhu ma...@madhu.com wrote:
I have read gzip files from S3
Where's your driver code (the code interacting with the RDDs)? Are you
getting serialization errors?
On Saturday, May 17, 2014, Samarth Mailinglist mailinglistsama...@gmail.com wrote:
Hi all,
I am trying to store the results of a reduce into mongo.
I want to share the variable collection in the
Thanks for the info about adding/removing nodes dynamically. That's
valuable.
On Friday, May 16, 2014, Akhil Das ak...@sigmoidanalytics.com wrote:
Hi Han :)
1. Is there a way to automatically re-spawn Spark workers? We've seen
situations where an executor OOM causes the worker process to be DEAD and it does
On Wed, May 7, 2014 at 4:44 PM, Aaron Davidson ilike...@gmail.com wrote:
Spark can only run as many tasks as there are partitions, so if you don't
have enough partitions, your cluster will be underutilized.
This is a very important point.
kamatsuoka, how many partitions does your RDD have
Would cache() + count() every N iterations work just as well as
checkPoint() + count() to get around this issue?
We're basically trying to get Spark to avoid working on too lengthy a
lineage at once, right?
Nick
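For concreteness, a sketch of the checkpoint() + count() variant I'm asking about (the RDD, the update step, and the interval are all made up):
sc.setCheckpointDir('hdfs:///tmp/checkpoints')  # required before checkpoint() can be used
rdd = sc.parallelize(range(1000), 10)
for i in range(100):
    rdd = rdd.map(lambda x: x + 1)  # stand-in for the real iterative update
    if i % 10 == 9:
        rdd.cache()
        rdd.checkpoint()  # truncates the lineage at this point
        rdd.count()       # action that materializes the cache and the checkpoint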
On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote:
After
On Wed, May 7, 2014 at 4:00 AM, Han JU ju.han.fe...@gmail.com wrote:
But in my experience, when reading directly from s3n, Spark creates only 1
input partition per file, regardless of the file size. This may lead to
performance problems if you have big files.
You can (and perhaps should)
, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I think you're looking for RDD.foreach()
http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach.
According to the programming guide
http://spark.apache.org/docs/latest/scala-programming-guide.html:
Run
On Tue, May 6, 2014 at 10:07 PM, kamatsuoka ken...@gmail.com wrote:
I was using s3n:// but I got frustrated by how
slow it is at writing files.
I'm curious: How slow is slow? How long does it take you, for example, to
save a 1GB file to S3 using s3n vs s3?
Amazon also strongly discourages the use of s3:// because the block file
system it maps to is deprecated.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
Note
The configuration of Hadoop running on Amazon EMR differs from the default
configuration
I think you're looking for RDD.foreach()
http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach.
According to the programming guide
http://spark.apache.org/docs/latest/scala-programming-guide.html:
Run a function func on each element of the dataset. This is usually
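The guide quote is cut off here; for completeness, a trivial made-up example of foreach for side effects (the function runs on the workers, not the driver):
def log_record(record):
    # Side effect executed on whichever worker holds the partition.
    print(record)

sc.parallelize([1, 2, 3, 4]).foreach(log_record)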
Yes, persist/cache will cache an RDD only when an action is applied to it.
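In other words (a trivial sketch, numbers made up):
rdd = sc.parallelize(range(10000)).map(lambda x: x * x)
rdd.cache()  # only marks the RDD for caching; nothing is computed yet
rdd.count()  # the first action computes the RDD and populates the cache
rdd.count()  # later actions read from the cache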
On Sun, May 4, 2014 at 6:32 AM, Earthson earthson...@gmail.com wrote:
thx for the help, unpersist is exactly what I want :)
I see that Spark will remove some cache automatically when memory is full,
it is much more
is an extension of the familiar hadoop distcp, of course.
On Thu, May 1, 2014 at 11:41 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
The fastest way to save to S3 should be to leave the RDD with many
partitions, because all partitions will be written out in parallel.
Then, once
There are many freely-available resources for the enterprising individual
to use if they want to Spark up their life.
For others, some structured training is in order. Say I want everyone from
my department at my company to get something like the AMP Camp
http://ampcamp.berkeley.edu/ experience,
12:15 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
coalesce(1), you move everything in the RDD to a single partition, which
then gives you 1 output file.
It will still be called part-0 or something like
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
coalesce(1), you move everything in the RDD to a single partition, which
then gives you 1 output file.
It will still be called part-0 or something like that because that’s
defined by the Hadoop API that Spark uses for
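As a concrete (made-up) example of the trade-off:
rdd = sc.textFile('s3n://bucket/input/')
rdd.saveAsTextFile('out-many')                # one part-* file per partition, written in parallel
rdd.coalesce(1).saveAsTextFile('out-single')  # a single part-* file, written by a single task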
It would be useful to have some way to open multiple files at once into a
single RDD (e.g. sc.textFile(iterable_over_uris)). Logically, it would be
equivalent to opening a single file which is made by concatenating the
various files together. This would only be useful, of course, if the source
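For what it's worth, I believe textFile() already accepts a comma-separated list of paths (it goes through Hadoop's FileInputFormat), so a sketch like this (paths made up) may cover part of the use case:
paths = ['s3n://bucket/2014-01-01/', 's3n://bucket/2014-01-02/', 's3n://bucket/2014-01-03/']
combined = sc.textFile(','.join(paths))
combined.count()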
not obvious.
Nick
On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:
Perfect.
BTW just so I know where to look next time, was that in some docs?
On Apr 28, 2014, at 7:04 PM, Nicholas Chammas
nicholas.cham...@gmail.com
wrote
to more traditional methods though.
The only worry I have is that the Phoenix input format doesn't adequately
split the data across multiple nodes, so that's something I will need to
look at further.
Josh
On Apr 25, 2014, at 6:33 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Josh
On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Just took a quick look at the overview here
http://phoenix.incubator.apache.org/ and
the quick start guide here
http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html.
It looks like Apache
How long are the count() steps taking? And how many partitions are pairs1 and
triples initially divided into? You can see this by doing
pairs1._jrdd.splits().size(), for example.
If you just need to count the number of distinct keys, is it faster if you
did the following instead of
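The suggested alternative is cut off in the archive; one common way to count distinct keys in PySpark (my assumption, not necessarily the original suggestion) would be:
pairs1 = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])  # stand-in for the real pairs1
print(pairs1.keys().distinct().count())  # 2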
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my
Spam folder. :(
With regards to your questions, I would suggest in general adding some more
technical detail to them. It will be difficult for people to give you
suggestions if all they are told is "Spark is slow." How does
I'm looking to start experimenting with Spark Streaming, and I'd like to
use Amazon Kinesis https://aws.amazon.com/kinesis/ as my data source.
Looking at the list of supported Spark Streaming sources
http://spark.apache.org/docs/latest/streaming-programming-guide.html#linking,
I don't see any
Never mind. I'll take it from both Andrew and Syed's comments that the
answer is yes. Dunno why I thought otherwise.
On Wed, Apr 16, 2014 at 5:43 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I’m running into a similar issue as the OP. I’m running the same job over
and over
From the Spark tuning guide http://spark.apache.org/docs/latest/tuning.html:
In general, we recommend 2-3 tasks per CPU core in your cluster.
I think you can only get one task per partition to run concurrently for a
given RDD. So if your RDD has 10 partitions, then 10 tasks at most can
operate
Looking at the Python version of textFile()
http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile,
shouldn't it be *max*(self.defaultParallelism, 2)?
If the default parallelism is, say 4, wouldn't we want to use that for
minSplits instead of 2?
On Tue,
doesn't seem to pick them
up.
Nick
On Fri, Mar 7, 2014 at 4:10 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Mayur,
So looking at the section on environment variables here
http://spark.incubator.apache.org/docs/latest/configuration.html#environment-variables,
are you saying to set
, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Hey Patrick,
I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to
track this request, in case the team/community wants to implement it in the
future.
Nick
On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas
Marco,
If you call spark-ec2 launch without specifying an AMI, it will default to
the Spark-provided AMI.
Nick
On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini
silvio.costant...@granatads.com wrote:
Hi there,
To answer your question: no, there is no reason NOT to use an AMI that
Spark has
And for the record, that AMI is ami-35b1885c. Again, you don't need to
specify it explicitly; spark-ec2 will default to it.
On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Marco,
If you call spark-ec2 launch without specifying an AMI, it will default
Hey Patrick,
I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to
track this request, in case the team/community wants to implement it in the
future.
Nick
On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
No use case at the moment
A very nice addition for us PySpark users in 0.9.1 is RDD.repartition(),
which is not mentioned in the release notes
http://spark.apache.org/releases/spark-release-0-9-1.html!
This is super helpful for when you create an RDD from a gzipped file and
then need to explicitly shuffle
at 3:58 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
A very nice addition for us PySpark users in 0.9.1 is RDD.repartition(),
which is not mentioned in the release notes
http://spark.apache.org/releases/spark-release-0-9-1.html!
This is super helpful for when you
Just took a quick look at the overview here
http://phoenix.incubator.apache.org/ and
the quick start guide here
http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html.
It looks like Apache Phoenix aims to provide flexible SQL access to data,
both for transactional and analytic
, logically, but that's not to say that the
machine it's on shouldn't do work.
--
Sean Owen | Director, Data Science | London
On Tue, Apr 8, 2014 at 8:24 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
So I have a cluster in EC2 doing some work, and when I take a look here
http
If you're running on one machine with 2 cores, I believe all you can get
out of it are 2 concurrent tasks at any one time. So setting your default
parallelism to 20 won't help.
On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi all,
I have put this line
. Setting it to a higher number won't do anything helpful.
On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia e.costaalf...@unibs.it
wrote:
What do you advice me Nicholas?
On 4/4/14, 19:05, Nicholas Chammas wrote:
If you're running on one machine with 2 cores, I believe all you can
)
.partitionBy(numPartitions)
.map(lambda (counter, data): data))
If there's supposed to be a built-in Spark method to do this, I'd love to
learn more about it.
Nick
On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Hmm, doing help(rdd
Is this a
Scala-only http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile feature?
On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:
For textFile I believe we overload it and let you set a codec directly:
, Apr 2, 2014 at 3:00 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Is this a
Scala-only http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile feature?
On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:
For textFile I
/pyspark/rdd.py#L1128
On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Update: I'm now using this ghetto function to partition the RDD I get
back when I call textFile() on a gzipped file:
# Python 2.6
def partitionRDD(rdd, numPartitions):
counter
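The helper is cut off in the archive. Piecing it together with the .partitionBy(numPartitions).map(lambda (counter, data): data) fragment quoted elsewhere in this thread, it presumably looked something like this Python 2 sketch (details are my assumptions; newer releases have RDD.repartition() for this):
def partitionRDD(rdd, numPartitions):
    # Key each element with a running counter, hash-partition on the key, then drop the key.
    counter = {'n': 0}
    def keyed(data):
        counter['n'] += 1
        return (counter['n'], data)
    return (rdd
            .map(keyed)
            .partitionBy(numPartitions)
            .map(lambda (counter, data): data))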
Watch out with loading data from gzipped files. Spark cannot parallelize
the load of gzipped files, and if you do not explicitly repartition your
RDD created from such a file, everything you do on that RDD will run on a
single core.
On Wed, Apr 2, 2014 at 8:22 PM, K Koh den...@gmail.com wrote:
to something splittable.
Otherwise, if I want to speed up subsequent computation on the RDD, I
should explicitly partition it with a call to RDD.partitionBy(10).
Is that correct?
On Mon, Mar 31, 2014 at 1:15 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
OK sweet. Thanks for walking
Just an FYI, it's not obvious from the docs
http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy
that the following code should fail:
a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
a._jrdd.splits().size()
a.count()
b = a.partitionBy(5)
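The error itself is cut off above; I believe the failure is because partitionBy() expects an RDD of key-value pairs, so a (hypothetical) fix is to key the elements first:
b = a.map(lambda x: (x, None)).partitionBy(5)  # give partitionBy a key to hash on
b = b.map(lambda kv: kv[0])                    # keep the original elements, dropping the dummy values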
Are you trying to access the UI from another machine? If so, first confirm
that you don't have a network issue by opening the UI from the master node
itself.
For example:
yum -y install lynx
lynx ip_address:8080
If this succeeds, then you likely have something blocking you from
accessing the
...@mail.gmail.com%3E
On Tue, Apr 1, 2014 at 1:51 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Alright, so I've upped the minSplits parameter on my call to textFile,
but the resulting RDD still has only 1 partition, which I assume means it
was read in on a single process. I am
in
the latest docs. Sorry about that, I also didn't realize partitionBy() had
this behavior from reading the Python docs (though it is consistent with
the Scala API, just more type-safe there).
On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Just an FYI
Howdy-doody,
I have a single, very large file sitting in S3 that I want to read in with
sc.textFile(). What are the best practices for reading in this file as
quickly as possible? How do I parallelize the read as much as possible?
Similarly, say I have a single, very large RDD sitting in memory
31, 2014 at 9:46 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
So setting minSplits
http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile
will
set the parallelism on the read in SparkContext.textFile(), assuming I have
the cores