If you want to force materialization, use .count().
Also, if you can, simply don't unpersist anything unless you really need to free
the memory.
On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim
wrote:
> BTW, it is possible that rdd.first() does not compute the whole p
How many users and items do you have?
Each iteration will first iterate through users and then items, so each
iteration of ALS actually ends up having 2 flatMap operations. I'd assume
that you have many more users than items (or vice versa), which is why one
of the operations generates more data.
Can you not use a Cassandra OutputFormat? It seems they have BulkOutputFormat.
An example of using it with Hadoop is here:
http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html
Using it with Spark will be similar to the examples:
https://github.com/apache/spark/blob/maste
).
>
> We could not get round that issue. (Any pointers in that direction?)
>
> That's why I'm trying the direct CQLSSTableWriter way but it looks blocked
> as well.
>
> -kr, Gerard.
>
>
>
>
> On Wed, Jun 25, 2014 at 8:57 PM, Nick Pentreath
> wrote:
You can just add elasticsearch-hadoop as a dependency to your project to
use the ESInputFormat and ESOutputFormat (
https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics
here:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
For testing, yes I thin
I've not tried this, but numpy is a tricky and complex package with many
dependencies on Fortran/C libraries etc. I'd say by the time you figure out
how to deploy numpy correctly in this manner, you may as well have just built
it into your cluster bootstrap process, or used pssh to install it on each node...
> The dependencies would get tricky but I think this is the sort of situation
> it's built for.
>
>
> On 6/27/14, 11:06 AM, Avishek Saha wrote:
>
> I too felt the same Nick but I don't have root privileges on the cluster,
> unfortunately. Are there any alternatives?
>
Take a look at the Kaggle competition datasets: https://www.kaggle.com/competitions
For SVM there are a couple of ad click prediction datasets of pretty large size.
For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/
On Thu, Jul 3, 2014 at 3
me
across which are easily publicly available (very happy to be proved wrong about
this though :)
On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers
wrote:
> Nick Pentreath wrote
>> Take a look at Kaggle competition datasets
>> - https://www.kaggle.com/competi
You should be able to use DynamoDBInputFormat (I think this should be part
of AWS libraries for Java) and create a HadoopRDD from that.
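A rough, untested sketch of that (the class names and property keys below are
what I recall from the AWS DynamoDB Hadoop connector; treat them as assumptions
and check them against whatever connector jar you end up using):

  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapred.JobConf
  import org.apache.hadoop.dynamodb.DynamoDBItemWritable
  import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat

  // Old-API (mapred) input format, so use sc.hadoopRDD
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("dynamodb.input.tableName", "my-table")                   // assumed property name
  jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")  // assumed property name

  // Records come back as (Text, DynamoDBItemWritable) pairs
  val items = sc.hadoopRDD(
    jobConf,
    classOf[DynamoDBInputFormat],
    classOf[Text],
    classOf[DynamoDBItemWritable]
  )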
On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson wrote:
> Hi,
>
> I noticed mention of DynamoDB as input source in
>
> http://ampcamp.berkeley.edu/wp-content/uplo
Do you mind posting a little more detail about what your code looks like?
It appears you might be trying to reference another RDD from within your
RDD in the foreach.
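For what it's worth, a minimal sketch of the pattern that fails and the usual
alternatives (rddA / rddB are hypothetical names for illustration):

  val rddA = sc.parallelize(Seq("a", "b", "c"))
  val rddB = sc.parallelize(Seq("b", "c", "d"))

  // This will NOT work: an RDD cannot be referenced inside a closure that runs
  // on the executors, e.g.
  //   rddA.foreach { a => println(rddB.filter(_ == a).count()) }

  // Instead, either join the two RDDs...
  val common = rddA.map(a => (a, ())).join(rddB.map(b => (b, ()))).keys

  // ...or collect and broadcast the smaller one:
  val bSet = sc.broadcast(rddB.collect().toSet)
  val filtered = rddA.filter(a => bSet.value.contains(a))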
On Fri, Jul 4, 2014 at 2:28 AM, Honey Joshi wrote:
> Original Message ---
I’m going to be working with python primarily. Are you aware of
> comparable boto support?
>
> ian
>
>> On 4 Jul 2014, at 16:32, Nick Pentreath wrote:
>>
>> You should be able to use DynamoDBInputFormat (I think this should be part
>> of AWS libraries for Java) a
Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson wrote:
> Excellent. Let me get browsing on this.
> Huge thanks,
> ian
> On 4 Jul 2014, at 16:47, Nick Pentreath wrote:
>> No boto support for that.
>>
>> In master there is Python support for loading Hadoop inputFormat.
che-hadoop-hive-dynamodb
> .
> Unsure whether this represents the latest situation…
>
> ian
>
>
> On 4 Jul 2014, at 16:58, Nick Pentreath wrote:
>
> I should qualify by saying there is boto support for dynamodb - but not
> for the inputFormat. You could roll your own python
To make it efficient in your case you may need to write a bit of custom code to
emit the top k per partition and then only send those to the driver. On the
driver you can then take the top k of the combined per-partition top-k lists
(assuming you have (object, count) pairs for each top-k list).
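A minimal sketch of that idea, assuming an RDD of (object, count) pairs called
counts (hypothetical name and example data) and a small k:

  val counts = sc.parallelize(Seq(("a", 5L), ("b", 3L), ("c", 9L)))  // example data
  val k = 10

  // Emit only the top k (by count) from each partition; for very large
  // partitions a bounded priority queue avoids sorting the whole partition
  val perPartitionTopK = counts.mapPartitions { iter =>
    iter.toList.sortBy(-_._2).take(k).iterator
  }

  // Then take the top k of the combined per-partition top-k lists on the driver
  val topK = perPartitionTopK.collect().sortBy(-_._2).take(k)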
tyQueueMonoid.build to limit the sizes
> of the queues).
> but this still means i am sending k items per partition to my driver, so k
> x p, while i only need k.
> thanks! koert
> On Sat, Jul 5, 2014 at 1:21 PM, Nick Pentreath
> wrote:
>> To make it efficient in your case
For linear models the 3rd option is by far the most efficient, and I suspect what
Evan is alluding to.
Unfortunately it's not directly possible with the classes in MLlib now, so
you'll have to roll your own using the underlying SGD / L-BFGS primitives.
On Sat, Jul 5, 2014 at 10:45 AM,
You may look into the new Azkaban, which, while being quite heavyweight, is
actually quite pleasant to use once set up.
You can run Spark jobs (spark-submit) using Azkaban shell commands and pass
parameters between jobs. It supports dependencies, simple DAGs and scheduling
with retries.
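For illustration, an Azkaban job is just a small properties file; something
roughly like this (the job name, class and paths are placeholders):

  # train-model.job
  type=command
  command=spark-submit --master yarn --class com.example.TrainModel /jobs/myapp.jar ${run.date}
  dependencies=prepare-data
  retries=2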
I'
ot. Finally we almost rewrite
> it totally. Don’t recommend it really.
>
> From: Nick Pentreath
> Reply-To:
> Date: Friday, 11 July 2014, 3:18 PM
> To:
> Subject: Re: Recommended pipeline automation tool? Oozie?
>
> You may look into the new Azkaban - which while being quite heavyweight is
You could try the following: create a minimal project using sbt or Maven,
add spark-streaming-twitter as a dependency, run sbt assembly (or mvn
package) on that to create a fat jar (with Spark as a provided dependency),
and add that to the shell classpath when starting up.
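A minimal sketch of the sbt side (version numbers are illustrative only):

  // build.sbt
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"              % "1.0.1" % "provided",
    "org.apache.spark" %% "spark-streaming"         % "1.0.1" % "provided",
    "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"
  )

  // then: sbt assembly, and start the shell with the fat jar on the classpath, e.g.
  //   ./bin/spark-shell --jars target/scala-2.10/myapp-assembly-0.1.jar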
On Tue, Jul 15, 2014 at 9
You can use .distinct.count on your user RDD.
What are you trying to achieve with the time group by?
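A minimal sketch of both, assuming the raw data can be parsed into
(date, page, user) triples (the example values below are made up):

  val events = sc.parallelize(Seq(
    ("2014-07-15", "/home", "u1"),
    ("2014-07-15", "/home", "u1"),
    ("2014-07-15", "/about", "u2")
  ))

  // total number of unique users
  val uniqueUsers = events.map(_._3).distinct().count()

  // unique users per (date, page)
  val uniquePerPage = events
    .map { case (date, page, user) => ((date, page), user) }
    .distinct()
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)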
On Tue, Jul 15, 2014 at 8:14 PM, buntu wrote:
> Hi --
> New to Spark and trying to figure out how to do a generate unique counts per
> page by date given this raw data:
> ti
It is very true that making predictions in batch for all 1 million users
against the 10k items will be quite onerous in terms of computation. I have
run into this issue too in making batch predictions.
Some ideas:
1. Do you really need to generate recommendations for each user in batch?
How are y
Agree GPUs may be interesting for this kind of massively parallel linear
algebra on reasonable size vectors.
These projects might be of interest in this regard:
https://github.com/BIDData/BIDMach
https://github.com/BIDData/BIDMat
https://github.com/dlwh/gust
Nick
On Fri, Jul 18, 2014 at 7:40 P
I got this working locally a little while ago when playing around with
AvroKeyInputFile: https://gist.github.com/MLnick/5864741781b9340cb211
But not sure about AvroSequenceFile. Any chance you have an example
datafile or records?
On Sat, Jul 19, 2014 at 11:00 AM, Sparky wrote:
> To be more sp
At the moment your best bet for sharing SparkContexts across jobs will be
Ooyala job server: https://github.com/ooyala/spark-jobserver
It doesn't yet support Spark 1.0, though I did manage to amend it to get it to
build and run on 1.0.
On Wed, Jul 23, 2014 at 1:21 AM, Asaf La
Load from sequenceFile for PySpark is in master and save is in this PR
underway (https://github.com/apache/spark/pull/1338)
I hope that Kan will have it ready to merge in time for 1.1 release window
(it should be, the PR just needs a final review or two).
In the meantime you can check out master
IScala itself seems to be a bit dead unfortunately.
I did come across this today: https://github.com/tribbloid/ISpark
On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989 <
ericjohnston1...@gmail.com> wrote:
> Hey everyone,
>
> I know this was asked before but I'm wondering if there have since bee
parallelize uses the default Serializer (PickleSerializer) while textFile
uses UTF8Serializer.
You can get around this with index.zip(input_data._reserialize()) (or
index.zip(input_data.map(lambda x: x)))
(But if you try to just do this, you run into the issue with different
number of partitions
Ryan, did you come right with this?
I've just run into the same problem on a new 1.0.0 cluster I spun up. The
issue was that my app was not running against the Spark master, but in
local mode (a default setting in my app that was a throwback from 0.9.1 and
was overriding the spark defaults on the
I'm also getting this - Ryan we both seem to be running into this issue
with elasticsearch-hadoop :)
I tried spark.files.userClassPathFirst true on the command line and that
doesn't work.
If I put that line in spark/conf/spark-defaults it works, but now I'm
getting:
java.lang.NoClassDefFoundError: o
By the way, for anyone using elasticsearch-hadoop, there is a fix for this
here: https://github.com/elasticsearch/elasticsearch-hadoop/issues/239
Ryan - using the nightly snapshot build of 2.1.0.BUILD-SNAPSHOT fixed this
for me.
On Thu, Aug 7, 2014 at 3:58 PM, Nick Pentreath
wrote:
> I
Have you set spark.local.dir (I think this is the config setting)?
It needs to point to a volume with plenty of space.
By default, if I recall, it points to /tmp.
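For example, in conf/spark-defaults.conf (the mount point is just an example):

  spark.local.dir    /mnt/spark-tmp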
> On 19 Sep 2014, at 23:35, "jw.cmu" wrote:
>
> I'm trying to run Spark ALS using the netflix dataset but failed d
forgot to copy user list
On Sat, Oct 4, 2014 at 3:12 PM, Nick Pentreath
wrote:
> what version did you put in the pom.xml?
>
> it does seem to be in Maven central:
> http://search.maven.org/#artifactdetails%7Corg.apache.hbase%7Chbase%7C0.98.6-hadoop2%7Cpom
>
>
> org.apa
Currently I see the word2vec model is collected onto the master, so the model
itself is not distributed.
I guess the question is why do you need a distributed model? Is the vocab size
so large that it's necessary? For model serving in general, unless the model is
truly massive (ie cannot fit
For ALS, if you want real-time recs (and usually this is order 10s to a few 100s of
ms response), then Spark is not the way to go - a serving layer like Oryx, or
prediction.io, is what you want.
(At Graphflow we've built our own).
You hold the factor matrices in memory and do the dot product in
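A minimal sketch of that kind of in-memory scoring (plain Scala, outside Spark;
the factor collections are assumed to have been exported from the trained model):

  def recommend(userFactors: Map[Int, Array[Double]],
                itemFactors: Array[(Int, Array[Double])],
                user: Int, k: Int): Seq[(Int, Double)] = {
    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum
    val uf = userFactors(user)
    itemFactors
      .map { case (item, f) => (item, dot(uf, f)) }   // score every item for this user
      .sortBy(-_._2)                                   // highest score first
      .take(k)
      .toSeq
  }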
els, and then
> merge the results at the end at the single master model.
> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath
> wrote:
>> Currently I see the word2vec model is collected onto the master, so the
>> model itself is not distributed.
>>
>> I guess the questi
Feel free to add that converter as an option in the Spark examples via a PR :)
On Wed, Nov 12, 2014 at 3:27 AM, alaa wrote:
> Hey freedafeng, I'm exactly where you are. I want the output to show the
> rowkey and all column qualifiers that correspond to it. How did you write
copying user group - I keep replying directly vs reply all :)
On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath
wrote:
> ALS will be guaranteed to decrease the squared error (therefore RMSE) in
> each iteration, on the *training* set.
>
> This does not hold for the *test* set / cros
Looks interesting, thanks for sharing.
Does it support cosine similarity? I only saw Jaccard mentioned from a quick
glance.
On Mon, Dec 22, 2014 at 4:12 AM, morr0723 wrote:
> I've pushed out an implementation of locality sensitive hashing for spark.
> LSH has a number of
As I recall Oryx (the old version, and I assume the new one too) provide
something like this:
http://cloudera.github.io/oryx/apidocs/com/cloudera/oryx/als/common/OryxRecommender.html#recommendToAnonymous-java.lang.String:A-float:A-int-
though Sean will be more on top of that than me :)
On Mon, Ja
Yes, you're initiating a scan for each count call. The normal way to
improve this would be to use cache(), which is what you have in your
commented-out line:
// hBaseRDD.cache()
If you uncomment that line, you should see an improvement overall.
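In other words, a sketch (hBaseRDD being the RDD from your newAPIHadoopRDD call):

  hBaseRDD.cache()               // keep the scanned rows in memory after the first action
  val first = hBaseRDD.count()   // triggers the HBase scan and populates the cache
  val second = hBaseRDD.count()  // served from the cache, no second scan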
If caching is not an option for some reason (maybe
t; of 'hBaseRDD.count' call.
>
>
>
> On Mon, Feb 24, 2014 at 11:29 PM, Nick Pentreath > wrote:
>
>> Yes, you're initiating a scan for each count call. The normal way to
>> improve this would be to use cache(), which is what you have in your
>> commente
ll only be doing one pass through the data anyway (like running a
count every time on the full dataset) then caching is not going to help you.
On Tue, Feb 25, 2014 at 4:59 PM, Soumitra Kumar wrote:
> Thanks Nick.
>
> How do I figure out if the RDD fits in memory?
>
>
> On Tue, Feb 2
so one job was
> taking 90% of time.
> BTW, is there a way to save the details available port 4040 after job is
> finished?
> On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath
> wrote:
>> It's tricky really since you may not know upfront how much data is in
>> there. Yo
filter comes from the Scala collection method "filter". I'd say it's best
to keep in line with the Scala collections API, as Spark has done with RDDs
generally (map, flatMap, take etc), so that it is easier and more natural for
developers to apply the same thinking for Scala (parallel) collections to
Sp
stand the explanation but I had to try. However, the change could be
> made without breaking anything but that's another story.
> Regards
> Bertrand
> Bertrand Dechoux
> On Thu, Feb 27, 2014 at 2:05 PM, Nick Pentreath
> wrote:
>> filter comes from the Scala collection m
Thought that Spark users may be interested in the outcome of the Spark /
scikit-learn sprint that happened last month just after Strata...
-- Forwarded message --
From: Olivier Grisel
Date: Fri, Feb 21, 2014 at 6:30 PM
Subject: Re: [Scikit-learn-general] Spark+sklearn sprint outc
There is a #3, which is to use mapPartitions and init one joda-time object per
partition, which is less overhead for large objects.
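A minimal sketch of that, assuming an RDD[String] of timestamps and joda-time
on the classpath (the path and pattern are illustrative):

  import org.joda.time.format.DateTimeFormat

  val lines = sc.textFile("timestamps.txt")
  val parsed = lines.mapPartitions { iter =>
    // one formatter per partition instead of one per record (and nothing to serialize)
    val fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
    iter.map(s => fmt.parseDateTime(s))
  }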
On Sat, Mar 8, 2014 at 2:54 AM, Mayur Rustagi
wrote:
> So the whole function closure you want to apply on your RDD needs to be
> serializable so
It would be helpful to know what parameter inputs you are using.
If the regularization schemes are different (by a factor of alpha, which
can often be quite high) this will mean that the same parameter settings
could give very different results. A higher lambda would be required with
Spark's versi
Please follow the instructions at
http://spark.apache.org/docs/latest/index.html and
http://spark.apache.org/docs/latest/quick-start.html to get started on a local
machine.
On Sun, Mar 16, 2014 at 11:39 PM, goi cto wrote:
> Hi,
> I know it is probably not th
Great work Xiangrui, thanks for the enhancement!
On Wed, Mar 19, 2014 at 12:08 AM, Xiangrui Meng wrote:
> Glad to hear the speed-up. Wish we can improve the implementation
> further in the future. -Xiangrui
> On Tue, Mar 18, 2014 at 1:55 PM, Michael Allman wrote:
>>
It shouldn't be too tricky to use the Spark job server to create a job where
the SQL statement is an input argument, which is executed and the result
returned. This gives remote server execution but no metastore layer.
On Mon, Mar 31, 2014 at 6:56 AM, Manoj Samel
wr
I would offer to host one in Cape Town, but we're almost certainly the only
Spark users in the country apart from perhaps one in Johannesburg :)
On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas
wrote:
> My fellow Bostonians and New Englanders,
> We cannot allow New
Hi Michael
Would you mind setting out exactly what differences you did find between
the Spark and Oryx implementations? Would be good to be clear on them, and
also see if there are further tricks/enhancements from the Oryx one that
can be ported (such as the lambda * numRatings adjustment).
N
O
4. Oryx uses the weighted regularization scheme you alluded to below,
>> multiplying lambda by the number of ratings.
>>
>> I've patched the spark impl to support (4) but haven't pushed it to my
>> clone on github. I think it would be a valuable feature to suppor
Hi
I'm using Spark 0.9.0.
When calling saveAsTextFile on a custom hadoop inputformat (loaded with
newAPIHadoopRDD), I get the following error below.
If I call count, I get the correct count of number of records, so the
inputformat is being read correctly... the issue only appears when trying
to
4:50 PM, Nick Pentreath wrote:
> Hi
>
> I'm using Spark 0.9.0.
>
> When calling saveAsTextFile on a custom hadoop inputformat (loaded with
> newAPIHadoopRDD), I get the following error below.
>
> If I call count, I get the correct count of number of records, so the
>
ing there?
>
> Matei
>
> On Apr 9, 2014, at 11:38 PM, Nick Pentreath
> wrote:
>
> Anyone have a chance to look at this?
>
> Am I just doing something silly somewhere?
>
> If it makes any difference, I am using the elasticsearch-hadoop plugin for
> ESInputForm
There was a closure over the config object lurking around - but in any case
upgrading to 1.2.0 for config did the trick as it seems to have been a bug
in Typesafe config.
Thanks Matei!
On Thu, Apr 10, 2014 at 8:46 AM, Nick Pentreath wrote:
> Ok I thought it may be closing over the con
I'd also say that running for 100 iterations is a waste of resources, as
ALS will typically converge pretty quickly, as in within 10-20 iterations.
On Wed, Apr 16, 2014 at 3:54 AM, Xiaoli Li wrote:
> Thanks a lot for your information. It really helps me.
>
>
> On Tue, Apr 15, 2014 at 7:57 PM, C
ES formats are pretty easy to use:
Reading:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.elasticsearch.hadoop.mr.{EsInputFormat, LinkedMapWritable}

// Point the input format at the index/type to read, plus an optional query
val conf = new Configuration()
conf.set("es.resource", "index/type")
conf.set("es.query", "?q=*")

// Each document comes back as a LinkedMapWritable of field -> value
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[EsInputFormat[NullWritable, LinkedMapWritable]],
  classOf[NullWritable],
  classOf[LinkedMapWritable]
)
The only g
Also see: https://github.com/apache/spark/pull/455
This will add support for reading sequencefile and other inputformat in
PySpark, as long as the Writables are either simple (primitives, maps and
arrays of same), or reasonably simple Java objects.
I'm about to push a change from MsgPack to
There's no easy way to do this currently. The pieces are there from the PySpark
code for regression, which should be adaptable.
But you'd have to roll your own solution.
This is something I also want, so I intend to put together a pull request for
this soon.
On Tue, Apr 29,
Hi
I see from the docs for 1.0.0 that the new "spark-submit" mechanism seems
to support specifying the jar with hdfs:// or http://
Does this support S3? (It doesn't seem to; I have tried it on EC2 but it
doesn't work):
./bin/spark-submit --master local[2] --class myclass s3n://bucket/myap
You need to use mapPartitions (or foreachPartition) to instantiate your
client in each partition as it is not serializable by the pickle library.
Something like
from pymongo import MongoClient

def mapper(iter):
    # one MongoClient per partition, created on the executor side
    db = MongoClient()['spark_test_db']
    collec = db['programs']
    for val in iter:
        asc = val.encode('
Yes actually, if you could possibly test the patch out and see how easy it is to
load HBase RDDs, that would be great.
That way I could make any amendments required to make HBase / Cassandra etc
easier
On Wed, May 21, 2014 at 4:41 AM, Matei Zaharia
wrote:
> Unfortunately
Hi
In my opinion, running HBase for immutable data is generally overkill in
particular if you are using Shark anyway to cache and analyse the data and
provide the speed.
HBase is designed for random-access data patterns and high throughput R/W
activities. If you are only ever writing immutable lo
It's not possible currently to write anything other than text (or pickle
files I think in 1.0.0 or if not then in 1.0.1) from PySpark.
I have an outstanding pull request to add READING any InputFormat from
PySpark, and after that is in I will look into OutputFormat too.
What does your data look l
Hi Tommer,
I'm working on updating and improving the PR, and will work on getting an
HBase example working with it. Will feed back as soon as I have had the
chance to work on this a bit more.
N
On Thu, May 29, 2014 at 3:27 AM, twizansk wrote:
> The code which causes the error is:
>
> The code