Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Nick Pentreath
to try. However, the change could be made without breaking anything, but that's another story. Regards Bertrand Bertrand Dechoux On Thu, Feb 27, 2014 at 2:05 PM, Nick Pentreath nick.pentre...@gmail.com wrote: filter comes from the Scala collection method filter. I'd say it's best to keep in line

Re: Running actions in loops

2014-03-07 Thread Nick Pentreath
There is #3, which is to use mapPartitions and init one Joda-Time object per partition, which is less overhead for large objects. — Sent from Mailbox for iPhone On Sat, Mar 8, 2014 at 2:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: So the whole function closure you want to apply on your RDD needs
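
A minimal sketch of that third option, with illustrative names (lines, the date pattern) that are not from the original thread:

    import org.joda.time.format.DateTimeFormat

    val parsed = lines.mapPartitions { iter =>
      // the formatter is built once per partition, not once per record
      val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
      iter.map(s => fmt.parseDateTime(s).getMillis)
    }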

Re: Running Spark on a single machine

2014-03-16 Thread Nick Pentreath
Please follow the instructions at  http://spark.apache.org/docs/latest/index.html and  http://spark.apache.org/docs/latest/quick-start.html to get started on a local machine. — Sent from Mailbox for iPhone On Sun, Mar 16, 2014 at 11:39 PM, goi cto goi@gmail.com wrote: Hi, I know it is

Re: Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nick Pentreath
I would offer to host one in Cape Town, but we're almost certainly the only Spark users in the country apart from perhaps one in Johannesburg :) — Sent from Mailbox for iPhone On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: My fellow Bostonians and New

NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi I'm using Spark 0.9.0. When calling saveAsTextFile on an RDD loaded from a custom Hadoop InputFormat (with newAPIHadoopRDD), I get the error below. If I call count, I get the correct count of the number of records, so the InputFormat is being read correctly... the issue only appears when trying to

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
there? Matei On Apr 9, 2014, at 11:38 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Anyone have a chance to look at this? Am I just doing something silly somewhere? If it makes any difference, I am using the elasticsearch-hadoop plugin for ESInputFormat. But as I say, I can parse the data

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
There was a closure over the config object lurking around - but in any case, upgrading to 1.2.0 for config did the trick, as it seems to have been a bug in Typesafe Config. Thanks Matei! On Thu, Apr 10, 2014 at 8:46 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Ok I thought it may

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-16 Thread Nick Pentreath
I'd also say that running for 100 iterations is a waste of resources, as ALS will typically converge pretty quickly, as in within 10-20 iterations. On Wed, Apr 16, 2014 at 3:54 AM, Xiaoli Li lixiaolima...@gmail.com wrote: Thanks a lot for your information. It really helps me. On Tue, Apr

Re: User/Product Clustering with pySpark ALS

2014-04-29 Thread Nick Pentreath
There's no easy way to do this currently. The pieces are there in the PySpark code for regression, which should be adaptable. But you'd have to roll your own solution. This is something I also want, so I intend to put together a pull request for this soon — Sent from Mailbox On Tue, Apr

spark-submit / S3

2014-05-16 Thread Nick Pentreath
Hi I see from the docs for 1.0.0 that the new spark-submit mechanism seems to support specifying the jar with hdfs:// or http:// Does this support S3? (It doesn't seem to - I have tried it on EC2 and it doesn't work): ./bin/spark-submit --master local[2] --class myclass

Re: Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Nick Pentreath
Hi In my opinion, running HBase for immutable data is generally overkill in particular if you are using Shark anyway to cache and analyse the data and provide the speed. HBase is designed for random-access data patterns and high throughput R/W activities. If you are only ever writing immutable

Re: Writing RDDs from Python Spark progrma (pyspark) to HBase

2014-05-28 Thread Nick Pentreath
It's not possible currently to write anything other than text (or pickle files I think in 1.0.0 or if not then in 1.0.1) from PySpark. I have an outstanding pull request to add READING any InputFormat from PySpark, and after that is in I will look into OutputFormat too. What does your data look

Re: Python, Spark and HBase

2014-05-29 Thread Nick Pentreath
Hi Tommer, I'm working on updating and improving the PR, and will work on getting an HBase example working with it. Will feed back as soon as I have had the chance to work on this a bit more. N On Thu, May 29, 2014 at 3:27 AM, twizansk twiza...@gmail.com wrote: The code which causes the

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Nick Pentreath
@Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Nick Pentreath
I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with jar tf to see if it made it in? On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Sean, the %% syntax in SBT

Re: Cassandra examples don't work for me

2014-06-05 Thread Nick Pentreath
You need Cassandra 1.2.6 for the Spark examples — Sent from Mailbox On Thu, Jun 5, 2014 at 12:02 AM, Tim Kellogg t...@2lemetry.com wrote: Hi, I’m following the directions to run the cassandra example “org.apache.spark.examples.CassandraTest” and I get this error Exception in thread main

Re: compress in-memory cache?

2014-06-05 Thread Nick Pentreath
Have you set the persistence level of the RDD to MEMORY_ONLY_SER ( http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? If you're calling cache, the default persistence level is MEMORY_ONLY so that setting will have no impact. On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon)
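
In code, the suggestion amounts to the following (the RDD name is illustrative):

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
    // only serialized storage lets spark.rdd.compress take effect
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)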

Re: error loading large files in PySpark 0.9.0

2014-06-07 Thread Nick Pentreath
Ah, looking at that InputFormat, it should just work out of the box using sc.newAPIHadoopFile... Would be interested to hear if it works as expected for you (in Python you'll end up with bytearray values). N — Sent from Mailbox On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman

Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Nick Pentreath
When you use match, the match must be exhaustive. That is, a MatchError is thrown if the match fails. That's why you usually handle the default case using case _ => ... Here it looks like you're taking the text of all statuses - which means not all of them will be commands... Which means
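
A small illustration of the point, using hypothetical command strings:

    def handle(text: String): Unit = text match {
      case "start" => println("starting")
      case "stop"  => println("stopping")
      // without this default case, any status that is not a command
      // throws scala.MatchError at runtime
      case _       => println(s"ignoring: $text")
    }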

Re: mllib, python and SVD

2014-06-09 Thread Nick Pentreath
Don't think SVD is exposed via MLlib in Python yet, but you can also check out: https://github.com/ogrisel/spylearn where Jeremy Freeman put together a numpy-based SVD algorithm (this is a bit outdated but should still work I assume) (also https://github.com/freeman-lab/thunder has a PCA

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-10 Thread Nick Pentreath
Can you key your RDD by some key and use reduceByKey? In fact, if you are merging a bunch of maps, you can create a set of (k, v) in your mapPartitions and then reduceByKey using some merge function. The reduce will happen in parallel on multiple nodes in this case. You'll end up with just a single
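
A minimal sketch of the suggested pattern, assuming a word-count-style merge (the RDD and types are illustrative):

    // build one local map per partition, emit its (k, v) pairs,
    // then let reduceByKey merge them in parallel across nodes
    val merged = rdd.mapPartitions { iter =>
      val local = scala.collection.mutable.Map.empty[String, Long]
      iter.foreach { s => local(s) = local.getOrElse(s, 0L) + 1L }
      local.iterator
    }.reduceByKey(_ + _)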

RE: Question about RDD cache, unpersist, materialization

2014-06-11 Thread Nick Pentreath
If you want to force materialization use .count(). Also, if you can, simply don't unpersist anything unless you really need to free the memory — Sent from Mailbox On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: BTW, it is possible that rdd.first()
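
In code (RDD name illustrative):

    val cached = rdd.cache()
    cached.count() // an action that touches every partition, forcing full materialization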

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
can you not use a Cassandra OutputFormat? Seems they have BulkOutputFormat. An example of using it with Hadoop is here: http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html Using it with Spark will be similar to the examples:

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-25 Thread Nick Pentreath
round that issue. (Any pointers in that direction?) That's why I'm trying the direct CQLSSTableWriter way but it looks blocked as well. -kr, Gerard. On Wed, Jun 25, 2014 at 8:57 PM, Nick Pentreath nick.pentre...@gmail.com wrote: can you not use a Cassandra OutputFormat? Seems they have

Re: ElasticSearch enrich

2014-06-26 Thread Nick Pentreath
You can just add elasticsearch-hadoop as a dependency to your project to use the ESInputFormat and ESOutputFormat ( https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics here: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html For testing, yes I

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions For SVM there are a couple of ad-click prediction datasets of pretty large size. For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ — Sent from Mailbox On Thu, Jul 3, 2014 at

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
which are easily publicly available (very happy to be proved wrong about this though :) — Sent from Mailbox On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers alexander.rigg...@gmail.com wrote: Nick Pentreath wrote Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote: Hi, I noticed mention of DynamoDB as input source in

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote: You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson ia...@me.com wrote: Excellent. Let me get browsing on this. Huge thanks, ian On 4 Jul 2014, at 16:47, Nick Pentreath nick.pentre...@gmail.com wrote: No boto support for that. In master there is Python support for loading Hadoop inputFormat

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
. Unsure whether this represents the latest situation… ian On 4 Jul 2014, at 16:58, Nick Pentreath nick.pentre...@gmail.com wrote: I should qualify by saying there is boto support for dynamodb - but not for the inputFormat. You could roll your own python-based connection but this involves

Re: taking top k values of rdd

2014-07-05 Thread Nick Pentreath
To make it efficient in your case, you may need a bit of custom code to emit the top k per partition and then send only those to the driver. On the driver you can then take the top k of the combined per-partition top k lists (assuming you have (object, count) pairs for each top k list). — Sent from Mailbox
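
A minimal sketch of that two-phase approach, assuming an RDD of (object, count) pairs (names are illustrative):

    val k = 10
    val topK = counts // RDD[(String, Long)]
      .mapPartitions { iter =>
        // top k within each partition; a bounded priority queue
        // would avoid materialising and sorting a whole partition
        iter.toArray.sortBy(-_._2).take(k).iterator
      }
      .collect()        // only k * numPartitions records reach the driver
      .sortBy(-_._2)
      .take(k)

Note that RDD.top(k), given a custom Ordering, implements essentially this pattern.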

Re: taking top k values of rdd

2014-07-05 Thread Nick Pentreath
com.twitter.algebird.mutable.PriorityQueueMonoid.build to limit the sizes of the queues). but this still means i am sending k items per partition to my driver, so k x p, while i only need k. thanks! koert On Sat, Jul 5, 2014 at 1:21 PM, Nick Pentreath nick.pentre...@gmail.com wrote: To make it efficient in your case you may need to do

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Nick Pentreath
For linear models the 3rd option is by far the most efficient, and I suspect what Evan is alluding to. Unfortunately it's not directly possible with the classes in MLlib now, so you'll have to roll your own using the underlying SGD / BFGS primitives. — Sent from Mailbox On Sat, Jul 5, 2014 at 10:45

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Nick Pentreath
You may look into the new Azkaban - which, while being quite heavyweight, is actually quite pleasant to use when set up. You can run Spark jobs (spark-submit) using Azkaban shell commands and pass parameters between jobs. It supports dependencies, simple DAGs and scheduling with retries.

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Nick Pentreath
almost rewrite it totally. Don't recommend it really. From: Nick Pentreath nick.pentre...@gmail.com Reply-To: user@spark.apache.org Date: Friday, 11 July 2014 at 3:18 PM To: user@spark.apache.org Subject: Re: Recommended pipeline automation tool? Oozie? You may look into the new Azkaban - which while being quite

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Nick Pentreath
You could try the following: create a minimal project using sbt or Maven, add spark-streaming-twitter as a dependency, run sbt assembly (or mvn package) on that to create a fat jar (with Spark as a provided dependency), and add that to the shell classpath when starting up. On Tue, Jul 15, 2014 at
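
A hypothetical build.sbt along those lines (artifact names are real; versions are illustrative, and the sbt-assembly plugin must also be added to the build):

    name := "twitter-shell-deps"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"         % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"
    )

Running sbt assembly then produces a fat jar under target/ that can be passed to the shell via --jars (the exact jar path depends on the project name).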

Re: Count distinct with groupBy usage

2014-07-15 Thread Nick Pentreath
You can use .distinct.count on your user RDD. What are you trying to achieve with the time group-by? — Sent from Mailbox On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote: Hi -- New to Spark and trying to figure out how to generate unique counts per page by date given

Re: Large scale ranked recommendation

2014-07-18 Thread Nick Pentreath
It is very true that making predictions in batch for all 1 million users against the 10k items will be quite onerous in terms of computation. I have run into this issue too in making batch predictions. Some ideas: 1. Do you really need to generate recommendations for each user in batch? How are

Re: Large scale ranked recommendation

2014-07-18 Thread Nick Pentreath
Agree GPUs may be interesting for this kind of massively parallel linear algebra on reasonable size vectors. These projects might be of interest in this regard: https://github.com/BIDData/BIDMach https://github.com/BIDData/BIDMat https://github.com/dlwh/gust Nick On Fri, Jul 18, 2014 at 7:40

Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Nick Pentreath
I got this working locally a little while ago when playing around with AvroKeyInputFile: https://gist.github.com/MLnick/5864741781b9340cb211 But not sure about AvroSequenceFile. Any chance you have an example datafile or records? On Sat, Jul 19, 2014 at 11:00 AM, Sparky gullo_tho...@bah.com

Re: Spark clustered client

2014-07-23 Thread Nick Pentreath
At the moment your best bet for sharing SparkContexts across jobs will be Ooyala job server: https://github.com/ooyala/spark-jobserver It doesn't yet support spark 1.0 though I did manage to amend it to get it to build and run on 1.0 — Sent from Mailbox On Wed, Jul 23, 2014 at 1:21 AM, Asaf

Re: Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Nick Pentreath
Load from sequenceFile for PySpark is in master, and save is in this PR underway (https://github.com/apache/spark/pull/1338). I hope that Kan will have it ready to merge in time for the 1.1 release window (it should be; the PR just needs a final review or two). In the meantime you can check out master

Re: iScala or Scala-notebook

2014-07-29 Thread Nick Pentreath
IScala itself seems to be a bit dead unfortunately. I did come across this today: https://github.com/tribbloid/ISpark On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989 ericjohnston1...@gmail.com wrote: Hey everyone, I know this was asked before but I'm wondering if there have since been

Re: zip two RDD in pyspark

2014-07-30 Thread Nick Pentreath
parallelize uses the default Serializer (PickleSerializer) while textFile uses UTF8Serializer. You can get around this with index.zip(input_data._reserialize()) (or index.zip(input_data.map(lambda x: x))) (But if you try to just do this, you run into the issue with different number of

Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-07 Thread Nick Pentreath
I'm also getting this - Ryan, we both seem to be running into this issue with elasticsearch-hadoop :) I tried spark.files.userClassPathFirst true on the command line and that doesn't work. If I put that line in spark/conf/spark-defaults it works, but now I'm getting: java.lang.NoClassDefFoundError:

Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-08 Thread Nick Pentreath
By the way, for anyone using elasticsearch-hadoop, there is a fix for this here: https://github.com/elasticsearch/elasticsearch-hadoop/issues/239 Ryan - using the nightly snapshot build of 2.1.0.BUILD-SNAPSHOT fixed this for me. On Thu, Aug 7, 2014 at 3:58 PM, Nick Pentreath nick.pentre

Re: Failed running Spark ALS

2014-09-19 Thread Nick Pentreath
Have you set spark.local.dir (I think this is the config setting)? It needs to point to a volume with plenty of space. By default, if I recall, it points to /tmp. Sent from my iPhone On 19 Sep 2014, at 23:35, jw.cmu jinliangw...@gmail.com wrote: I'm trying to run Spark ALS using the Netflix

Re: spark 1.1.0 - hbase 0.98.6-hadoop2 version - py4j.protocol.Py4JJavaError java.lang.ClassNotFoundException

2014-10-04 Thread Nick Pentreath
forgot to copy user list On Sat, Oct 4, 2014 at 3:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: what version did you put in the pom.xml? it does seem to be in Maven central: http://search.maven.org/#artifactdetails%7Corg.apache.hbase%7Chbase%7C0.98.6-hadoop2%7Cpom dependency

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed. I guess the question is: why do you need a distributed model? Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (i.e. cannot

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
For ALS if you want real time recs (and usually this is order 10s to a few 100s ms response), then Spark is not the way to go - a serving layer like Oryx, or prediction.io is what you want. (At graphflow we've built our own). You hold the factor matrices in memory and do the dot product in

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread Nick Pentreath
Feel free to add that converter as an option in the Spark examples via a PR :) — Sent from Mailbox On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote: Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-26 Thread Nick Pentreath
copying user group - I keep replying directly vs reply all :) On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com wrote: ALS will be guaranteed to decrease the squared error (therefore RMSE) in each iteration, on the *training* set. This does not hold for the *test* set

Re: locality sensitive hashing for spark

2014-12-21 Thread Nick Pentreath
Looks interesting, thanks for sharing. Does it support cosine similarity? I only saw Jaccard mentioned from a quick glance. — Sent from Mailbox On Mon, Dec 22, 2014 at 4:12 AM, morr0723 michael.d@gmail.com wrote: I've pushed out an implementation of locality sensitive hashing for

Re: SaveAsTextFile to S3 bucket

2015-01-26 Thread Nick Pentreath
Your output folder specifies rdd.saveAsTextFile(s3n://nexgen-software/dev/output); So it will try to write to /dev/output which is as expected. If you create the directory /dev/output upfront in your bucket, and try to save it to that (empty) directory, what is the behaviour? On Tue, Jan 27,

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-07 Thread Nick Pentreath
As I recall Oryx (the old version, and I assume the new one too) provide something like this: http://cloudera.github.io/oryx/apidocs/com/cloudera/oryx/als/common/OryxRecommender.html#recommendToAnonymous-java.lang.String:A-float:A-int- though Sean will be more on top of that than me :) On Mon,

Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Nick Pentreath
To answer your first question - yes, 1.3.0 did break backward compatibility for the change from SchemaRDD -> DataFrame. Spark SQL was an alpha component, so API-breaking changes could happen. It is no longer an alpha component as of 1.3.0, so this will not be the case in future. Adding toDF

Re: Iterative Algorithms with Spark Streaming

2015-03-16 Thread Nick Pentreath
MLlib supports streaming linear models: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression and k-means: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means With an iteration parameter of 1, this amounts to mini-batch SGD where the mini-batch is

Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Nick Pentreath
As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items it's easy to get all the item vectors in memory, so pre-computing is not too bad. Still, since you plan to serve these via a REST service anyway, computing on demand via a serving layer such as Oryx or

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Nick Pentreath
I've found people.toDF gives you a DataFrame (roughly equivalent to the previous Row RDD), and you can then call registerTempTable on that DataFrame. So people.toDF.registerTempTable("people") should work — Sent from Mailbox On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell
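
A minimal end-to-end sketch for Spark 1.3 (file path and schema are illustrative):

    case class Person(name: String, age: Int)

    import sqlContext.implicits._ // brings toDF into scope

    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.toDF().registerTempTable("people")
    val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")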

Re: How do you write Dataframes to elasticsearch

2015-03-25 Thread Nick Pentreath
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon:  https://github.com/elastic/elasticsearch-hadoop/issues/400 However in the meantime you could use df.toRDD.saveToEs - though you may have to manipulate the Row object perhaps to extract fields, not sure if it will

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile  — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project .

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Nick Pentreath
www.AnnaiSystems.com On Mar 25, 2015, at 11:43 PM, Nick Pentreath nick.pentre...@gmail.com wrote: From a quick look at this link - http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems you need to call some static methods on AccumuloInputFormat in order

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
You can indeed override the Hadoop configuration at a per-RDD level - though it is a little more verbose, as in the below example, and you need to effectively make a copy of the hadoop Configuration: val thisRDDConf = new Configuration(sc.hadoopConfiguration)
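
Continuing that snippet, a sketch of how the copied configuration might be used (the config key and path are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val thisRDDConf = new Configuration(sc.hadoopConfiguration)
    thisRDDConf.set("mapreduce.input.fileinputformat.split.minsize", "134217728")

    // the override applies to this RDD only; sc.hadoopConfiguration is untouched
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      thisRDDConf)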

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Nick Pentreath
From a quick look at this link - http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting these config options first and then call newAPIHadoopRDD? On

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Nick Pentreath
Fair enough but I'd say you hit that diminishing return after 20 iterations or so... :) On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip yipjus...@gmail.com wrote: Thanks Xiangrui, I used 80 iterations to demonstrates the marginal diminishing return in prediction quality :) Justin On Apr 2,

Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Nick Pentreath
Is your ES cluster reachable from your Spark cluster via network / firewall? Can you run the same query from the Spark master and slave nodes via curl / one of the other clients? Seems odd that GC issues would be a problem from the scan but not when running the query from a browser plugin...

Re: MLlib -Collaborative Filtering

2015-04-20 Thread Nick Pentreath
You will have to get the two user factor vectors from the ALS model and compute the cosine similarity between them. You can do this using Breeze vectors: import breeze.linalg._ val user1 = new DenseVector[Double](userFactors.lookup(user1).head) val user2 = new
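
A completed version of that sketch (user IDs are illustrative; userFactors would be the userFeatures RDD of a trained MatrixFactorizationModel):

    import breeze.linalg._

    val user1 = new DenseVector[Double](userFactors.lookup(1).head)
    val user2 = new DenseVector[Double](userFactors.lookup(2).head)

    // cosine similarity = dot product divided by the product of L2 norms
    val cosineSim = (user1 dot user2) / (norm(user1) * norm(user2))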

Re: solr in spark

2015-04-28 Thread Nick Pentreath
I haven't used Solr for a long time, and haven't used Solr in Spark. However, why do you say Elasticsearch is not a good option...? ES absolutely supports full-text search and not just filtering and grouping (in fact its original purpose was, and still is, text search, though filtering, grouping

Re: solr in spark

2015-04-28 Thread Nick Pentreath
Gangele gangele...@gmail.com wrote: Thanks for the reply. Will the Elasticsearch index be within my cluster, or do I need to host Elasticsearch separately? On 28 April 2015 at 22:03, Nick Pentreath nick.pentre...@gmail.com wrote: I haven't used Solr for a long time, and haven't used Solr in Spark

Re: Content based filtering

2015-05-12 Thread Nick Pentreath
Content-based filtering is a pretty broad term - do you have any particular approach in mind? MLlib does not have any purely content-based methods. Your main alternative is ALS collaborative filtering. However, using a system like Oryx / PredictionIO / elasticsearch etc. you can combine

Re: Passing Elastic Search Mappings in Spark Conf

2015-04-15 Thread Nick Pentreath
If you want to specify mapping you must first create the mappings for your index types before indexing. As far as I know there is no way to specify this via ES-hadoop. But it's best practice to explicitly create mappings prior to indexing, or to use index templates when dynamically creating

Re: When querying ElasticSearch, score is 0

2015-04-18 Thread Nick Pentreath
ES-hadoop uses a scan & scroll search to efficiently retrieve large result sets. Scores are not tracked in a scan, and sorting is not supported, hence 0 scores. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan — Sent from Mailbox

Re: MLlib -Collaborative Filtering

2015-04-18 Thread Nick Pentreath
What do you mean by similarity table of 2 users? Do you mean the similarity between 2 users? — Sent from Mailbox On Sat, Apr 18, 2015 at 11:09 AM, riginos samarasrigi...@gmail.com wrote: Is there any way that i can see the similarity table of 2 users in that algorithm? -- View this

Re: Difference between textFile Vs hadoopFile (TextInputFormat) on HDFS data

2015-04-07 Thread Nick Pentreath
There is no difference - textFile calls hadoopFile with a TextInputFormat, and maps each value to a String.  — Sent from Mailbox On Tue, Apr 7, 2015 at 1:46 PM, Puneet Kumar Ojha puneet.ku...@pubmatic.com wrote: Hi , Is there any difference between Difference between textFile Vs hadoopFile
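
Roughly what textFile does internally, as a sketch:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // equivalent to sc.textFile(path): read (byte offset, line) pairs
    // and keep only the line text
    val lines = sc
      .hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      .map(pair => pair._2.toString)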

Re: Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-04 Thread Nick Pentreath
It shouldn't be too bad - the pertinent migration notes are here: http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark for pre-1.0, and here: http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 for

Re: Spark and Google Cloud Storage

2015-06-18 Thread Nick Pentreath
I believe it is available here: https://cloud.google.com/hadoop/google-cloud-storage-connector 2015-06-18 15:31 GMT+02:00 Klaus Schaefers klaus.schaef...@ligatus.com: Hi, is there a kind adapter to use GoogleCloudStorage with Spark? Cheers, Klaus -- -- Klaus Schaefers Senior

Re: Spark Titan

2015-06-21 Thread Nick Pentreath
Something like this works (or at least worked with Titan 0.4 back when I was using it): val graph = sc.newAPIHadoopRDD( configuration, fClass = classOf[TitanHBaseInputFormat], kClass = classOf[NullWritable], vClass = classOf[FaunusVertex]) graph.flatMap { vertex =>

Re: Velox Model Server

2015-06-24 Thread Nick Pentreath
is async mini-batch near-real-time scoring, pushing results to some store for retrieval, which could be entirely suitable for your use case. On Tue, Jun 23, 2015 at 8:52 AM, Nick Pentreath nick.pentre...@gmail.com wrote: If your recommendation needs are real-time (1s) I am not sure job server

Re: Matrix Multiplication and mllib.recommendation

2015-06-18 Thread Nick Pentreath
case 10K products form one block. Note that you would then have to union your recommendations. And if there lots of product blocks, you might also want to checkpoint once every few times. Regards Sab On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath nick.pentre...@gmail.com wrote: One

Re: Velox Model Server

2015-06-22 Thread Nick Pentreath
How large are your models? Spark job server does allow synchronous job execution and with a warm long-lived context it will be quite fast - but still in the order of a second or a few seconds usually (depending on model size - for very large models possibly quite a lot more than that).

Re: Velox Model Server

2015-06-21 Thread Nick Pentreath
Is there a presentation up about this end-to-end example? I'm looking into velox now - our internal model pipeline just saves factors to S3 and model server loads them periodically from S3 — Sent from Mailbox On Sat, Jun 20, 2015 at 9:46 PM, Debasish Das debasish.da...@gmail.com wrote:

RE: Matrix Multiplication and mllib.recommendation

2015-06-17 Thread Nick Pentreath
One issue is that you broadcast the product vectors and then do a dot product one-by-one with the user vector. You should try forming a matrix of the item vectors and doing the dot product as a matrix-vector multiply which will make things a lot faster. Another optimisation that is
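
A sketch of that matrix-vector formulation using Breeze, assuming the item factors have been collected to the driver (names and sizes are illustrative):

    import breeze.linalg.{DenseMatrix, DenseVector}

    val rank = 20
    // itemFactors: Array[Array[Double]], one factor vector of length rank per item;
    // DenseMatrix is column-major, so each item vector becomes one column
    val itemMatrix = new DenseMatrix(rank, itemFactors.length, itemFactors.flatten)

    val userVector = DenseVector(userFactor) // Array[Double] of length rank
    val scores = itemMatrix.t * userVector   // one score per item in a single multiply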

Re: ALS predictALL not completing

2015-06-17 Thread Nick Pentreath
-- From: Nick Pentreath nick.pentre...@gmail.com To: user@spark.apache.org Sent: Tuesday, June 16, 2015 4:23 AM Subject: Re: ALS predictALL not completing Which version of Spark are you using? On Tue, Jun 16, 2015 at 6:20 AM, afarahat

Re: ALS predictALL not completing

2015-06-16 Thread Nick Pentreath
Which version of Spark are you using? On Tue, Jun 16, 2015 at 6:20 AM, afarahat ayman.fara...@yahoo.com wrote: Hello; I have a data set of about 80 million users and 12,000 items (very sparse). I can get the training part working no problem (model has 20 factors). However, when I try

Re: Spark job workflow engine recommendations

2015-08-11 Thread Nick Pentreath
I also tend to agree that Azkaban is somewhat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option. It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However it

Re: Is there any tool that i can prove to customer that spark is faster then hive ?

2015-08-12 Thread Nick Pentreath
Perhaps you could time the end-to-end runtime for each pipeline, and each stage? Though I'd be fairly confident that Spark will outperform Hive/Mahout on MR, that's not the only consideration - having everything on a single platform and the Spark / DataFrame API is a huge win just by itself

Re: RDD[Future[T]] = Future[RDD[T]]

2015-07-27 Thread Nick Pentreath
is completed, no? 2015-07-27 7:24 GMT+02:00 Nick Pentreath [hidden email]: You could use Iterator.single on the future[iterator]. However if you collect all the partitions I'm not sure if it will work across executor boundaries. Perhaps you may need

Re: Velox Model Server

2015-07-13 Thread Nick Pentreath
not sure if it is used as a full-fledged distributed cache or not. Maybe it is being used as a ZooKeeper alternative. On Wed, Jun 24, 2015 at 2:02 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Ok My view is with only 100k items, you are better off serving in-memory for item vectors

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Nick Pentreath
Yup, currently PMML export, or Java serialization, are the options realistically available. Though PMML may deter some, there are not many viable cross-platform alternatives (with nearly as much coverage). On Thu, Nov 12, 2015 at 1:42 PM, Sean Owen wrote: > This is all

Re: DynamoDB Connector?

2015-11-16 Thread Nick Pentreath
See this thread for some info: http://apache-spark-user-list.1001560.n3.nabble.com/DynamoDB-input-source-td8814.html I don't think the situation has changed that much - if you're using Spark on EMR, then I think the InputFormat is available in a JAR (though I haven't tested that). Otherwise

Re: Machine learning with spark (book code example error)

2015-10-14 Thread Nick Pentreath
Hi there. I'm the author of the book (thanks for buying it by the way :) Ideally if you're having any trouble with the book or code, it's best to contact the publisher and submit a query ( https://www.packtpub.com/books/content/support/17400) However, I can help with this issue. The problem is

Re: How to specify the numFeatures in HashingTF

2015-10-15 Thread Nick Pentreath
Setting numFeatures higher than the vocab size will tend to reduce the chance of hash collisions, but it's not strictly necessary - it becomes a memory / accuracy trade-off. Surprisingly, the impact on model performance of moderate hash collisions is often not significant. So it may
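
A small sketch of that trade-off knob (the bucket count is illustrative; the MLlib default is 2^20):

    import org.apache.spark.mllib.feature.HashingTF

    // fewer buckets means less memory per vector but more hash collisions
    val tf = new HashingTF(1 << 18)
    val vec = tf.transform(Seq("spark", "hashing", "tf", "spark"))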

Re: Spark job workflow engine recommendations

2015-10-07 Thread Nick Pentreath
ability is a known issue due to the current architecture. However this will be applicable if you run more than 20K jobs per day. On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread Nick Pentreath
I think the issue with pulling in all of spark-core is often with dependencies (and versions) conflicting with the web framework (or Akka in many cases). Plus it really is quite heavy if you just want a fairly lightweight model-serving app. For example we've built a fairly simple but scalable ALS

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Nick Pentreath
While it's true locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster. ES-hadoop uses ES scan

Re: Spark ANN

2015-09-07 Thread Nick Pentreath
Haven't checked the actual code but that doc says "MLPC employes backpropagation for learning the model. .."? — Sent from Mailbox On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanov wrote: > http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html >

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Nick Pentreath
You might want to check out https://github.com/lensacom/sparkit-learn Though it's true for random Forests / trees you will need to use MLlib — Sent from Mailbox On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke wrote: > I fear you have to do the plumbing all yourself.

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-13 Thread Nick Pentreath
pipelines, if you do test both out. — Sent from Mailbox On Sat, Sep 12, 2015 at 10:52 PM, Rex X <dnsr...@gmail.com> wrote: > Jorn and Nick, > Thanks for answering. > Nick, the sparkit-learn project looks interesting. Thanks for mentioning it. > Rex > On Sat, Sep 12, 2015 at 12:
