Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi,

Welcome, you probably want RDD.saveAsTextFile("hdfs:///my_file")

Chris

 On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote:
 
 
 Hello All,
 I am new to Spark. I have a very basic question. How do I write the output of 
 an action on an RDD to HDFS? 
 
 Thanks in advance for the help.
 
 Cheers,
 Ravi
 





Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi,

For this case, you could simply do 
sc.parallelize([rdd.first()]).saveAsTextFile("hdfs:///my_file") using pyspark
or sc.parallelize(Array(rdd.first())).saveAsTextFile("hdfs:///my_file") using 
Scala
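
As a fuller sketch of the whole task (read from HDFS, take the first line, 
write it back), with placeholder paths rather than anything from your setup:

val rdd = sc.textFile("hdfs:///input/my_file")   // hypothetical input path
val firstLine = rdd.first()                      // action: brings one line back to the driver
// first() returns a plain String, so wrap it in an RDD again before saving
sc.parallelize(Seq(firstLine)).saveAsTextFile("hdfs:///output/first_line")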

Chris

 On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote:
 
 Hi Chris,
 Thanks for the quick reply and the welcome. I am trying to read a file from 
 HDFS and then write back just the first line to HDFS. 
 
 I am calling first() on the RDD to get the first line. 
 
 Sent from my iPhone
 
 On Jun 22, 2015, at 7:42 PM, Chris Gore cdg...@cdgore.com wrote:
 
 Hi Ravi,
 
 Welcome, you probably want RDD.saveAsTextFile("hdfs:///my_file")
 
 Chris
 
 On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote:
 
 
 Hello All,
 I am new to Spark. I have a very basic question. How do I write the output 
 of an action on an RDD to HDFS? 
 
 Thanks in advance for the help.
 
 Cheers,
 Ravi
 





Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Chris Gore
I tried running this data set as described with my own implementation of L2 
regularized logistic regression using LBFGS to compare:
https://github.com/cdgore/fitbox

Intercept: -0.886745823033
Weights (['gre', 'gpa', 'rank']):[ 0.28862268  0.19402388 -0.36637964]
Area under ROC: 0.724056603774

The difference could be from the feature preprocessing as mentioned.  I 
normalized the features around 0:

binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()
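
In Spark itself, a comparable standardization can be done with MLlib's 
StandardScaler; a minimal sketch, assuming RDD[LabeledPoint]s named train and 
test:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Fit mean/std on the training features only, then apply to both sets
val scaler = new StandardScaler(withMean = true, withStd = true).fit(train.map(_.features))
val trainScaled = train.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
val testScaled  = test.map(p => LabeledPoint(p.label, scaler.transform(p.features)))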

On a data set this small, the difference in models could also be the result of 
how the training/test sets were split.

Have you tried running k-folds cross validation on a larger data set?
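
For reference, MLlib has a k-fold split helper; a minimal sketch, assuming an 
RDD[LabeledPoint] named data and the same metrics as in your snippet:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Array of (training, validation) RDD pairs: 5 folds, seed 42
val folds = MLUtils.kFold(data, 5, 42)
val aucs = folds.map { case (train, validation) =>
  val model = new LogisticRegressionWithLBFGS().setIntercept(true).run(train)
  model.clearThreshold()
  val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}
println(s"mean AUC over folds: ${aucs.sum / aucs.length}")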

Chris

 On May 20, 2015, at 6:15 PM, DB Tsai d...@netflix.com.INVALID wrote:
 
 Hi Xin,
 
 If you take a look at the model you trained, the intercept from Spark
 is significantly smaller than StatsModel, and the intercept represents
 a prior on the categories in LOR, which causes the low accuracy in the Spark
 implementation. In LogisticRegressionWithLBFGS, the intercept is
 regularized due to the implementation of Updater, but the intercept
 should not be regularized.
 
 In the new pipeline APIs, a LOR with elasticNet is implemented, and
 the intercept is properly handled.
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 
 As you can see in the tests,
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 the result is exactly the same as R now.
 
 BTW, in both versions, the feature scaling is done before training,
 and we train the model in the scaled space but transform the model weights
 back to the original space. The only difference is that in the mllib version,
 LogisticRegressionWithLBFGS regularizes the intercept while in the ml
 version, the intercept is excluded from regularization. As a result,
 if lambda is zero, the model should be the same.
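 
 A minimal sketch of the new spark.ml usage (assuming a DataFrame named
 training with "label" and "features" columns):
 
 import org.apache.spark.ml.classification.LogisticRegression
 
 val lr = new LogisticRegression()
   .setFitIntercept(true)      // intercept excluded from regularization here
   .setRegParam(0.01)
   .setElasticNetParam(0.0)    // 0.0 = pure L2, 1.0 = pure L1
   .setMaxIter(100)
 val lrModel = lr.fit(training)
 println(s"Weights: ${lrModel.weights}  Intercept: ${lrModel.intercept}")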
 
 
 
 On Wed, May 20, 2015 at 3:42 PM, Xin Liu liuxin...@gmail.com wrote:
 Hi,
 
 I have tried a few models in Mllib to train a LogisticRegression model.
 However, I consistently get much better results using other libraries such
 as statsmodel (which gives similar results as R) in terms of AUC. For
 illustration purposes, I used a small data set (I have tried much bigger data)
 http://www.ats.ucla.edu/stat/data/binary.csv in
 http://www.ats.ucla.edu/stat/r/dae/logit.htm
 
 Here is the snippet of my usage of LogisticRegressionWithLBFGS.
 
 val algorithm = new LogisticRegressionWithLBFGS
 algorithm.setIntercept(true)
 algorithm.optimizer
   .setNumIterations(100)
   .setRegParam(0.01)
   .setConvergenceTol(1e-5)
 val model = algorithm.run(training)
 model.clearThreshold()
 val scoreAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 val metrics = new BinaryClassificationMetrics(scoreAndLabels)
 val auROC = metrics.areaUnderROC()
 
 I did a (0.6, 0.4) split for training/test. The response is admit and
 features are GRE score, GPA, and college Rank.
 
 Spark:
 Weights (GRE, GPA, Rank):
 [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
 Intercept: -0.6488972641282202
 Area under ROC: 0.6294070512820512
 
 StatsModel:
 Weights [0.0018, 0.7220, -0.3148]
 Intercept: -3.5913
 Area under ROC: 0.69
 
 The weights from statsmodel seem more reasonable if you consider that for a one
 unit increase in gpa, the log odds of being admitted to graduate school
 increases by 0.72 in statsmodel versus 0.04 in Spark.
 
 I have seen much bigger differences with other data. So my question is: has
 anyone compared the results with other libraries, and is anything wrong with
 my code invoking LogisticRegressionWithLBFGS?
 
 The real data I am processing is pretty big, and I really want to use Spark
 to get this to work. Please let me know if you have similar experience and
 how you resolved it.
 
 Thanks,
 Xin
 
 



Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Chris Gore
Good to hear there will be partitioning support.  I’ve had some success loading 
partitioned data specified with Unix globbing format, i.e.:

sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00")

would load dates 2014-11-24 through 2014-11-30.  Not the most ideal solution, 
but it seems to work for loading data from a range.
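
Another option is to build an explicit comma-separated path list, since 
sc.textFile accepts comma-separated paths; a sketch using the same placeholder 
bucket/prefix (and assuming Java 8's java.time is available):

import java.time.LocalDate

// One path per day in the range, joined with commas
val start = LocalDate.of(2014, 11, 24)
val end   = LocalDate.of(2014, 11, 30)
val paths = Iterator.iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(d => s"s3://bucket/directory/dt=${d}T00-00-00")
  .mkString(",")
val rdd = sc.textFile(paths)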

Best,
Chris

 On Jan 26, 2015, at 10:55 AM, Cheng Lian lian.cs@gmail.com wrote:
 
 Currently no, if you don't want to use Spark SQL's HiveContext. But we're 
 working on adding partitioning support to the external data sources API, with 
 which you can create, for example, partitioned Parquet tables without using 
 Hive.
 
 Cheng
 
 On 1/26/15 8:47 AM, Danny Yates wrote:
 Thanks Michael.
 
 I'm not actually using Hive at the moment - in fact, I'm trying to avoid it 
 if I can. I'm just wondering whether Spark has anything similar I can 
 leverage?
 
 Thanks
 
 
 





Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Hi Sameer,

MLLib uses Breeze’s vector format under the hood.  You can use that.  
http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector

For example:

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

val numClasses = classes.distinct.count.toInt

val userWithClassesAsSparseVector = rows.map(x => (x.userID, new 
BSV[Double](x.classIDs.sortWith(_ < _), 
Seq.fill(x.classIDs.length)(1.0).toArray, numClasses).asInstanceOf[BV[Double]]))

Chris

On Sep 15, 2014, at 11:28 AM, Sameer Tilak ssti...@live.com wrote:

 Hi All,
 I have transformed the data into following format: First column is user id, 
 and then all the other columns are class ids. For a user only class ids that 
 appear in this row have value 1 and others are 0.  I need to create a sparse 
 vector from this. Does the API for creating a sparse vector directly support 
 this format?  
 
 User id     Product class ids
 
 2622572     145447 1620 13421 28565 285556 293 4553 67261 130 3646 1671 18806 
             183576 3286 51715 57671 57476



Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Probably worth noting that the factory methods in mllib create an object of 
type org.apache.spark.mllib.linalg.Vector which stores data in a similar format 
as Breeze vectors
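
A minimal sketch of that approach for the rows in the original question (the 
field names userID and classIDs are assumed):

import org.apache.spark.mllib.linalg.Vectors

// One-hot encode each user's class ids into an MLlib sparse vector
val numClasses = rows.flatMap(_.classIDs).max + 1
val userVectors = rows.map { r =>
  (r.userID, Vectors.sparse(numClasses, r.classIDs.map(id => (id, 1.0)).toSeq))
}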

Chris

On Sep 15, 2014, at 3:24 PM, Xiangrui Meng men...@gmail.com wrote:

 Or you can use the factory method `Vectors.sparse`:
 
 val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0)))
 
 where numProducts should be the largest product id plus one.
 
 Best,
 Xiangrui
 
 On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote:
 Hi Sameer,
 
 MLLib uses Breeze’s vector format under the hood.  You can use that.
 http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector
 
 For example:
 
 import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
 
 val numClasses = classes.distinct.count.toInt
 
 val userWithClassesAsSparseVector = rows.map(x => (x.userID, new
 BSV[Double](x.classIDs.sortWith(_ < _),
 Seq.fill(x.classIDs.length)(1.0).toArray,
 numClasses).asInstanceOf[BV[Double]]))
 
 Chris
 
 On Sep 15, 2014, at 11:28 AM, Sameer Tilak ssti...@live.com wrote:
 
 Hi All,
 I have transformed the data into following format: First column is user id,
 and then all the other columns are class ids. For a user only class ids that
 appear in this row have value 1 and others are 0.  I need to create a sparse
 vector from this. Does the API for creating a sparse vector directly
 support this format?
 
 User id     Product class ids
 
 2622572 145447 1620 13421 28565 285556 293 4553 67261 130 3646 1671 18806
 183576 3286 51715 57671 57476
 
 





Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Chris Gore
There is support for Spark in ElasticSearch’s Hadoop integration package.

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html

Maybe you could split and insert all of your documents from Spark and then 
query for “MoreLikeThis” on the ElasticSearch index.  I haven’t tried it, but 
maybe someone else has more experience using Spark with ElasticSearch.  At some 
point, maybe there could be an information retrieval package for Spark with 
locality sensitive hashing and other similar functions.
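
For purely positional neighbor access, one common approach is zipWithIndex plus 
a self-join; a minimal sketch along those lines:

import org.apache.spark.SparkContext._

// Pair each line with the line that follows it, so broken sentence breaks
// ("...as shown in Fig." followed by "1.") can be detected and merged.
val lines   = sc.textFile("SomeArticle.txt")
val indexed = lines.zipWithIndex().map { case (text, i) => (i, text) }
val nexts   = indexed.map { case (i, text) => (i - 1, text) }  // shift indices up by one
val withNext = indexed.leftOuterJoin(nexts)   // (index, (line, Option(next line)))
  .sortByKey()
  .values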

 
On Sep 3, 2014, at 10:40 AM, Victor Tso-Guillen v...@paxata.com wrote:

 Interestingly, there was an almost identical question posed on Aug 22 by 
 cjwang. Here's the link to the archive: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664
 
 
 On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) 
 r.dan...@elsevier.com wrote:
 Hi all,
 
 Assume I have read the lines of a text file into an RDD:
 
 textFile = sc.textFile("SomeArticle.txt")
 
 Also assume that the sentence breaks in SomeArticle.txt were done by machine 
 and have some errors, such as the break at Fig. in the sample text below.
 
 Index   Text
 N       ...as shown in Fig.
 N+1     1.
 N+2     The figure shows...
 
 What I want is an RDD with:
 
 N       ... as shown in Fig. 1.
 N+1     The figure shows...
 
 Is there some way a filter() can look at neighboring elements in an RDD? That 
 way I could look, in parallel, at neighboring elements in an RDD and come up 
 with a new RDD that may have a different number of elements.  Or do I just 
 have to sequentially iterate through the RDD?
 
 Thanks,
 Ron
 
 
 



Re: Error: No space left on device

2014-07-16 Thread Chris Gore
Hi Chris,

I've encountered this error when running Spark’s ALS methods too.  In my case, 
it was because I set spark.local.dir improperly, and every time there was a 
shuffle, it would spill many GB of data onto the local drive.  What fixed it 
was setting it to use the /mnt directory, where a network drive is mounted.  
For example, setting an environment variable:

export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')

Then add -Dspark.local.dir=$SPACE or simply 
-Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver application.
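
The same setting can also be applied programmatically; a sketch (the paths are 
just the example mount points above):

import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir accepts a comma-separated list of scratch directories
val conf = new SparkConf()
  .setAppName("ALSJob")
  .set("spark.local.dir", "/mnt/spark/,/mnt2/spark/")
val sc = new SparkContext(conf)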

Chris

On Jul 15, 2014, at 11:39 PM, Xiangrui Meng men...@gmail.com wrote:

 Check the number of inodes (df -i). The assembly build may create many
 small files. -Xiangrui
 
 On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois chris.dub...@gmail.com wrote:
 Hi all,
 
 I am encountering the following error:
 
 INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space
 left on device [duplicate 4]
 
 For each slave, df -h looks roughly like this, which makes the above error
 surprising.
 
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/xvda1      7.9G  4.4G  3.5G  57% /
 tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
 /dev/xvdb        37G  3.3G   32G  10% /mnt
 /dev/xvdf        37G  2.0G   34G   6% /mnt2
 /dev/xvdv       500G   33M  500G   1% /vol
 
 I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
 spark-ec2 scripts and a clone of spark from today. The job I am running
 closely resembles the collaborative filtering example. This issue happens
 with the 1M version as well as the 10 million rating version of the
 MovieLens dataset.
 
 I have seen previous questions, but they haven't helped yet. For example, I
 tried setting the Spark tmp directory to the EBS volume at /vol/, both by
 editing the spark conf file (and copy-dir'ing it to the slaves) as well as
 through the SparkConf. Yet I still get the above error. Here is my current
 Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.
 
 conf = SparkConf()
 conf.setAppName("RecommendALS").set("spark.local.dir", "/vol/") \
     .set("spark.executor.memory", "7g").set("spark.akka.frameSize", "100") \
     .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
 sc = SparkContext(conf=conf)
 
 Thanks for any advice,
 Chris
 



Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Chris Gore
We'd love to see a Spark user group in Los Angeles and connect with others 
working with it here.

Ping me if you're in the LA area and use Spark at your company ( 
ch...@retentionscience.com ).

Chris
 
Retention Science
call: 734.272.3099
visit: Site | like: Facebook | follow: Twitter

On Mar 31, 2014, at 10:42 AM, Anurag Dodeja anu...@anuragdodeja.com wrote:

 How about Chicago?
 
 
 On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Montreal or Toronto?
 
 
 On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson mar...@skimlinks.com wrote:
 How about London?
 
 
 -- 
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240  
 
 
 On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski andykonwin...@gmail.com 
 wrote:
 Hi folks,
 
 We have seen a lot of community growth outside of the Bay Area and we are 
 looking to help spur even more!
 
 For starters, the organizers of the Spark meetups here in the Bay Area want 
 to help anybody that is interested in setting up a meetup in a new city.
 
 Some amazing Spark champions have stepped forward in Seattle, Vancouver, 
 Boulder/Denver, and a few other areas already.
 
 Right now, we are looking to connect with you Spark enthusiasts in NYC about 
 helping to run an inaugural Spark Meetup in your area.
 
 You can reply to me directly if you are interested and I can tell you about 
 all of the resources we have to offer (speakers from the core community, a 
 budget for food, help scheduling, etc.), and let's make this happen!
 
 Andy