Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Thanks Sean, your suggestions and the links provided are just what I needed
to start off with.

On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote:

 I think you're assuming that you will pre-compute recommendations and
 store them in Mongo. That's one way to go, with certain tradeoffs. You
 can precompute offline easily, and serve results at large scale
 easily, but, you are forced to precompute everything -- lots of wasted
 effort, not completely up to date.

 The front-end part of the stack looks right.

 Spark would do the model building; you'd have to write a process to
 score recommendations and store the result. Mahout is the same thing,
 really.

 500K items isn't all that large. Your requirements aren't driven just
 by items though. Number of users and latent features matter too. It
 matters how often you want to build the model too. I'm guessing you
 would get away with a handful of modern machines for a problem this
 size.


 In a way what you describe reminds me of Wibidata, since it built
 recommender-like solutions on top of data and results published to a
 NoSQL store. You might glance at the related OSS project Kiji
 (http://kiji.org/) for ideas about how to manage the schema.

 You should have a look at things like Nick's architecture for
 Graphflow, however it's more concerned with computing recommendation
 on the fly, and describes a shift from an architecture originally
 built around something like a NoSQL store:

 http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

 This is also the kind of ground the oryx project is intended to cover,
 something I've worked on personally:
 https://github.com/OryxProject/oryx   -- a layer on and around the
 core model building in Spark + Spark Streaming to provide a whole
 recommender (for example), down to the REST API.

 On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
 raoshashidhar...@gmail.com wrote:
  Hi,
 
  Can anyone who has developed recommendation engine suggest what could be
 the
  possible software stack for such an application.
 
  I am basically new to recommendation engine , I just found out Mahout and
  Spark Mlib which are available .
  I am thinking the below software stack.
 
  1. The user is going to use Android app.
  2.  Rest Api sent to app server from the android app to get
 recommendations.
  3. Spark Mlib core engine for recommendation engine
  4. MongoDB database backend.
 
  I would like to know more on the cluster configuration( how many nodes
 etc)
  part of spark for calculating the recommendations for 500,000 items. This
  items include products for day care etc.
 
  Other software stack suggestions would also be very useful.It has to run
 on
  multiple vendor machines.
 
  Please suggest.
 
  Thanks
  shashi



Re: [Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Cheng Lian
Currently there's no convenient way to convert a SchemaRDD/JavaSchemaRDD back 
to an RDD/JavaRDD of some case class. But you can convert a 
SchemaRDD/JavaSchemaRDD into an RDD[Row]/JavaRDD<Row> using schemaRdd.rdd and 
new JavaRDD<Row>(schemaRdd.rdd).
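For example, a minimal sketch of the Scala side (the Java side is analogous, going through new JavaRDD<Row>(schemaRdd.rdd)); the Person case class, file name, and column order here are made up for illustration:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical case class matching the Parquet schema (name: String, age: Int).
case class Person(name: String, age: Int)

// Assumes an existing SQLContext; a SchemaRDD is itself an RDD[Row],
// so rebuilding the case class is a plain map over Row objects.
val schemaRDD = sqlContext.parquetFile("people.parquet")
val people: RDD[Person] = schemaRDD.map(row => Person(row.getString(0), row.getInt(1)))
```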


Cheng

On 3/15/15 10:22 PM, Renato Marroquín Mogrovejo wrote:


Hi Spark experts,

Is there a way to convert a JavaSchemaRDD (for instance loaded from a 
parquet file) back to a JavaRDD of a given case class? I read on 
stackOverFlow[1] that I could do a select over the parquet file and 
then by reflection get the fields out, but I guess that would be an 
overkill.
Then I saw [2] from 2014 which says that this feature would be 
available in the future. So could you please let me know how I can 
accomplish this? Thanks in advance!



Renato M.

[1] 
http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class
[2] 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html




Re: Spark Streaming on Yarn Input from Flume

2015-03-15 Thread tarek_abouzeid
Have you fixed this issue?






Re: 1.3 release

2015-03-15 Thread Sean Owen
I think (I hope) it's because the generic builds just work. Even
though these are of course distributed mostly verbatim in CDH5, with
tweaks to be compatible with other stuff at the edges, the stock
builds should be fine too. Same for HDP as I understand.

The CDH4 build may work on some builds of CDH4, but I think it is lurking
there as a Hadoop 2.0.x plus a certain YARN beta build. I'd prefer
to rename it that way, myself, since it doesn't actually work with all
of CDH4 anyway.

Are the MapR builds there because the stock Hadoop build doesn't work
on MapR? That would actually surprise me, but then, why are these two
builds distributed?


On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
 Is there a reason why the prebuilt releases don't include current CDH distros 
 and YARN support?

 
 Eric Friedman



[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Renato Marroquín Mogrovejo
Hi Spark experts,

Is there a way to convert a JavaSchemaRDD (for instance loaded from a
Parquet file) back to a JavaRDD of a given case class? I read on
Stack Overflow [1] that I could do a select over the Parquet file and then
get the fields out by reflection, but I guess that would be overkill.
Then I saw [2], from 2014, which says that this feature would be available in
the future. So could you please let me know how I can accomplish this?
Thanks in advance!


Renato M.

[1]
http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class
[2]
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html


Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread David Mitchell
Thank you for your help.  toDF() solved my first problem.  And, the
second issue was a non-issue, since the second example worked without any
modification.

David


On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Programmatically specifying the schema needs

  import org.apache.spark.sql.types._

 for StructType and StructField to resolve.

 On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes I think this was already just fixed by:

 https://github.com/apache/spark/pull/4977

 a .toDF() is missing

 On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I've found people.toDF gives you a data frame (roughly equivalent to the
  previous Row RDD),
 
  And you can then call registerTempTable on that DataFrame.
 
  So people.toDF.registerTempTable("people") should work
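For reference, a minimal sketch of that fix against the 1.3.0 guide example (assumes the Person case class and people: RDD[Person] from the guide):

```scala
// In 1.3.0, toDF() comes from the implicit conversions on the SQLContext.
import sqlContext.implicits._

val peopleDF = people.toDF()           // RDD[Person] -> DataFrame
peopleDF.registerTempTable("people")   // registerTempTable is defined on DataFrame, not on RDD
sqlContext.sql("SELECT name FROM people").show()
```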
 
 
 
  —
  Sent from Mailbox
 
 
  On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell 
 jdavidmitch...@gmail.com
  wrote:
 
 
  I am pleased with the release of the DataFrame API.  However, I started
  playing with it, and neither of the two main examples in the
 documentation
  work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
 
  Specifically:
 
  Inferring the Schema Using Reflection
  Programmatically Specifying the Schema
 
 
  Scala 2.11.6
  Spark 1.3.0 prebuilt for Hadoop 2.4 and later
 
  Inferring the Schema Using Reflection
  scala> people.registerTempTable("people")
  <console>:31: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]
         people.registerTempTable("people")
                ^
 
  Programmatically Specifying the Schema
  scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
  <console>:41: error: overloaded method value createDataFrame with alternatives:
    (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], columns: java.util.List[String])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
   cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
         val df = sqlContext.createDataFrame(people, schema)
 
  Any help would be appreciated.
 
  David
 
 






-- 
### Confidential e-mail, for recipient's (or recipients') eyes only, not
for distribution. ###


Saving Dstream into a single file

2015-03-15 Thread tarek_abouzeid
I am doing the word count example on a Flume stream and trying to save the output as
text files in HDFS, but in the save directory I get multiple subdirectories, each
holding files of small size. I wonder if there is a way to append to one large file
instead of saving to multiple files, as I intend to save the output in a Hive HDFS
directory so I can query the result using Hive.

I hope someone has a workaround for this issue. Thanks in advance.






Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Sean Owen
I think you're assuming that you will pre-compute recommendations and
store them in Mongo. That's one way to go, with certain tradeoffs. You
can precompute offline easily, and serve results at large scale
easily, but, you are forced to precompute everything -- lots of wasted
effort, not completely up to date.

The front-end part of the stack looks right.

Spark would do the model building; you'd have to write a process to
score recommendations and store the result. Mahout is the same thing,
really.

500K items isn't all that large. Your requirements aren't driven just
by items though. Number of users and latent features matter too. It
matters how often you want to build the model too. I'm guessing you
would get away with a handful of modern machines for a problem this
size.


In a way what you describe reminds me of Wibidata, since it built
recommender-like solutions on top of data and results published to a
NoSQL store. You might glance at the related OSS project Kiji
(http://kiji.org/) for ideas about how to manage the schema.

You should have a look at things like Nick's architecture for
Graphflow, however it's more concerned with computing recommendation
on the fly, and describes a shift from an architecture originally
built around something like a NoSQL store:
http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

This is also the kind of ground the oryx project is intended to cover,
something I've worked on personally:
https://github.com/OryxProject/oryx   -- a layer on and around the
core model building in Spark + Spark Streaming to provide a whole
recommender (for example), down to the REST API.

On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
raoshashidhar...@gmail.com wrote:
 Hi,

 Can anyone who has developed recommendation engine suggest what could be the
 possible software stack for such an application.

 I am basically new to recommendation engine , I just found out Mahout and
 Spark Mlib which are available .
 I am thinking the below software stack.

 1. The user is going to use Android app.
 2.  Rest Api sent to app server from the android app to get recommendations.
 3. Spark Mlib core engine for recommendation engine
 4. MongoDB database backend.

 I would like to know more on the cluster configuration( how many nodes etc)
 part of spark for calculating the recommendations for 500,000 items. This
 items include products for day care etc.

 Other software stack suggestions would also be very useful.It has to run on
 multiple vendor machines.

 Please suggest.

 Thanks
 shashi




Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
Ah most interesting—thanks.

So it seems sc.textFile(longFileList) has to read all the metadata before starting 
the read, for partitioning purposes, so what you do is simply not use it? 

You create a task per file that reads one file (in parallel) per task without 
scanning for _all_ metadata. Can't argue with the logic, but perhaps Spark 
should incorporate something like this in sc.textFile? My case can't be that 
unusual, especially since I am periodically processing micro-batches from Spark 
Streaming. In fact I have to scan HDFS to create the longFileList to begin with, 
so I get file status and therefore probably all the metadata needed by 
sc.textFile. Your method would save one scan, which is good.

Might a better sc.textFile take a beginning URI, a file pattern regex, and a 
recursive flag? Then one scan could create all the metadata automatically for a 
large subset of people using the function, something like 

sc.textFile(beginDir: String, filePattern: String = "^part.*", recursive: Boolean = false)

In fact it should be easy to create a BetterSC that overrides the textFile method 
with a re-implementation that only requires one scan to get the metadata. 

Just thinking on email…

On Mar 14, 2015, at 11:11 AM, Michael Armbrust mich...@databricks.com wrote:


Here is how I have dealt with many small text files (on s3 though this should 
generalize) in the past:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
 


 
From: Michael Armbrust mich...@databricks.com
Subject: Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Date: Thu, 27 Nov 2014 03:20:14 GMT
In the past I have worked around this problem by avoiding sc.textFile().
Instead I read the data directly inside of a Spark job.  Basically, you
start with an RDD where each entry is a file in S3 and then flatMap that
with something that reads the files and returns the lines.

Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe

Using this class you can do something like:

sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... ::
Nil).flatMap(new ReadLinesSafe(_))

You can also build up the list of files by running a Spark job:
https://gist.github.com/marmbrus/15e72f7bc22337cf6653
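A rough sketch of the same idea in Scala, without the gist's ReadLinesSafe helper (the HDFS paths below are made up, and each file is assumed small enough to read into memory within a single task):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

// Hypothetical list of input files; in practice this could come from a directory listing.
val fileList = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001")

// One partition per file, so each file is opened and read inside its own task.
val lines = sc.parallelize(fileList, fileList.size).flatMap { pathStr =>
  val path = new Path(pathStr)
  val in = path.getFileSystem(new Configuration()).open(path)
  try Source.fromInputStream(in).getLines().toVector finally in.close()
}
```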

Michael

On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com wrote:
It’s a long story but there are many dirs with smallish part- files in them 
so we create a list of the individual files as input to 
sparkContext.textFile(fileList). I suppose we could move them and rename them 
to be contiguous part- files in one dir. Would that be better than passing 
in a long list of individual filenames? We could also make the part files much 
larger by collecting the smaller ones. But would any of this make a difference 
in IO speed?

I ask because using the long file list seems to read, what amounts to a not 
very large data set rather slowly. If it were all in large part files in one 
dir I’d expect it to go much faster but this is just intuition.


On Mar 14, 2015, at 9:58 AM, Koert Kuipers ko...@tresata.com wrote:

Why can you not put them in a directory and read them as one input? You will 
get a task per file, but Spark is very fast at executing many tasks (it's not a 
JVM per task).

On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel p...@occamsmachete.com wrote:
Any advice on dealing with a large number of separate input files?


On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote:

We have many text files that we need to read in parallel. We can create a comma 
delimited list of files to pass in to sparkContext.textFile(fileList). The list 
can get very large (maybe 1) and is all on hdfs.

The question is: what is the most performant way to read them? Should they be 
broken up and read in groups, appending the resulting RDDs, or should we just 
pass in the entire list at once? In effect I'm asking whether Spark does some 
optimization or whether we should do it explicitly. If the latter, what rule 
might we use depending on our cluster setup?

Re: deploying Spark on standalone cluster

2015-03-15 Thread tarek_abouzeid
I was having a similar issue, but it was in Spark and Flume integration: I was
getting a failed-to-bind error. I got it fixed by shutting down the firewall
on both machines (make sure: service iptables status = firewall stopped).







1.3 release

2015-03-15 Thread Eric Friedman
Is there a reason why the prebuilt releases don't include current CDH distros 
and YARN support?


Eric Friedman



Null Pointer Exception due to mapVertices function in GraphX

2015-03-15 Thread James
I have got a NullPointerException in aggregateMessages on a graph which is
the output of the mapVertices function of another graph. I found the problem is
that the mapVertices function did not affect all the triplets of the
graph.

// Initialize the graph: assign to each vertex a counter that contains the vertex id only
var anfGraph = graph.mapVertices { case (vid, _) =>
  val counter = new HyperLogLog(5)
  counter.offer(vid)
  counter
}

// There is an edge whose src attr is null
val nullVertex = anfGraph.triplets.filter(edge => edge.srcAttr == null).first

// But I could see that the vertex itself has a non-null attribute
anfGraph.vertices.filter(_._1 == nullVertex).first

messages = anfGraph.aggregateMessages(msgFun, mergeMessage)   // <- NullPointerException


My spark version:1.2.0


Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
In the LBFGS version of logistic regression, the data is properly
standardized, so this should not happen. Can you provide a copy of
your dataset to us so we can test it? If the dataset cannot be made
public, can you just send me a copy so I can dig into this? I'm
the author of LORWithLBFGS. Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Fri, Mar 13, 2015 at 2:41 PM, cjwang c...@cjwang.us wrote:
 I am running LogisticRegressionWithLBFGS.  I got these lines on my console:

 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to 0.5
 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to 0.25
 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to 0.125
 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.0625
 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.03125
 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.015625
 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.0078125
 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.005859375
 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,257 ERROR breeze.optimize.LBFGS | Failure! Resetting
 history: breeze.optimize.FirstOrderException: Line search zoom failed


 What causes them and how do I fix them?

 I checked my data and there seemed nothing out of the ordinary.  The
 resulting prediction model seemed acceptable to me.  So, are these ERRORs
 actually WARNINGs?  Could we or should we tune the level of these messages
 down one notch?






Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
Yes I don't think this is entirely reliable in general. I would emit
(label,features) pairs and then transform the values.

In practice, this may happen to work fine in simple cases.
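A minimal sketch of that suggestion in Scala (assuming data: RDD[LabeledPoint] and a fitted StandardScalerModel named scaler1, mirroring the docs example):

```scala
import org.apache.spark.SparkContext._   // for mapValues on pair RDDs (pre-1.3 style)

// Keep each label attached to its features, then transform only the values,
// so the label/feature pairing can never be broken by a later zip.
val pairs  = data.map(lp => (lp.label, lp.features))
val scaled = pairs.mapValues(v => scaler1.transform(v))
```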

On Sun, Mar 15, 2015 at 3:51 AM, kian.ho hui.kian.ho+sp...@gmail.com wrote:
 Hi, I was taking a look through the mllib examples in the official spark
 documentation and came across the following:
 http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2

 specifically the lines:

 label = data.map(lambda x: x.label)
 features = data.map(lambda x: x.features)
 ...
 ...
 data1 = label.zip(scaler1.transform(features))

 my question:
 wouldn't it be possible that some labels in the pairs returned by the
 label.zip(..) operation are not paired with their original features? i.e.
 are the original orderings of `labels` and `features` preserved after the
 scaler1.transform(..) and label.zip(..) operations?

 This issue was also mentioned in
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

 I would greatly appreciate some clarification on this, as I've run into this
 issue whilst experimenting with feature extraction for text classification,
 where (correct me if I'm wrong) there is no built-in mechanism to keep track
 of document-ids through the HashingTF and IDF fitting and transformations.

 Thanks.






Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Hi,

Can anyone who has developed a recommendation engine suggest a possible
software stack for such an application?

I am basically new to recommendation engines; I have just found Mahout and
Spark MLlib, which are available.
I am thinking of the software stack below.

1. The user is going to use an Android app.
2. A REST API call is sent from the Android app to the app server to get recommendations.
3. Spark MLlib as the core recommendation engine.
4. MongoDB as the database backend.

I would like to know more about the cluster configuration (how many nodes, etc.)
of the Spark part for calculating the recommendations for 500,000 items. These
items include products for day care, etc.

Other software stack suggestions would also be very useful. It has to run on
machines from multiple vendors.

Please suggest.

Thanks
shashi


Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread Cheng Lian
Spark SQL supports most commonly used features of HiveQL. However, 
different HiveQL statements are executed in different manners:


1. DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and
   commands (e.g. SET key = value, ADD FILE, ADD JAR, etc.)

   In most cases, Spark SQL simply delegates these statements to Hive,
   as they don’t need to issue any distributed jobs and don’t rely on
   the computation engine (Spark, MR, or Tez).

2. SELECT queries, CREATE TABLE ... AS SELECT ... statements and
   insertions

   These statements are executed using Spark as the execution engine.

The Hive classes packaged in the assembly jar are used to provide entry 
points to Hive features, for example:


1. HiveQL parser
2. Talking to Hive metastore to execute DDL statements
3. Accessing UDF/UDAF/UDTF

As for the differences between Hive on Spark and Spark SQL’s Hive 
support, please refer to this article by Reynold: 
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html


Cheng

On 3/14/15 10:53 AM, bit1...@163.com wrote:


Thanks Daoyuan.
What do you mean by running some native command? I never thought that 
Hive would run without a computing engine like Hadoop MR or Spark. Thanks.



bit1...@163.com

*From:* Wang, Daoyuan daoyuan.w...@intel.com
*Date:* 2015-03-13 16:39
*To:* bit1...@163.com; user user@spark.apache.org
*Subject:* RE: Explanation on the Hive in the Spark assembly

Hi bit1129,

1, hive in spark assembly removed most dependencies of hive on
Hadoop to avoid conflicts.

2, this hive is used to run some native command, which does not
rely on spark or mapreduce.

Thanks,

Daoyuan

*From:*bit1...@163.com [mailto:bit1...@163.com]
*Sent:* Friday, March 13, 2015 4:24 PM
*To:* user
*Subject:* Explanation on the Hive in the Spark assembly

Hi, sparkers,

I am kind of confused about hive in the spark assembly.  I think
hive in the spark assembly is not the same thing as Hive On
Spark(Hive On Spark,  is meant to run hive using spark execution
engine).

So, my question is:

1. What is the difference between Hive in the spark assembly and
Hive on Hadoop?

2. Does Hive in the spark assembly use Spark execution engine or
Hadoop MR engine?

Thanks.



bit1...@163.com mailto:bit1...@163.com




Benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs Spark SQL

2015-03-15 Thread Slim Baltagi
Hi

I would like to share with you my comments on Hortonworks' benchmarks of
'Hive on Tez' vs 'Hive on Spark' vs 'Spark SQL'.
Please check them in my related blog entry at http://goo.gl/K5mk0U

Thanks

Slim Baltagi
Chicago, IL
http://www.SparkBigData.com 






Re: Problem connecting to HBase

2015-03-15 Thread HARIPRIYA AYYALASOMAYAJULA
Hello all,

Thank you for your responses. I did try to include the
zookeeper.znode.parent property in the hbase-site.xml. It still continues
to give the same error.

I am using Spark 1.2.0 and hbase 0.98.9.

Could you please suggest what else could be done?


On Fri, Mar 13, 2015 at 10:25 PM, Ted Yu yuzhih...@gmail.com wrote:

 In HBaseTest.scala:
 val conf = HBaseConfiguration.create()
 You can add some log (for zookeeper.znode.parent, e.g.) to see if the
 values from hbase-site.xml are picked up correctly.

 Please use pastebin next time you want to post errors.

 Which Spark release are you using ?
 I assume it contains SPARK-1297

 Cheers

 On Fri, Mar 13, 2015 at 7:47 PM, HARIPRIYA AYYALASOMAYAJULA 
 aharipriy...@gmail.com wrote:


 Hello,

 I am running a HBase test case. I am using the example from the following:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

 I created a very small HBase table with 5 rows and 2 columns.
 I have attached a screenshot of the error log. I believe it is a problem
 where the driver program is unable to establish connection to the hbase.

 The following is my simple.sbt:

 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(

  "org.apache.spark" %% "spark-core" % "1.2.0",

  "org.apache.hbase" % "hbase" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-client" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-server" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-common" % "0.98.9-hadoop2" % "provided"
 )

 I am using a 23 node cluster, did copy hbase-site.xml into /spark/conf
 folder
 and set spark.executor.extraClassPath pointing to the /hbase/ folder in
 the spark-defaults.conf

 Also, while submitting the spark job I am including the required jars :

 spark-submit --class HBaseTest --master yarn-cluster
 --driver-class-path
  
 /opt/hbase/0.98.9/lib/hbase-server-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-protocol-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-hadoop2-compat-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-client-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-common-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/htrace-core-2.04.jar
  /home/priya/usingHBase/Spark/target/scala-2.10/simple-project_2.10-1.0.jar
 /Priya/sparkhbase-test1

 It would be great if you could point where I am going wrong, and what
 could be done to correct it.

 Thank you for your time.
 --
 Regards,
 Haripriya Ayyalasomayajula
 Graduate Student
 Department of Computer Science
 University of Houston
 Contact : 650-796-7112







-- 
Regards,
Haripriya Ayyalasomayajula
Graduate Student
Department of Computer Science
University of Houston
Contact : 650-796-7112


Re: Running spark function on parquet without sql

2015-03-15 Thread Cheng Lian
That's an unfortunate documentation bug in the programming guide... We 
failed to update it after making the change.


Cheng

On 2/28/15 8:13 AM, Deborah Siegel wrote:

Hi Michael,

Would you help me understand  the apparent difference here..

The Spark 1.2.1 programming guide indicates:

Note that if you call schemaRDD.cache() rather than 
sqlContext.cacheTable(...), tables will *not* be cached using the 
in-memory columnar format, and therefore 
sqlContext.cacheTable(...) is strongly recommended for this use case.


Yet the API doc (https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html) shows:

    def cache(): SchemaRDD.this.type

    Overridden cache function will always use the in-memory
    columnar caching.



links
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust 
mich...@databricks.com wrote:


From Zhan Zhang's reply, yes I still get the parquet's advantage.

You will need to at least use SQL or the DataFrame API (coming in
Spark 1.3) to specify the columns that you want in order to get
the parquet benefits.   The rest of your operations can be
standard Spark.

My next question is, if I operate on SchemaRdd will I get the
advantage of
Spark SQL's in memory columnar store when cached the table using
cacheTable()?


Yes, SchemaRDDs always use the in-memory columnar cache for
cacheTable and .cache() since Spark 1.2+
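So, for example, both of the following end up in the columnar cache on 1.2+ (a sketch; the file and table names are made up):

```scala
val schemaRDD = sqlContext.parquetFile("events.parquet")
schemaRDD.registerTempTable("events")

sqlContext.cacheTable("events")   // in-memory columnar cache
schemaRDD.cache()                 // SchemaRDD overrides cache(), so this is columnar too since 1.2
```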






Re: Issue with yarn cluster - hangs in accepted state.

2015-03-15 Thread abhi
Thanks,
It worked.

-Abhi

On Tue, Mar 3, 2015 at 5:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote:

  Do you have enough resource in your cluster? You can check your resource
 manager to see the usage.


 Yep, I can confirm that this is a very annoying issue. If there is not
 enough memory or VCPUs available, your app will just stay in ACCEPTED state
 until resources are available.

 You can have a look at

 https://github.com/jubatus/jubaql-docker/blob/master/hadoop/yarn-site.xml#L35
 to see some settings that might help.

 Tobias





Re: Writing wide parquet file in Spark SQL

2015-03-15 Thread Cheng Lian
This article by Ryan Blue should be helpful to understand the problem 
http://ingest.tips/2015/01/31/parquet-row-group-size/


The TL;DR is, you may decrease parquet.block.size to reduce memory 
consumption. Anyway, 100K columns is a really big burden for Parquet, 
but I guess your data should be pretty sparse.
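A sketch of that tweak, assuming the saveAsParquetFile call from the original post; the 64 MB value is only illustrative:

```scala
// Smaller row groups mean less data buffered in memory per open column writer.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
schemaRDD.saveAsParquetFile("/tmp/wide-table.parquet")
```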


Cheng

On 3/11/15 4:13 AM, kpeng1 wrote:


Hi All,

I am currently trying to write a very wide file into parquet using spark
sql.  I have 100K column records that I am trying to write out, but of
course I am running into space issues(out of memory - heap space).  I was
wondering if there are any tweaks or work arounds for this.

I am basically calling saveAsParquetFile on the schemaRDD.










Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Thanks Nick, for your suggestions.

On Sun, Mar 15, 2015 at 10:41 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 As Sean says, precomputing recommendations is pretty inefficient. Though
 with 500k items its easy to get all the item vectors in memory so
 pre-computing is not too bad.

 Still, since you plan to serve these via a REST service anyway, computing
 on demand via a serving layer such as Oryx or PredictionIO (or the newly
 open sourced Seldon.io) is a good option. You can also cache the
 recommendations quite aggressively - once you compute a user or item top-K
 list, just stick the result in mem cache / redis / whatever and evict it
 when you recompute your offline model, or every hour or whatever.


 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Thanks Sean, your suggestions and the links provided are just what I
 needed to start off with.

 On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote:

 I think you're assuming that you will pre-compute recommendations and
 store them in Mongo. That's one way to go, with certain tradeoffs. You
 can precompute offline easily, and serve results at large scale
 easily, but, you are forced to precompute everything -- lots of wasted
 effort, not completely up to date.

 The front-end part of the stack looks right.

 Spark would do the model building; you'd have to write a process to
 score recommendations and store the result. Mahout is the same thing,
 really.

 500K items isn't all that large. Your requirements aren't driven just
 by items though. Number of users and latent features matter too. It
 matters how often you want to build the model too. I'm guessing you
 would get away with a handful of modern machines for a problem this
 size.


 In a way what you describe reminds me of Wibidata, since it built
 recommender-like solutions on top of data and results published to a
 NoSQL store. You might glance at the related OSS project Kiji
 (http://kiji.org/) for ideas about how to manage the schema.

 You should have a look at things like Nick's architecture for
 Graphflow, however it's more concerned with computing recommendation
 on the fly, and describes a shift from an architecture originally
 built around something like a NoSQL store:

 http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

 This is also the kind of ground the oryx project is intended to cover,
 something I've worked on personally:
 https://github.com/OryxProject/oryx   -- a layer on and around the
 core model building in Spark + Spark Streaming to provide a whole
 recommender (for example), down to the REST API.

 On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
 raoshashidhar...@gmail.com wrote:
  Hi,
 
  Can anyone who has developed recommendation engine suggest what could
 be the
  possible software stack for such an application.
 
  I am basically new to recommendation engine , I just found out Mahout
 and
  Spark Mlib which are available .
  I am thinking the below software stack.
 
  1. The user is going to use Android app.
  2.  Rest Api sent to app server from the android app to get
 recommendations.
  3. Spark Mlib core engine for recommendation engine
  4. MongoDB database backend.
 
  I would like to know more on the cluster configuration( how many nodes
 etc)
  part of spark for calculating the recommendations for 500,000 items.
 This
  items include products for day care etc.
 
  Other software stack suggestions would also be very useful.It has to
 run on
  multiple vendor machines.
 
  Please suggest.
 
  Thanks
  shashi






Re: Streaming linear regression example question

2015-03-15 Thread Margus Roo

Hi again,

I tried the same 
examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala 
from 1.3.0.

In case the testing file content is:
  (0.0,[3.0,4.0,3.0])
  (0.0,[4.0,4.0,4.0])
  (4.0,[5.0,5.0,5.0])
  (5.0,[5.0,6.0,6.0])
  (6.0,[7.0,4.0,7.0])
  (7.0,[8.0,6.0,8.0])
  (8.0,[44.0,9.0,9.0])
  (9.0,[14.0,30.0,10.0])

and the answer:
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(4.0,0.0)
(5.0,0.0)
(6.0,0.0)
(7.0,0.0)
(8.0,0.0)
(9.0,0.0)

What is wrong?
I can see that the model's weights are changing when I put new data into 
the training dir.


Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 480

On 14/03/15 09:05, Margus Roo wrote:

Hi

I try to understand example provided in 
https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - 
Streaming linear regression


Code:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

object StreamingLinReg {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("StreamLinReg").setMaster("local[2]")

    val ssc = new StreamingContext(conf, Seconds(10))

    val trainingData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/training/").map(LabeledPoint.parse).cache()

    val testData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/testing/").map(LabeledPoint.parse)

    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()

  }

}

I compiled the code and ran it.
I put a file containing
  (1.0,[2.0,2.0,2.0])
  (2.0,[3.0,3.0,3.0])
  (3.0,[4.0,4.0,4.0])
  (4.0,[5.0,5.0,5.0])
  (5.0,[6.0,6.0,6.0])
  (6.0,[7.0,7.0,7.0])
  (7.0,[8.0,8.0,8.0])
  (8.0,[9.0,9.0,9.0])
  (9.0,[10.0,10.0,10.0])
into the training directory.

I can see that the model's weights change:
15/03/14 08:53:40 INFO StreamingLinearRegressionWithSGD: Current model: weights, [7.333,7.333,7.333]


Now I can put whatever I want into the testing directory, but I cannot 
understand the answer.
For example, I can put the same file I used for training into the testing 
directory. The file content is

  (1.0,[2.0,2.0,2.0])
  (2.0,[3.0,3.0,3.0])
  (3.0,[4.0,4.0,4.0])
  (4.0,[5.0,5.0,5.0])
  (5.0,[6.0,6.0,6.0])
  (6.0,[7.0,7.0,7.0])
  (7.0,[8.0,8.0,8.0])
  (8.0,[9.0,9.0,9.0])
  (9.0,[10.0,10.0,10.0])

And the answer will be
(1.0,0.0)
(2.0,0.0)
(3.0,0.0)
(4.0,0.0)
(5.0,0.0)
(6.0,0.0)
(7.0,0.0)
(8.0,0.0)
(9.0,0.0)

And in case my file content is
  (0.0,[2.0,2.0,2.0])
  (0.0,[3.0,3.0,3.0])
  (0.0,[4.0,4.0,4.0])
  (0.0,[5.0,5.0,5.0])
  (0.0,[6.0,6.0,6.0])
  (0.0,[7.0,7.0,7.0])
  (0.0,[8.0,8.0,8.0])
  (0.0,[9.0,9.0,9.0])
  (0.0,[10.0,10.0,10.0])

the answer will be:
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)

I expect to get the label predicted by the model.
--
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 480




Submitting spark application using Yarn Rest API

2015-03-15 Thread Srini Karri
Hi All,

I am trying to submit a Spark application using the YARN REST API. I am able
to submit the application, but the final status shows as 'UNDEFINED'. A couple of
other observations:

User shows as Dr.who
Application type is empty though I specify it as Spark

Has anyone had this problem before?

I am creating the app id using:

http://{cluster host}/ws/v1/cluster/apps/new-application

Post Request Url: http://{cluster host}/ws/v1/cluster/apps

Here is the request body:

{
  "application-id": "application_1426273041023_0055",
  "application-name": "test",
  "am-container-spec": {
    "credentials": {
      "secrets": {
        "entry": [
          {
            "key": "user.name",
            "value": "x"
          }
        ]
      }
    },
    "commands": {
      "command": "%SPARK_HOME%/bin/spark-submit.cmd --class org.apache.spark.examples.SparkPi --conf spark.yarn.jar=hdfs://xxx/apps/spark/spark-1.2.1-hadoop2.6/spark-assembly-1.2.1-hadoop2.6.0.jar --master yarn-cluster hdfs://xxx/apps/spark/spark-1.2.1-hadoop2.6/spark-examples-1.2.1-hadoop2.6.0.jar"
    }
  },
  "application-type": "SPARK"
}


Re: Read Parquet file from scala directly

2015-03-15 Thread Cheng Lian

The parquet-tools code should be pretty helpful (although it's Java)

https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools/src/main/java/parquet/tools/command
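If you just need to iterate over records from Scala, here is a rough sketch along the same lines (it uses the pre-rename parquet.* packages that parquet-mr shipped at the time; the package names and the ParquetReader.builder call are assumptions to double-check against your Parquet version):

```scala
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetReader
import parquet.hadoop.example.GroupReadSupport
import parquet.example.data.Group

val reader = ParquetReader.builder(new GroupReadSupport(), new Path("/path/to/file.parquet")).build()
// read() returns null once the file is exhausted.
var record: Group = reader.read()
while (record != null) {
  println(record)   // Group.toString dumps field names and values
  record = reader.read()
}
reader.close()
```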

On 3/10/15 12:25 AM, Shuai Zheng wrote:


Hi All,

I have a lot of parquet files, and I try to open them directly instead 
of load them into RDD in driver (so I can optimize some performance 
through special logic).


But I do some research online and can’t find any example to access 
parquet directly from scala, anyone has done this before?


Regards,

Shuai





Re: From Spark web ui, how to prove the parquet column pruning working

2015-03-15 Thread Cheng Lian

Hey Yong,

It seems that Hadoop `FileSystem` adds the size of a block to the 
metrics even if you only touch a fraction of it (reading Parquet 
metadata for example). This behavior can be verified by the following 
snippet:


```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sc._
import sqlContext._

case class KeyValue(key: Int, value: String)

parallelize(1 to 1024 * 1024 * 20).
  flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

hadoopConfiguration.set("parquet.task.side.metadata", "true")
sql("SET spark.sql.parquet.filterPushdown=true")

parquetFile("large.parquet").where('key === 0).queryExecution.toRdd.mapPartitions { _ =>
  new Iterator[Row] {
    def hasNext = false
    def next() = ???
  }
}.collect()
```

Apparently we’re reading nothing here (except for Parquet metadata in 
the footers), but the web UI still suggests that the input size of all 
tasks equals the file size.


Cheng


On 3/10/15 3:15 AM, java8964 wrote:
Hi, currently most of the data in our production is using Avro + 
Snappy. I want to test the benefits if we store the data in Parquet 
format. I changed our ETL to generate the Parquet format, instead 
of Avro, and want to test a simple SQL query in Spark SQL, to verify the 
benefits from Parquet.


I generated the same dataset in both Avro and Parquet in HDFS, and 
loaded them both in Spark SQL. Now when I run the same query, like 
select column1 from src_table_avro/parquet where column2=xxx, I can see 
that for the Parquet data format the job runs much faster. The test 
file sizes for both formats are around 930M. The Avro job generated 8 
tasks to read the data with 21s as the median duration, vs. the Parquet job, 
which generated 7 tasks to read the data with 0.4s as the median duration.


Since the dataset has more than 100 columns, I can see the Parquet 
file really does come with a fast read. But my question is that, from the 
Spark UI, both jobs show 900M as the input size, and 0 for the rest; in 
this case, how do I know column pruning really works? I think it is 
the reason the Parquet file can be read so fast, but is there any 
statistic that can prove that to me on the Spark UI? Something like: the 
total input file size is 900M, but only 10M was really read due to column 
pruning? That way, in case column pruning does not work in Parquet for 
some kind of SQL query, I can identify it in the first place.


Thanks

Yong




Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Nick Pentreath
As Sean says, precomputing recommendations is pretty inefficient. Though with 
500k items it's easy to get all the item vectors in memory, so pre-computing is 
not too bad.




Still, since you plan to serve these via a REST service anyway, computing on 
demand via a serving layer such as Oryx or PredictionIO (or the newly open 
sourced Seldon.io) is a good option. You can also cache the recommendations 
quite aggressively - once you compute a user or item top-K list, just stick the 
result in mem cache / redis / whatever and evict it when you recompute your 
offline model, or every hour or whatever.






—
Sent from Mailbox

On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao
raoshashidhar...@gmail.com wrote:

 Thanks Sean, your suggestions and the links provided are just what I needed
 to start off with.
 On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote:
 I think you're assuming that you will pre-compute recommendations and
 store them in Mongo. That's one way to go, with certain tradeoffs. You
 can precompute offline easily, and serve results at large scale
 easily, but, you are forced to precompute everything -- lots of wasted
 effort, not completely up to date.

 The front-end part of the stack looks right.

 Spark would do the model building; you'd have to write a process to
 score recommendations and store the result. Mahout is the same thing,
 really.

 500K items isn't all that large. Your requirements aren't driven just
 by items though. Number of users and latent features matter too. It
 matters how often you want to build the model too. I'm guessing you
 would get away with a handful of modern machines for a problem this
 size.


 In a way what you describe reminds me of Wibidata, since it built
 recommender-like solutions on top of data and results published to a
 NoSQL store. You might glance at the related OSS project Kiji
 (http://kiji.org/) for ideas about how to manage the schema.

 You should have a look at things like Nick's architecture for
 Graphflow, however it's more concerned with computing recommendation
 on the fly, and describes a shift from an architecture originally
 built around something like a NoSQL store:

 http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

 This is also the kind of ground the oryx project is intended to cover,
 something I've worked on personally:
 https://github.com/OryxProject/oryx   -- a layer on and around the
 core model building in Spark + Spark Streaming to provide a whole
 recommender (for example), down to the REST API.

 On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
 raoshashidhar...@gmail.com wrote:
  Hi,
 
  Can anyone who has developed recommendation engine suggest what could be
 the
  possible software stack for such an application.
 
  I am basically new to recommendation engine , I just found out Mahout and
  Spark Mlib which are available .
  I am thinking the below software stack.
 
  1. The user is going to use Android app.
  2.  Rest Api sent to app server from the android app to get
 recommendations.
  3. Spark Mlib core engine for recommendation engine
  4. MongoDB database backend.
 
  I would like to know more on the cluster configuration( how many nodes
 etc)
  part of spark for calculating the recommendations for 500,000 items. This
  items include products for day care etc.
 
  Other software stack suggestions would also be very useful.It has to run
 on
  multiple vendor machines.
 
  Please suggest.
 
  Thanks
  shashi


Re: Is there any problem in having a long opened connection to spark sql thrift server

2015-03-15 Thread Cheng Lian
It should be OK. If you encounter problems with a long-lived open 
connection to the Thrift server, that would be a bug.


Cheng

On 3/9/15 6:41 PM, fanooos wrote:

I have some applications developed using PHP and currently we have a problem
in connecting these applications to spark sql thrift server. ( Here is the
problem I am talking about.
http://apache-spark-user-list.1001560.n3.nabble.com/Connection-PHP-application-to-Spark-Sql-thrift-server-td21925.html
)


Until I find a solution to this problem, there is a suggestion to make a
little java application that connects to spark sql thrift server and provide
an API to the PHP applications to executes the required queries.

From my little experience, opening a connection and closing it for each
query is not recommended (I am talking from my experience in working with
CRUD applications the deals with some kind of database).

1- Is the same recommendation applied to working with spark sql thrift
server ?
2- If yes, Is there any problem in having one connection connected for a
long time with the spark sql thrift server?






Re: Problem connecting to HBase

2015-03-15 Thread Ted Yu
 org.apache.hbase % hbase % 0.98.9-hadoop2 % provided,

There is no module in HBase 0.98.9 called hbase. But this would not be the
root cause of the error.

Most likely hbase-site.xml was not picked up, meaning this is a classpath
issue.
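A quick way to check that (just a sketch): log what the driver actually resolves before creating the connection.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration

val conf = HBaseConfiguration.create()
// If hbase-site.xml was picked up, these should print your cluster's values
// rather than the defaults ("/hbase" and "localhost").
println("zookeeper.znode.parent = " + conf.get("zookeeper.znode.parent"))
println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"))
```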

On Sun, Mar 15, 2015 at 10:04 AM, HARIPRIYA AYYALASOMAYAJULA 
aharipriy...@gmail.com wrote:

 Hello all,

 Thank you for your responses. I did try to include the
 zookeeper.znode.parent property in the hbase-site.xml. It still continues
 to give the same error.

 I am using Spark 1.2.0 and hbase 0.98.9.

 Could you please suggest what else could be done?


 On Fri, Mar 13, 2015 at 10:25 PM, Ted Yu yuzhih...@gmail.com wrote:

 In HBaseTest.scala:
 val conf = HBaseConfiguration.create()
 You can add some log (for zookeeper.znode.parent, e.g.) to see if the
 values from hbase-site.xml are picked up correctly.

 Please use pastebin next time you want to post errors.

 Which Spark release are you using ?
 I assume it contains SPARK-1297

 Cheers

 On Fri, Mar 13, 2015 at 7:47 PM, HARIPRIYA AYYALASOMAYAJULA 
 aharipriy...@gmail.com wrote:


 Hello,

 I am running a HBase test case. I am using the example from the
 following:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

 I created a very small HBase table with 5 rows and 2 columns.
 I have attached a screenshot of the error log. I believe it is a problem
 where the driver program is unable to establish connection to the hbase.

 The following is my simple.sbt:

 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(

  "org.apache.spark" %% "spark-core" % "1.2.0",

  "org.apache.hbase" % "hbase" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-client" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-server" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-common" % "0.98.9-hadoop2" % "provided"
 )

 I am using a 23 node cluster, did copy hbase-site.xml into /spark/conf
 folder
 and set spark.executor.extraClassPath pointing to the /hbase/ folder in
 the spark-defaults.conf

 Also, while submitting the spark job I am including the required jars :

 spark-submit --class HBaseTest --master yarn-cluster
 --driver-class-path
  
 /opt/hbase/0.98.9/lib/hbase-server-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-protocol-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-hadoop2-compat-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-client-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-common-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/htrace-core-2.04.jar
  /home/priya/usingHBase/Spark/target/scala-2.10/simple-project_2.10-1.0.jar
 /Priya/sparkhbase-test1

 It would be great if you could point where I am going wrong, and what
 could be done to correct it.

 Thank you for your time.
 --
 Regards,
 Haripriya Ayyalasomayajula
 Graduate Student
 Department of Computer Science
 University of Houston
 Contact : 650-796-7112







 --
 Regards,
 Haripriya Ayyalasomayajula
 Graduate Student
 Department of Computer Science
 University of Houston
 Contact : 650-796-7112



Re: Spark 1.2 – How to change Default (Random) port ….

2015-03-15 Thread Shailesh Birari
Hi SM,

Apologies for the delayed response. 
No, the issue is with Spark 1.2.0; there is a bug in Spark 1.2.0.
Spark recently had its 1.3.0 release, so it might be fixed there.
I am not planning to test it soon, maybe after some time.
You can try it.

Regards,
  Shailesh






Re: Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread bit1...@163.com
Thanks Cheng for the great explanation!



bit1...@163.com
 
From: Cheng Lian
Date: 2015-03-16 00:53
To: bit1...@163.com; Wang, Daoyuan; user
Subject: Re: Explanation on the Hive in the Spark assembly
Spark SQL supports most commonly used features of HiveQL. However, different
HiveQL statements are executed in different manners:

1. DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and commands (e.g. SET
key = value, ADD FILE, ADD JAR, etc.): in most cases, Spark SQL simply delegates
these statements to Hive, as they don’t need to issue any distributed jobs and
don’t rely on the computation engine (Spark, MR, or Tez).

2. SELECT queries, CREATE TABLE ... AS SELECT ... statements, and insertions:
these statements are executed using Spark as the execution engine.

The Hive classes packaged in the assembly jar are used to provide entry points
to Hive features, for example:

- the HiveQL parser
- talking to the Hive metastore to execute DDL statements
- accessing UDFs/UDAFs/UDTFs

As for the differences between Hive on Spark and Spark SQL’s Hive support,
please refer to this article by Reynold:
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
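
For example, a minimal sketch of the two cases above through a HiveContext (the
table name and query are made up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("HiveQLExecutionDemo"))
  val hiveContext = new HiveContext(sc)

  // DDL: delegated to Hive; no distributed job is issued
  hiveContext.sql("CREATE TABLE IF NOT EXISTS demo_src (key INT, value STRING)")

  // Query: planned and executed with Spark as the engine
  hiveContext.sql("SELECT key, count(*) FROM demo_src GROUP BY key")
    .collect().foreach(println)
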
Cheng
On 3/14/15 10:53 AM, bit1...@163.com wrote:
Thanks Daoyuan. 
What do you mean by running some native commands? I never thought that Hive would
run without a computing engine like Hadoop MR or Spark. Thanks.



bit1...@163.com
 
From: Wang, Daoyuan
Date: 2015-03-13 16:39
To: bit1...@163.com; user
Subject: RE: Explanation on the Hive in the Spark assembly
Hi bit1129,
 
1. The Hive in the Spark assembly has most of Hive's dependencies on Hadoop removed
to avoid conflicts.
2. This Hive is used to run some native commands, which do not rely on Spark
or MapReduce.
 
Thanks,
Daoyuan
 
From: bit1...@163.com [mailto:bit1...@163.com] 
Sent: Friday, March 13, 2015 4:24 PM
To: user
Subject: Explanation on the Hive in the Spark assembly
 
Hi, sparkers,
 
I am kind of confused about the Hive in the Spark assembly. I think the Hive in the
Spark assembly is not the same thing as Hive on Spark (Hive on Spark is meant
to run Hive using the Spark execution engine).
So, my questions are:
1. What is the difference between the Hive in the Spark assembly and Hive on
Hadoop?
2. Does the Hive in the Spark assembly use the Spark execution engine or the Hadoop
MR engine?
Thanks.
 
 


bit1...@163.com
​


Re: RE: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks, Jerry
Got it, I will do it that way. I just wanted to make sure whether there is an option
to directly specify the Tachyon version.




fightf...@163.com
 
From: Shao, Saisai
Date: 2015-03-16 11:10
To: fightf...@163.com
CC: user
Subject: RE: Building spark over specified tachyon
I think you could change the pom file under Spark project to update the Tachyon 
related dependency version and rebuild it again (in case API is compatible, and 
behavior is the same).
 
I’m not sure is there any command you can use to compile against Tachyon 
version.
 
Thanks
Jerry
 
From: fightf...@163.com [mailto:fightf...@163.com] 
Sent: Monday, March 16, 2015 11:01 AM
To: user
Subject: Building spark over specified tachyon
 
Hi, all
Noting that the current spark releases are built-in with tachyon 0.5.0 ,
if we want to recompile spark with maven and targeting on specific tachyon 
version (let's say the most recent 0.6.0 release),
how should that be done? What maven compile command should be like ?
 
Thanks,
Sun. 
 


fightf...@163.com


Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data?

On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote:

 Hi,

 I am trying to run  LogisticRegressionWithSGD on RDD of LabeledPoints
 loaded using loadLibSVMFile:

 val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
 "s3n://logistic-regression/epsilon_normalized")

 val model = LogisticRegressionWithSGD.train(logistic, 100)

 It gives an input validation error after about 10 minutes:

 org.apache.spark.SparkException: Input validation failed.
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162)
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192)

 From reading this bug report (
 https://issues.apache.org/jira/browse/SPARK-2575) since I am loading
 LibSVM format file there should be only 0/1 in the dataset and should not
 be facing the issue in the bug report. Is there something else I'm missing
 here?

 Thanks!



Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Sparkers,



I am not able to run spark-sql on Spark. Please find the following error:

 Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient


Regards,
Sandeep.v


Re: Slides of my talk in LA: 'Spark or Hadoop: is it an either-or proposition?'

2015-03-15 Thread Slim Baltagi
Hi

The video recording of this talk, titled 'Spark or Hadoop: is it an either-or
proposition?', given at the Los Angeles Spark Users Group on March 12, 2015, is now
available on YouTube at this link: http://goo.gl/0iJZ4n

Thanks

Slim Baltagi
http://www.SparkBigData.com 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Slides-of-my-talk-in-LA-Spark-or-Hadoop-is-it-an-either-or-proposition-tp22061p22069.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread Ted Yu
Can you provide more information?
Such as:
Version of Spark you're using
Command line 

Thanks



 On Mar 15, 2015, at 9:51 PM, sandeep vura sandeepv...@gmail.com wrote:
 
 Hi Sparkers,
 
 
 
 I couldn't able to run spark-sql on spark.Please find the following error 
 
  Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
 
 Regards,
 Sandeep.v

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Building spark over specified tachyon

2015-03-15 Thread Shao, Saisai
I think you could change the pom file under the Spark project to update the
Tachyon-related dependency version and rebuild it (assuming the API is compatible
and the behavior is the same).

I'm not sure whether there is any command you can use to compile against a specific
Tachyon version.

Thanks
Jerry

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Monday, March 16, 2015 11:01 AM
To: user
Subject: Building spark over specified tachyon

Hi, all
Noting that the current spark releases are built-in with tachyon 0.5.0 ,
if we want to recompile spark with maven and targeting on specific tachyon 
version (let's say the most recent 0.6.0 release),
how should that be done? What maven compile command should be like ?

Thanks,
Sun.


fightf...@163.commailto:fightf...@163.com


k-means hang without error/warning

2015-03-15 Thread Xi Shen
Hi,

I am running k-means using Spark in local mode. My data set is about 30k
records, and I set the k = 1000.

The algorithm starts and finishes 13 jobs according to the UI monitor, then
it stops working.

The last log I saw was:

[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
broadcast *16*

There are many similar log lines repeated, but it seems to always stop at the 16th.

If I lower the *k* value, the algorithm terminates. So I
just want to know what's wrong with *k=1000*.
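
For context, a minimal sketch of the kind of invocation being described (the input
path, parsing, and iteration count below are placeholders, not details from this
report):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Placeholder input: one space-separated point per line
  val points = sc.textFile("data/points.txt")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  // ~30k records with k = 1000 as described; maxIterations (20) is a placeholder
  val model = KMeans.train(points, 1000, 20)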


Thanks,
David


Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rohit U
I checked the labels across the entire dataset and it looks like it has -1
and 1 (not the 0 and 1 I originally expected). I will try replacing the -1
with 0 and run it again.
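
For what it's worth, a minimal sketch of that remapping (reusing the variable names
from the snippet below; the name binaryLabeled is made up):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint

  // Map the -1/+1 labels in the loaded data to the 0/1 labels the trainer expects
  val binaryLabeled = logistic.map { lp =>
    LabeledPoint(if (lp.label == -1.0) 0.0 else lp.label, lp.features)
  }
  val model = LogisticRegressionWithSGD.train(binaryLabeled, 100)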

On Mon, Mar 16, 2015 at 12:51 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Can you share some sample data?

 On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote:

 Hi,

 I am trying to run  LogisticRegressionWithSGD on RDD of LabeledPoints
 loaded using loadLibSVMFile:

 val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
 "s3n://logistic-regression/epsilon_normalized")

 val model = LogisticRegressionWithSGD.train(logistic, 100)

 It gives an input validation error after about 10 minutes:

 org.apache.spark.SparkException: Input validation failed.
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162)
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192)

 From reading this bug report (
 https://issues.apache.org/jira/browse/SPARK-2575) since I am loading
 LibSVM format file there should be only 0/1 in the dataset and should not
 be facing the issue in the bug report. Is there something else I'm missing
 here?

 Thanks!





Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted,

I am using Spark 1.2.1 and Hive 0.13.1; you can check my configuration
files attached below.


ERROR IN SPARK

n: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:101)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:622)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
        ... 9 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
        ... 14 more
Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
        at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:622)
        at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
        at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
        at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
        at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475)
        at

Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Hi, all
Noting that the current Spark releases are built with Tachyon 0.5.0:
if we want to recompile Spark with Maven, targeting a specific Tachyon
version (let's say the most recent 0.6.0 release),
how should that be done? What should the Maven compile command look like?

Thanks,
Sun. 



fightf...@163.com


Re: Trouble launching application that reads files

2015-03-15 Thread robert.tunney
I figured out how to use local files with file://, but not with either the
persistent or the ephemeral HDFS.
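
For reference, a minimal sketch of the two URI forms being contrasted, assuming an
existing SparkContext sc (the paths and the namenode host/port are placeholders, not
values from this setup):

  // Local file on the node's filesystem
  val localLines = sc.textFile("file:///home/ec2-user/data/input.txt")

  // File on the cluster's HDFS (persistent or ephemeral), addressed by namenode URI
  val hdfsLines = sc.textFile("hdfs://namenode-host:9000/data/input.txt")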



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-launching-application-that-reads-files-tp22065p22068.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks haoyuan.




fightf...@163.com
 
From: Haoyuan Li
Date: 2015-03-16 12:59
To: fightf...@163.com
CC: Shao, Saisai; user
Subject: Re: RE: Building spark over specified tachyon
Here is a patch: https://github.com/apache/spark/pull/4867

On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com fightf...@163.com wrote:
Thanks, Jerry
I got that way. Just to make sure whether there can be some option to directly 
specifying tachyon version.




fightf...@163.com
 
From: Shao, Saisai
Date: 2015-03-16 11:10
To: fightf...@163.com
CC: user
Subject: RE: Building spark over specified tachyon
I think you could change the pom file under Spark project to update the Tachyon 
related dependency version and rebuild it again (in case API is compatible, and 
behavior is the same).
 
I’m not sure is there any command you can use to compile against Tachyon 
version.
 
Thanks
Jerry
 
From: fightf...@163.com [mailto:fightf...@163.com] 
Sent: Monday, March 16, 2015 11:01 AM
To: user
Subject: Building spark over specified tachyon
 
Hi, all
Noting that the current spark releases are built-in with tachyon 0.5.0 ,
if we want to recompile spark with maven and targeting on specific tachyon 
version (let's say the most recent 0.6.0 release),
how should that be done? What maven compile command should be like ?
 
Thanks,
Sun. 
 


fightf...@163.com



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Streaming linear regression example question

2015-03-15 Thread Jeremy Freeman
Hi Margus, thanks for reporting this. I’ve been able to reproduce it, and there
does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, which can
hopefully be included in 1.3.1.

In the meantime, you can get the desired result using transform:

 model.trainOn(trainingData)
 
 testingData.transform { rdd =>
   val latest = model.latestModel()
   rdd.map(lp => (lp.label, latest.predict(lp.features)))
 }.print()

-
jeremyfreeman.net
@thefreemanlab

On Mar 15, 2015, at 2:56 PM, Margus Roo mar...@roo.ee wrote:

 Hi again
 
 I tried the same
 examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
 from 1.3.0, and when the testing file content is:
   (0.0,[3.0,4.0,3.0])
   (0.0,[4.0,4.0,4.0])
   (4.0,[5.0,5.0,5.0])
   (5.0,[5.0,6.0,6.0])
   (6.0,[7.0,4.0,7.0])
   (7.0,[8.0,6.0,8.0])
   (8.0,[44.0,9.0,9.0])
   (9.0,[14.0,30.0,10.0])
 
 and the answer:
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (4.0,0.0)
 (5.0,0.0)
 (6.0,0.0)
 (7.0,0.0)
 (8.0,0.0)
 (9.0,0.0)
 
 What is wrong?
 I can see that the model's weights change when I put new data into the
 training directory.
 Margus (margusja) Roo
 http://margus.roo.ee
 skype: margusja
 +372 51 480
 On 14/03/15 09:05, Margus Roo wrote:
 Hi
 
 I try to understand example provided in 
 https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - Streaming 
 linear regression
 
 Code:
 import org.apache.spark._
 import org.apache.spark.streaming._
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.dstream.DStream
 
 object StreamingLinReg {
 
   def main(args: Array[String]) {
   
 val conf = new 
 SparkConf().setAppName("StreamLinReg").setMaster("local[2]")
 val ssc = new StreamingContext(conf, Seconds(10))
   
 
 val trainingData = 
 ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/training/").map(LabeledPoint.parse).cache()
 
 val testData = 
 ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/testing/").map(LabeledPoint.parse)
 
 val numFeatures = 3
 val model = new 
 StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures))
 
 model.trainOn(trainingData)
 model.predictOnValues(testData.map(lp => (lp.label, 
 lp.features))).print()
 
 ssc.start()
 ssc.awaitTermination()
 
   }
 
 }
 
 Compiled code and run it
 Put file contains
   (1.0,[2.0,2.0,2.0])
   (2.0,[3.0,3.0,3.0])
   (3.0,[4.0,4.0,4.0])
   (4.0,[5.0,5.0,5.0])
   (5.0,[6.0,6.0,6.0])
   (6.0,[7.0,7.0,7.0])
   (7.0,[8.0,8.0,8.0])
   (8.0,[9.0,9.0,9.0])
   (9.0,[10.0,10.0,10.0])
 in to training directory.
 
 I can see that the model's weights change:
 15/03/14 08:53:40 INFO StreamingLinearRegressionWithSGD: Current model: 
 weights, [7.333,7.333,7.333]
 
 Now I can put whatever I want into the testing directory, but I cannot understand
 the answer.
 For example, I can put the same file I used for training into the testing
 directory. The file content is
   (1.0,[2.0,2.0,2.0])
   (2.0,[3.0,3.0,3.0])
   (3.0,[4.0,4.0,4.0])
   (4.0,[5.0,5.0,5.0])
   (5.0,[6.0,6.0,6.0])
   (6.0,[7.0,7.0,7.0])
   (7.0,[8.0,8.0,8.0])
   (8.0,[9.0,9.0,9.0])
   (9.0,[10.0,10.0,10.0])
 
 And answer will be
 (1.0,0.0)
 (2.0,0.0)
 (3.0,0.0)
 (4.0,0.0)
 (5.0,0.0)
 (6.0,0.0)
 (7.0,0.0)
 (8.0,0.0)
 (9.0,0.0)
 
 And in case my file content is
   (0.0,[2.0,2.0,2.0])
   (0.0,[3.0,3.0,3.0])
   (0.0,[4.0,4.0,4.0])
   (0.0,[5.0,5.0,5.0])
   (0.0,[6.0,6.0,6.0])
   (0.0,[7.0,7.0,7.0])
   (0.0,[8.0,8.0,8.0])
   (0.0,[9.0,9.0,9.0])
   (0.0,[10.0,10.0,10.0])
 
 the answer will be:
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 
 I expect to get the label predicted by the model.
 -- 
 Margus (margusja) Roo
 http://margus.roo.ee
 skype: margusja
 +372 51 480
 



Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted,

Did you find any solution?

Thanks
Sandeep

On Mon, Mar 16, 2015 at 10:44 AM, sandeep vura sandeepv...@gmail.com
wrote:

 Hi Ted,

 I am using Spark -1.2.1 and hive -0.13.1 you can check my configuration
 files attached below.

 
 ERROR IN SPARK
 
 n: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
         at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:101)
         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:622)
         at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
         at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
         at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
         at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
         at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
         at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
         at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
         ... 9 more
 Caused by: java.lang.reflect.InvocationTargetException
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
         at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
         at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
         ... 14 more
 Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
 NestedThrowables:
 java.lang.reflect.InvocationTargetException
         at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
         at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
         at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
         at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:622)
         at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
         at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
         at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
         at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
         at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
         at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
         at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
         at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
         at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
         at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
         at