Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Thanks Sean, your suggestions and the links provided are just what I needed
to start off with.

On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote:

 I think you're assuming that you will pre-compute recommendations and
 store them in Mongo. That's one way to go, with certain tradeoffs. You
 can precompute offline easily, and serve results at large scale
 easily, but, you are forced to precompute everything -- lots of wasted
 effort, not completely up to date.

 The front-end part of the stack looks right.

 Spark would do the model building; you'd have to write a process to
 score recommendations and store the result. Mahout is the same thing,
 really.

 500K items isn't all that large. Your requirements aren't driven just
 by items though. Number of users and latent features matter too. It
 matters how often you want to build the model too. I'm guessing you
 would get away with a handful of modern machines for a problem this
 size.


 In a way what you describe reminds me of Wibidata, since it built
 recommender-like solutions on top of data and results published to a
 NoSQL store. You might glance at the related OSS project Kiji
 (http://kiji.org/) for ideas about how to manage the schema.

 You should have a look at things like Nick's architecture for
 Graphflow, however it's more concerned with computing recommendation
 on the fly, and describes a shift from an architecture originally
 built around something like a NoSQL store:

 http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

 This is also the kind of ground the oryx project is intended to cover,
 something I've worked on personally:
 https://github.com/OryxProject/oryx   -- a layer on and around the
 core model building in Spark + Spark Streaming to provide a whole
 recommender (for example), down to the REST API.

 On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
 raoshashidhar...@gmail.com wrote:
  Hi,
 
  Can anyone who has developed recommendation engine suggest what could be
 the
  possible software stack for such an application.
 
  I am basically new to recommendation engine , I just found out Mahout and
  Spark Mlib which are available .
  I am thinking the below software stack.
 
  1. The user is going to use Android app.
  2.  Rest Api sent to app server from the android app to get
 recommendations.
  3. Spark Mlib core engine for recommendation engine
  4. MongoDB database backend.
 
  I would like to know more on the cluster configuration( how many nodes
 etc)
  part of spark for calculating the recommendations for 500,000 items. This
  items include products for day care etc.
 
  Other software stack suggestions would also be very useful.It has to run
 on
  multiple vendor machines.
 
  Please suggest.
 
  Thanks
  shashi



Re: [Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Cheng Lian
Currently there's no convenient way to convert a SchemaRDD/JavaSchemaRDD back 
to an RDD/JavaRDD of some case class. But you can convert a 
SchemaRDD/JavaSchemaRDD into an RDD[Row]/JavaRDD<Row> using schemaRdd.rdd and 
new JavaRDD<Row>(schemaRdd.rdd).
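For example, a minimal sketch of the Scala side (the Java side is analogous, going through new JavaRDD<Row>(schemaRdd.rdd)); the Person case class, file name, and column order here are made up for illustration:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical case class matching the Parquet schema (name: String, age: Int).
case class Person(name: String, age: Int)

// Assumes an existing SQLContext; a SchemaRDD is itself an RDD[Row],
// so rebuilding the case class is a plain map over Row objects.
val schemaRDD = sqlContext.parquetFile("people.parquet")
val people: RDD[Person] = schemaRDD.map(row => Person(row.getString(0), row.getInt(1)))
```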


Cheng

On 3/15/15 10:22 PM, Renato Marroquín Mogrovejo wrote:


Hi Spark experts,

Is there a way to convert a JavaSchemaRDD (for instance loaded from a 
parquet file) back to a JavaRDD of a given case class? I read on 
stackOverFlow[1] that I could do a select over the parquet file and 
then by reflection get the fields out, but I guess that would be an 
overkill.
Then I saw [2] from 2014 which says that this feature would be 
available in the future. So could you please let me know how I can 
accomplish this? Thanks in advance!



Renato M.

[1] 
http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class
[2] 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html




Re: Spark Streaming on Yarn Input from Flume

2015-03-15 Thread tarek_abouzeid
Have you fixed this issue?






Re: 1.3 release

2015-03-15 Thread Sean Owen
I think (I hope) it's because the generic builds just work. Even
though these are of course distributed mostly verbatim in CDH5, with
tweaks to be compatible with other stuff at the edges, the stock
builds should be fine too. Same for HDP as I understand.

The CDH4 build may work on some builds of CDH4, but I think it is lurking
there as a Hadoop 2.0.x plus a certain YARN beta build. I'd prefer
to rename it that way, myself, since it doesn't actually work with all
of CDH4 anyway.

Are the MapR builds there because the stock Hadoop build doesn't work
on MapR? That would actually surprise me, but then, why are these two
builds distributed?


On Sun, Mar 15, 2015 at 6:22 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
 Is there a reason why the prebuilt releases don't include current CDH distros 
 and YARN support?

 
 Eric Friedman



[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Renato Marroquín Mogrovejo
Hi Spark experts,

Is there a way to convert a JavaSchemaRDD (for instance loaded from a
Parquet file) back to a JavaRDD of a given case class? I read on
Stack Overflow [1] that I could do a select over the Parquet file and then
get the fields out by reflection, but I guess that would be overkill.
Then I saw [2], from 2014, which says that this feature would be available in
the future. So could you please let me know how I can accomplish this?
Thanks in advance!


Renato M.

[1]
http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class
[2]
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html


Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread David Mitchell
Thank you for your help.  toDF() solved my first problem.  And, the
second issue was a non-issue, since the second example worked without any
modification.

David


On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Programmatically specifying the schema needs

  import org.apache.spark.sql.types._

 for StructType and StructField to resolve.

 On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:

 Yes I think this was already just fixed by:

 https://github.com/apache/spark/pull/4977

 a .toDF() is missing

 On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  I've found people.toDF gives you a data frame (roughly equivalent to the
  previous Row RDD),
 
  And you can then call registerTempTable on that DataFrame.
 
  So people.toDF.registerTempTable("people") should work
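For reference, a minimal sketch of that fix against the 1.3.0 guide example (assumes the Person case class and people: RDD[Person] from the guide):

```scala
// In 1.3.0, toDF() comes from the implicit conversions on the SQLContext.
import sqlContext.implicits._

val peopleDF = people.toDF()           // RDD[Person] -> DataFrame
peopleDF.registerTempTable("people")   // registerTempTable is defined on DataFrame, not on RDD
sqlContext.sql("SELECT name FROM people").show()
```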
 
 
 
  —
  Sent from Mailbox
 
 
  On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell 
 jdavidmitch...@gmail.com
  wrote:
 
 
  I am pleased with the release of the DataFrame API.  However, I started
  playing with it, and neither of the two main examples in the
 documentation
  work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
 
  Specifically:
 
  Inferring the Schema Using Reflection
  Programmatically Specifying the Schema
 
 
  Scala 2.11.6
  Spark 1.3.0 prebuilt for Hadoop 2.4 and later
 
  Inferring the Schema Using Reflection
  scala> people.registerTempTable("people")
  <console>:31: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]
         people.registerTempTable("people")
                ^
 
  Programmatically Specifying the Schema
  scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
  <console>:41: error: overloaded method value createDataFrame with alternatives:
    (rdd: org.apache.spark.api.java.JavaRDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rdd: org.apache.spark.rdd.RDD[_], beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], columns: java.util.List[String])org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
    (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
   cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
         val df = sqlContext.createDataFrame(people, schema)
 
  Any help would be appreciated.
 
  David
 
 






-- 
### Confidential e-mail, for recipient's (or recipients') eyes only, not
for distribution. ###


Saving Dstream into a single file

2015-03-15 Thread tarek_abouzeid
I am doing the word count example on a Flume stream and trying to save the output as
text files in HDFS, but in the save directory I get multiple subdirectories, each
holding files of small size. I wonder if there is a way to append to one large file
instead of saving to multiple files, as I intend to save the output in a Hive HDFS
directory so I can query the result using Hive.

I hope someone has a workaround for this issue. Thanks in advance.






Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Sean Owen
I think you're assuming that you will pre-compute recommendations and
store them in Mongo. That's one way to go, with certain tradeoffs. You
can precompute offline easily, and serve results at large scale
easily, but, you are forced to precompute everything -- lots of wasted
effort, not completely up to date.

The front-end part of the stack looks right.

Spark would do the model building; you'd have to write a process to
score recommendations and store the result. Mahout is the same thing,
really.

500K items isn't all that large. Your requirements aren't driven just
by items though. Number of users and latent features matter too. It
matters how often you want to build the model too. I'm guessing you
would get away with a handful of modern machines for a problem this
size.


In a way what you describe reminds me of Wibidata, since it built
recommender-like solutions on top of data and results published to a
NoSQL store. You might glance at the related OSS project Kiji
(http://kiji.org/) for ideas about how to manage the schema.

You should have a look at things like Nick's architecture for
Graphflow, however it's more concerned with computing recommendation
on the fly, and describes a shift from an architecture originally
built around something like a NoSQL store:
http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

This is also the kind of ground the oryx project is intended to cover,
something I've worked on personally:
https://github.com/OryxProject/oryx   -- a layer on and around the
core model building in Spark + Spark Streaming to provide a whole
recommender (for example), down to the REST API.

On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
raoshashidhar...@gmail.com wrote:
 Hi,

 Can anyone who has developed recommendation engine suggest what could be the
 possible software stack for such an application.

 I am basically new to recommendation engine , I just found out Mahout and
 Spark Mlib which are available .
 I am thinking the below software stack.

 1. The user is going to use Android app.
 2.  Rest Api sent to app server from the android app to get recommendations.
 3. Spark Mlib core engine for recommendation engine
 4. MongoDB database backend.

 I would like to know more on the cluster configuration( how many nodes etc)
 part of spark for calculating the recommendations for 500,000 items. This
 items include products for day care etc.

 Other software stack suggestions would also be very useful.It has to run on
 multiple vendor machines.

 Please suggest.

 Thanks
 shashi




Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
Ah most interesting—thanks.

So it seems sc.textFile(longFileList) has to read all the metadata before starting 
the read, for partitioning purposes, so what you do is simply not use it? 

You create a task per file that reads one file (in parallel) per task without 
scanning for _all_ metadata. Can't argue with the logic, but perhaps Spark 
should incorporate something like this in sc.textFile? My case can't be that 
unusual, especially since I am periodically processing micro-batches from Spark 
Streaming. In fact I have to scan HDFS to create the longFileList to begin with, 
so I get file status and therefore probably all the metadata needed by 
sc.textFile. Your method would save one scan, which is good.

Might a better sc.textFile take a beginning URI, a file pattern regex, and a 
recursive flag? Then one scan could create all the metadata automatically for a 
large subset of people using the function, something like 

sc.textFile(beginDir: String, filePattern: String = "^part.*", recursive: Boolean = false)

In fact it should be easy to create a BetterSC that overrides the textFile method 
with a re-implementation that only requires one scan to get the metadata. 

Just thinking on email…

On Mar 14, 2015, at 11:11 AM, Michael Armbrust mich...@databricks.com wrote:


Here is how I have dealt with many small text files (on s3 though this should 
generalize) in the past:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
 


 
From: Michael Armbrust mich...@databricks.com
Subject: Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Date: Thu, 27 Nov 2014 03:20:14 GMT
In the past I have worked around this problem by avoiding sc.textFile().
Instead I read the data directly inside of a Spark job.  Basically, you
start with an RDD where each entry is a file in S3 and then flatMap that
with something that reads the files and returns the lines.

Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe

Using this class you can do something like:

sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... ::
Nil).flatMap(new ReadLinesSafe(_))

You can also build up the list of files by running a Spark job:
https://gist.github.com/marmbrus/15e72f7bc22337cf6653
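A rough sketch of the same idea in Scala, without the gist's ReadLinesSafe helper (the HDFS paths below are made up, and each file is assumed small enough to read into memory within a single task):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

// Hypothetical list of input files; in practice this could come from a directory listing.
val fileList = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001")

// One partition per file, so each file is opened and read inside its own task.
val lines = sc.parallelize(fileList, fileList.size).flatMap { pathStr =>
  val path = new Path(pathStr)
  val in = path.getFileSystem(new Configuration()).open(path)
  try Source.fromInputStream(in).getLines().toVector finally in.close()
}
```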

Michael

On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com wrote:
It’s a long story but there are many dirs with smallish part- files in them 
so we create a list of the individual files as input to 
sparkContext.textFile(fileList). I suppose we could move them and rename them 
to be contiguous part- files in one dir. Would that be better than passing 
in a long list of individual filenames? We could also make the part files much 
larger by collecting the smaller ones. But would any of this make a difference 
in IO speed?

I ask because using the long file list seems to read, what amounts to a not 
very large data set rather slowly. If it were all in large part files in one 
dir I’d expect it to go much faster but this is just intuition.


On Mar 14, 2015, at 9:58 AM, Koert Kuipers ko...@tresata.com wrote:

Why can you not put them in a directory and read them as one input? You will 
get a task per file, but Spark is very fast at executing many tasks (it's not a 
JVM per task).

On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel p...@occamsmachete.com wrote:
Any advice on dealing with a large number of separate input files?


On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote:

We have many text files that we need to read in parallel. We can create a comma 
delimited list of files to pass in to sparkContext.textFile(fileList). The list 
can get very large (maybe 1) and is all on hdfs.

The question is: what is the most performant way to read them? Should they be 
broken up and read in groups, appending the resulting RDDs, or should we just 
pass in the entire list at once? In effect I'm asking whether Spark does some 
optimization or whether we should do it explicitly. If the latter, what rule 
might we use depending on our cluster setup?

Re: deploying Spark on standalone cluster

2015-03-15 Thread tarek_abouzeid
I was having a similar issue, but it was in Spark and Flume integration: I was
getting a failed-to-bind error. I got it fixed by shutting down the firewall
on both machines (make sure: service iptables status = firewall stopped).







1.3 release

2015-03-15 Thread Eric Friedman
Is there a reason why the prebuilt releases don't include current CDH distros 
and YARN support?


Eric Friedman



Null Pointer Exception due to mapVertices function in GraphX

2015-03-15 Thread James
I have got a NullPointerException in aggregateMessages on a graph which is
the output of the mapVertices function of another graph. I found the problem is
that the mapVertices function did not affect all the triplets of the
graph.

// Initialize the graph: assign to each vertex a counter that contains the vertex id only
var anfGraph = graph.mapVertices { case (vid, _) =>
  val counter = new HyperLogLog(5)
  counter.offer(vid)
  counter
}

// There is an edge whose src attr is null
val nullVertex = anfGraph.triplets.filter(edge => edge.srcAttr == null).first

// But I could see that the vertex itself has a non-null attribute
anfGraph.vertices.filter(_._1 == nullVertex).first

messages = anfGraph.aggregateMessages(msgFun, mergeMessage)   // <- NullPointerException


My spark version:1.2.0


Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
In the LBFGS version of logistic regression, the data is properly
standardized, so this should not happen. Can you provide a copy of
your dataset to us so we can test it? If the dataset cannot be made
public, can you just send me a copy so I can dig into this? I'm
the author of LORWithLBFGS. Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Fri, Mar 13, 2015 at 2:41 PM, cjwang c...@cjwang.us wrote:
 I am running LogisticRegressionWithLBFGS.  I got these lines on my console:

 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to 0.5
 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to 0.25
 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to 0.125
 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.0625
 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.03125
 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.015625
 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.0078125
 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch |
 Encountered bad values in function evaluation. Decreasing step size to
 0.005859375
 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line
 search t: NaN fval: NaN rhs: NaN cdd: NaN
 2015-03-12 17:38:05,257 ERROR breeze.optimize.LBFGS | Failure! Resetting
 history: breeze.optimize.FirstOrderException: Line search zoom failed


 What causes them and how do I fix them?

 I checked my data and there seemed nothing out of the ordinary.  The
 resulting prediction model seemed acceptable to me.  So, are these ERRORs
 actually WARNINGs?  Could we or should we tune the level of these messages
 down one notch?






Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
Yes I don't think this is entirely reliable in general. I would emit
(label,features) pairs and then transform the values.

In practice, this may happen to work fine in simple cases.
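A minimal sketch of that suggestion in Scala (assuming data: RDD[LabeledPoint] and a fitted StandardScalerModel named scaler1, mirroring the docs example):

```scala
import org.apache.spark.SparkContext._   // for mapValues on pair RDDs (pre-1.3 style)

// Keep each label attached to its features, then transform only the values,
// so the label/feature pairing can never be broken by a later zip.
val pairs  = data.map(lp => (lp.label, lp.features))
val scaled = pairs.mapValues(v => scaler1.transform(v))
```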

On Sun, Mar 15, 2015 at 3:51 AM, kian.ho hui.kian.ho+sp...@gmail.com wrote:
 Hi, I was taking a look through the mllib examples in the official spark
 documentation and came across the following:
 http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2

 specifically the lines:

 label = data.map(lambda x: x.label)
 features = data.map(lambda x: x.features)
 ...
 ...
 data1 = label.zip(scaler1.transform(features))

 my question:
 wouldn't it be possible that some labels in the pairs returned by the
 label.zip(..) operation are not paired with their original features? i.e.
 are the original orderings of `labels` and `features` preserved after the
 scaler1.transform(..) and label.zip(..) operations?

 This issue was also mentioned in
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

 I would greatly appreciate some clarification on this, as I've run into this
 issue whilst experimenting with feature extraction for text classification,
 where (correct me if I'm wrong) there is no built-in mechanism to keep track
 of document-ids through the HashingTF and IDF fitting and transformations.

 Thanks.






Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Hi,

Can anyone who has developed a recommendation engine suggest a possible
software stack for such an application?

I am basically new to recommendation engines; I have just found Mahout and
Spark MLlib, which are available.
I am thinking of the software stack below.

1. The user is going to use an Android app.
2. A REST API call is sent from the Android app to the app server to get recommendations.
3. Spark MLlib as the core recommendation engine.
4. MongoDB as the database backend.

I would like to know more about the cluster configuration (how many nodes, etc.)
of the Spark part for calculating the recommendations for 500,000 items. These
items include products for day care, etc.

Other software stack suggestions would also be very useful. It has to run on
machines from multiple vendors.

Please suggest.

Thanks
shashi


Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread Cheng Lian
Spark SQL supports most commonly used features of HiveQL. However, 
different HiveQL statements are executed in different manners:


1. DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and
   commands (e.g. SET key = value, ADD FILE, ADD JAR, etc.)

   In most cases, Spark SQL simply delegates these statements to Hive,
   as they don’t need to issue any distributed jobs and don’t rely on
   the computation engine (Spark, MR, or Tez).

2. SELECT queries, CREATE TABLE ... AS SELECT ... statements and
   insertions

   These statements are executed using Spark as the execution engine.

The Hive classes packaged in the assembly jar are used to provide entry 
points to Hive features, for example:


1. HiveQL parser
2. Talking to Hive metastore to execute DDL statements
3. Accessing UDF/UDAF/UDTF

As for the differences between Hive on Spark and Spark SQL’s Hive 
support, please refer to this article by Reynold: 
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html


Cheng

On 3/14/15 10:53 AM, bit1...@163.com wrote:


Thanks Daoyuan.
What do you mean by running some native command? I never thought that 
Hive would run without a computing engine like Hadoop MR or Spark. Thanks.



bit1...@163.com

*From:* Wang, Daoyuan daoyuan.w...@intel.com
*Date:* 2015-03-13 16:39
*To:* bit1...@163.com; user user@spark.apache.org
*Subject:* RE: Explanation on the Hive in the Spark assembly

Hi bit1129,

1, hive in spark assembly removed most dependencies of hive on
Hadoop to avoid conflicts.

2, this hive is used to run some native command, which does not
rely on spark or mapreduce.

Thanks,

Daoyuan

*From:*bit1...@163.com [mailto:bit1...@163.com]
*Sent:* Friday, March 13, 2015 4:24 PM
*To:* user
*Subject:* Explanation on the Hive in the Spark assembly

Hi, sparkers,

I am kind of confused about hive in the spark assembly.  I think
hive in the spark assembly is not the same thing as Hive On
Spark(Hive On Spark,  is meant to run hive using spark execution
engine).

So, my question is:

1. What is the difference between Hive in the spark assembly and
Hive on Hadoop?

2. Does Hive in the spark assembly use Spark execution engine or
Hadoop MR engine?

Thanks.



bit1...@163.com mailto:bit1...@163.com




Benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs Spark SQL

2015-03-15 Thread Slim Baltagi
Hi

I would like to share with you my comments on Hortonworks' benchmarks of
'Hive on Tez' vs 'Hive on Spark' vs 'Spark SQL'.
Please check them in my related blog entry at http://goo.gl/K5mk0U

Thanks

Slim Baltagi
Chicago, IL
http://www.SparkBigData.com 






Re: Problem connecting to HBase

2015-03-15 Thread HARIPRIYA AYYALASOMAYAJULA
Hello all,

Thank you for your responses. I did try to include the
zookeeper.znode.parent property in the hbase-site.xml. It still continues
to give the same error.

I am using Spark 1.2.0 and hbase 0.98.9.

Could you please suggest what else could be done?


On Fri, Mar 13, 2015 at 10:25 PM, Ted Yu yuzhih...@gmail.com wrote:

 In HBaseTest.scala:
 val conf = HBaseConfiguration.create()
 You can add some log (for zookeeper.znode.parent, e.g.) to see if the
 values from hbase-site.xml are picked up correctly.

 Please use pastebin next time you want to post errors.

 Which Spark release are you using ?
 I assume it contains SPARK-1297

 Cheers

 On Fri, Mar 13, 2015 at 7:47 PM, HARIPRIYA AYYALASOMAYAJULA 
 aharipriy...@gmail.com wrote:


 Hello,

 I am running a HBase test case. I am using the example from the following:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

 I created a very small HBase table with 5 rows and 2 columns.
 I have attached a screenshot of the error log. I believe it is a problem
 where the driver program is unable to establish connection to the hbase.

 The following is my simple.sbt:

 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(

  "org.apache.spark" %% "spark-core" % "1.2.0",

  "org.apache.hbase" % "hbase" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-client" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-server" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-common" % "0.98.9-hadoop2" % "provided"
 )

 I am using a 23 node cluster, did copy hbase-site.xml into /spark/conf
 folder
 and set spark.executor.extraClassPath pointing to the /hbase/ folder in
 the spark-defaults.conf

 Also, while submitting the spark job I am including the required jars :

 spark-submit --class HBaseTest --master yarn-cluster
 --driver-class-path
  
 /opt/hbase/0.98.9/lib/hbase-server-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-protocol-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-hadoop2-compat-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-client-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-common-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/htrace-core-2.04.jar
  /home/priya/usingHBase/Spark/target/scala-2.10/simple-project_2.10-1.0.jar
 /Priya/sparkhbase-test1

 It would be great if you could point where I am going wrong, and what
 could be done to correct it.

 Thank you for your time.
 --
 Regards,
 Haripriya Ayyalasomayajula
 Graduate Student
 Department of Computer Science
 University of Houston
 Contact : 650-796-7112







-- 
Regards,
Haripriya Ayyalasomayajula
Graduate Student
Department of Computer Science
University of Houston
Contact : 650-796-7112


Re: Running spark function on parquet without sql

2015-03-15 Thread Cheng Lian
That's an unfortunate documentation bug in the programming guide... We 
failed to update it after making the change.


Cheng

On 2/28/15 8:13 AM, Deborah Siegel wrote:

Hi Michael,

Would you help me understand  the apparent difference here..

The Spark 1.2.1 programming guide indicates:

Note that if you call schemaRDD.cache() rather than 
sqlContext.cacheTable(...), tables will *not* be cached using the 
in-memory columnar format, and therefore 
sqlContext.cacheTable(...) is strongly recommended for this use case.


Yet the API doc (https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html) shows:

    def cache(): SchemaRDD.this.type

    Overridden cache function will always use the in-memory
    columnar caching.



links
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust 
mich...@databricks.com wrote:


From Zhan Zhang's reply, yes I still get the parquet's advantage.

You will need to at least use SQL or the DataFrame API (coming in
Spark 1.3) to specify the columns that you want in order to get
the parquet benefits.   The rest of your operations can be
standard Spark.

My next question is, if I operate on SchemaRdd will I get the
advantage of
Spark SQL's in memory columnar store when cached the table using
cacheTable()?


Yes, SchemaRDDs always use the in-memory columnar cache for
cacheTable and .cache() since Spark 1.2+
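So, for example, both of the following end up in the columnar cache on 1.2+ (a sketch; the file and table names are made up):

```scala
val schemaRDD = sqlContext.parquetFile("events.parquet")
schemaRDD.registerTempTable("events")

sqlContext.cacheTable("events")   // in-memory columnar cache
schemaRDD.cache()                 // SchemaRDD overrides cache(), so this is columnar too since 1.2
```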






Re: Issue with yarn cluster - hangs in accepted state.

2015-03-15 Thread abhi
Thanks,
It worked.

-Abhi

On Tue, Mar 3, 2015 at 5:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote:

  Do you have enough resource in your cluster? You can check your resource
 manager to see the usage.


 Yep, I can confirm that this is a very annoying issue. If there is not
 enough memory or VCPUs available, your app will just stay in ACCEPTED state
 until resources are available.

 You can have a look at

 https://github.com/jubatus/jubaql-docker/blob/master/hadoop/yarn-site.xml#L35
 to see some settings that might help.

 Tobias





Re: Writing wide parquet file in Spark SQL

2015-03-15 Thread Cheng Lian
This article by Ryan Blue should be helpful to understand the problem 
http://ingest.tips/2015/01/31/parquet-row-group-size/


The TL;DR is, you may decrease parquet.block.size to reduce memory 
consumption. Anyway, 100K columns is a really big burden for Parquet, 
but I guess your data should be pretty sparse.
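A sketch of that tweak, assuming the saveAsParquetFile call from the original post; the 64 MB value is only illustrative:

```scala
// Smaller row groups mean less data buffered in memory per open column writer.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
schemaRDD.saveAsParquetFile("/tmp/wide-table.parquet")
```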


Cheng

On 3/11/15 4:13 AM, kpeng1 wrote:


Hi All,

I am currently trying to write a very wide file into parquet using spark
sql.  I have 100K column records that I am trying to write out, but of
course I am running into space issues(out of memory - heap space).  I was
wondering if there are any tweaks or work arounds for this.

I am basically calling saveAsParquetFile on the schemaRDD.










Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Shashidhar Rao
Thanks Nick, for your suggestions.

On Sun, Mar 15, 2015 at 10:41 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 As Sean says, precomputing recommendations is pretty inefficient. Though
 with 500k items its easy to get all the item vectors in memory so
 pre-computing is not too bad.

 Still, since you plan to serve these via a REST service anyway, computing
 on demand via a serving layer such as Oryx or PredictionIO (or the newly
 open sourced Seldon.io) is a good option. You can also cache the
 recommendations quite aggressively - once you compute a user or item top-K
 list, just stick the result in mem cache / redis / whatever and evict it
 when you recompute your offline model, or every hour or whatever.


 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Thanks Sean, your suggestions and the links provided are just what I
 needed to start off with.

 On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote:

 I think you're assuming that you will pre-compute recommendations and
 store them in Mongo. That's one way to go, with certain tradeoffs. You
 can precompute offline easily, and serve results at large scale
 easily, but, you are forced to precompute everything -- lots of wasted
 effort, not completely up to date.

 The front-end part of the stack looks right.

 Spark would do the model building; you'd have to write a process to
 score recommendations and store the result. Mahout is the same thing,
 really.

 500K items isn't all that large. Your requirements aren't driven just
 by items though. Number of users and latent features matter too. It
 matters how often you want to build the model too. I'm guessing you
 would get away with a handful of modern machines for a problem this
 size.


 In a way what you describe reminds me of Wibidata, since it built
 recommender-like solutions on top of data and results published to a
 NoSQL store. You might glance at the related OSS project Kiji
 (http://kiji.org/) for ideas about how to manage the schema.

 You should have a look at things like Nick's architecture for
 Graphflow, however it's more concerned with computing recommendation
 on the fly, and describes a shift from an architecture originally
 built around something like a NoSQL store:

 http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

 This is also the kind of ground the oryx project is intended to cover,
 something I've worked on personally:
 https://github.com/OryxProject/oryx   -- a layer on and around the
 core model building in Spark + Spark Streaming to provide a whole
 recommender (for example), down to the REST API.

 On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
 raoshashidhar...@gmail.com wrote:
  Hi,
 
  Can anyone who has developed recommendation engine suggest what could
 be the
  possible software stack for such an application.
 
  I am basically new to recommendation engine , I just found out Mahout
 and
  Spark Mlib which are available .
  I am thinking the below software stack.
 
  1. The user is going to use Android app.
  2.  Rest Api sent to app server from the android app to get
 recommendations.
  3. Spark Mlib core engine for recommendation engine
  4. MongoDB database backend.
 
  I would like to know more on the cluster configuration( how many nodes
 etc)
  part of spark for calculating the recommendations for 500,000 items.
 This
  items include products for day care etc.
 
  Other software stack suggestions would also be very useful.It has to
 run on
  multiple vendor machines.
 
  Please suggest.
 
  Thanks
  shashi






Re: Streaming linear regression example question

2015-03-15 Thread Margus Roo

Hi again,

I tried the same 
examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala 
from 1.3.0.

In case the testing file content is:
  (0.0,[3.0,4.0,3.0])
  (0.0,[4.0,4.0,4.0])
  (4.0,[5.0,5.0,5.0])
  (5.0,[5.0,6.0,6.0])
  (6.0,[7.0,4.0,7.0])
  (7.0,[8.0,6.0,8.0])
  (8.0,[44.0,9.0,9.0])
  (9.0,[14.0,30.0,10.0])

and the answer:
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(4.0,0.0)
(5.0,0.0)
(6.0,0.0)
(7.0,0.0)
(8.0,0.0)
(9.0,0.0)

What is wrong?
I can see that the model's weights are changing when I put new data into 
the training dir.


Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 480

On 14/03/15 09:05, Margus Roo wrote:

Hi

I try to understand example provided in 
https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - 
Streaming linear regression


Code:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

object StreamingLinReg {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("StreamLinReg").setMaster("local[2]")

    val ssc = new StreamingContext(conf, Seconds(10))

    val trainingData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/training/").map(LabeledPoint.parse).cache()

    val testData = ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/testing/").map(LabeledPoint.parse)

    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()

  }

}

I compiled the code and ran it.
I put a file containing
  (1.0,[2.0,2.0,2.0])
  (2.0,[3.0,3.0,3.0])
  (3.0,[4.0,4.0,4.0])
  (4.0,[5.0,5.0,5.0])
  (5.0,[6.0,6.0,6.0])
  (6.0,[7.0,7.0,7.0])
  (7.0,[8.0,8.0,8.0])
  (8.0,[9.0,9.0,9.0])
  (9.0,[10.0,10.0,10.0])
into the training directory.

I can see that the model's weights change:
15/03/14 08:53:40 INFO StreamingLinearRegressionWithSGD: Current model: weights, [7.333,7.333,7.333]


Now I can put whatever I want into the testing directory, but I cannot 
understand the answer.
For example, I can put the same file I used for training into the testing 
directory. The file content is

  (1.0,[2.0,2.0,2.0])
  (2.0,[3.0,3.0,3.0])
  (3.0,[4.0,4.0,4.0])
  (4.0,[5.0,5.0,5.0])
  (5.0,[6.0,6.0,6.0])
  (6.0,[7.0,7.0,7.0])
  (7.0,[8.0,8.0,8.0])
  (8.0,[9.0,9.0,9.0])
  (9.0,[10.0,10.0,10.0])

And the answer will be
(1.0,0.0)
(2.0,0.0)
(3.0,0.0)
(4.0,0.0)
(5.0,0.0)
(6.0,0.0)
(7.0,0.0)
(8.0,0.0)
(9.0,0.0)

And in case my file content is
  (0.0,[2.0,2.0,2.0])
  (0.0,[3.0,3.0,3.0])
  (0.0,[4.0,4.0,4.0])
  (0.0,[5.0,5.0,5.0])
  (0.0,[6.0,6.0,6.0])
  (0.0,[7.0,7.0,7.0])
  (0.0,[8.0,8.0,8.0])
  (0.0,[9.0,9.0,9.0])
  (0.0,[10.0,10.0,10.0])

the answer will be:
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)
(0.0,0.0)

I expect to get the label predicted by the model.
--
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 480




Submitting spark application using Yarn Rest API

2015-03-15 Thread Srini Karri
Hi All,

I am trying to submit a Spark application using the YARN REST API. I am able
to submit the application, but the final status shows as 'UNDEFINED'. A couple of
other observations:

User shows as Dr.who
Application type is empty though I specify it as Spark

Has anyone had this problem before?

I am creating the app id using:

http://{cluster host}/ws/v1/cluster/apps/new-application

Post Request Url: http://{cluster host}/ws/v1/cluster/apps

Here is the request body:

{
  "application-id": "application_1426273041023_0055",
  "application-name": "test",
  "am-container-spec": {
    "credentials": {
      "secrets": {
        "entry": [
          {
            "key": "user.name",
            "value": "x"
          }
        ]
      }
    },
    "commands": {
      "command": "%SPARK_HOME%/bin/spark-submit.cmd --class org.apache.spark.examples.SparkPi --conf spark.yarn.jar=hdfs://xxx/apps/spark/spark-1.2.1-hadoop2.6/spark-assembly-1.2.1-hadoop2.6.0.jar --master yarn-cluster hdfs://xxx/apps/spark/spark-1.2.1-hadoop2.6/spark-examples-1.2.1-hadoop2.6.0.jar"
    }
  },
  "application-type": "SPARK"
}


Re: Read Parquet file from scala directly

2015-03-15 Thread Cheng Lian

The parquet-tools code should be pretty helpful (although it's Java)

https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools/src/main/java/parquet/tools/command
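If you just need to iterate over records from Scala, here is a rough sketch along the same lines (it uses the pre-rename parquet.* packages that parquet-mr shipped at the time; the package names and the ParquetReader.builder call are assumptions to double-check against your Parquet version):

```scala
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetReader
import parquet.hadoop.example.GroupReadSupport
import parquet.example.data.Group

val reader = ParquetReader.builder(new GroupReadSupport(), new Path("/path/to/file.parquet")).build()
// read() returns null once the file is exhausted.
var record: Group = reader.read()
while (record != null) {
  println(record)   // Group.toString dumps field names and values
  record = reader.read()
}
reader.close()
```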

On 3/10/15 12:25 AM, Shuai Zheng wrote:


Hi All,

I have a lot of parquet files, and I try to open them directly instead 
of load them into RDD in driver (so I can optimize some performance 
through special logic).


But I do some research online and can’t find any example to access 
parquet directly from scala, anyone has done this before?


Regards,

Shuai





Re: From Spark web ui, how to prove the parquet column pruning working

2015-03-15 Thread Cheng Lian

Hey Yong,

It seems that Hadoop `FileSystem` adds the size of a block to the 
metrics even if you only touch a fraction of it (reading Parquet 
metadata for example). This behavior can be verified by the following 
snippet:


```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sc._
import sqlContext._

case class KeyValue(key: Int, value: String)

parallelize(1 to 1024 * 1024 * 20).
  flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

hadoopConfiguration.set("parquet.task.side.metadata", "true")
sql("SET spark.sql.parquet.filterPushdown=true")

parquetFile("large.parquet").where('key === 0).queryExecution.toRdd.mapPartitions { _ =>
  new Iterator[Row] {
    def hasNext = false
    def next() = ???
  }
}.collect()
```

Apparently we’re reading nothing here (except for Parquet metadata in 
the footers), but the web UI still suggests that the input size of all 
tasks equals the file size.


Cheng


On 3/10/15 3:15 AM, java8964 wrote:
Hi, currently most of the data in our production is using Avro + 
Snappy. I want to test the benefits if we store the data in Parquet 
format. I changed our ETL to generate the Parquet format, instead 
of Avro, and want to test a simple SQL query in Spark SQL, to verify the 
benefits from Parquet.


I generated the same dataset in both Avro and Parquet in HDFS, and 
loaded them both in Spark SQL. Now when I run the same query, like 
select column1 from src_table_avro/parquet where column2=xxx, I can see 
that for the Parquet data format the job runs much faster. The test 
file sizes for both formats are around 930M. The Avro job generated 8 
tasks to read the data with 21s as the median duration, vs. the Parquet job, 
which generated 7 tasks to read the data with 0.4s as the median duration.


Since the dataset has more than 100 columns, I can see the Parquet 
file really does come with a fast read. But my question is that, from the 
Spark UI, both jobs show 900M as the input size, and 0 for the rest; in 
this case, how do I know column pruning really works? I think it is 
the reason the Parquet file can be read so fast, but is there any 
statistic that can prove that to me on the Spark UI? Something like: the 
total input file size is 900M, but only 10M was really read due to column 
pruning? That way, in case column pruning does not work in Parquet for 
some kind of SQL query, I can identify it in the first place.


Thanks

Yong




Re: Software stack for Recommendation engine with spark mlib

2015-03-15 Thread Nick Pentreath
As Sean says, precomputing recommendations is pretty inefficient. Though with 
500k items it's easy to get all the item vectors in memory, so pre-computing is 
not too bad.




Still, since you plan to serve these via a REST service anyway, computing on 
demand via a serving layer such as Oryx or PredictionIO (or the newly open 
sourced Seldon.io) is a good option. You can also cache the recommendations 
quite aggressively - once you compute a user or item top-K list, just stick the 
result in mem cache / redis / whatever and evict it when you recompute your 
offline model, or every hour or whatever.






—
Sent from Mailbox

On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao
raoshashidhar...@gmail.com wrote:

 Thanks Sean, your suggestions and the links provided are just what I needed
 to start off with.
 On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote:
 I think you're assuming that you will pre-compute recommendations and
 store them in Mongo. That's one way to go, with certain tradeoffs. You
 can precompute offline easily, and serve results at large scale
 easily, but, you are forced to precompute everything -- lots of wasted
 effort, not completely up to date.

 The front-end part of the stack looks right.

 Spark would do the model building; you'd have to write a process to
 score recommendations and store the result. Mahout is the same thing,
 really.

 500K items isn't all that large. Your requirements aren't driven just
 by items though. Number of users and latent features matter too. It
 matters how often you want to build the model too. I'm guessing you
 would get away with a handful of modern machines for a problem this
 size.


 In a way what you describe reminds me of Wibidata, since it built
 recommender-like solutions on top of data and results published to a
 NoSQL store. You might glance at the related OSS project Kiji
 (http://kiji.org/) for ideas about how to manage the schema.

 You should have a look at things like Nick's architecture for
 Graphflow, however it's more concerned with computing recommendation
 on the fly, and describes a shift from an architecture originally
 built around something like a NoSQL store:

 http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf

 This is also the kind of ground the oryx project is intended to cover,
 something I've worked on personally:
 https://github.com/OryxProject/oryx   -- a layer on and around the
 core model building in Spark + Spark Streaming to provide a whole
 recommender (for example), down to the REST API.

 On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao
 raoshashidhar...@gmail.com wrote:
  Hi,
 
  Can anyone who has developed recommendation engine suggest what could be
 the
  possible software stack for such an application.
 
  I am basically new to recommendation engine , I just found out Mahout and
  Spark Mlib which are available .
  I am thinking the below software stack.
 
  1. The user is going to use Android app.
  2.  Rest Api sent to app server from the android app to get
 recommendations.
  3. Spark Mlib core engine for recommendation engine
  4. MongoDB database backend.
 
  I would like to know more on the cluster configuration( how many nodes
 etc)
  part of spark for calculating the recommendations for 500,000 items. This
  items include products for day care etc.
 
  Other software stack suggestions would also be very useful.It has to run
 on
  multiple vendor machines.
 
  Please suggest.
 
  Thanks
  shashi


Re: Is there any problem in having a long opened connection to spark sql thrift server

2015-03-15 Thread Cheng Lian
It should be OK. If you encounter problems with a long-lived open 
connection to the Thrift server, that would be a bug.


Cheng

On 3/9/15 6:41 PM, fanooos wrote:

I have some applications developed using PHP and currently we have a problem
in connecting these applications to spark sql thrift server. ( Here is the
problem I am talking about.
http://apache-spark-user-list.1001560.n3.nabble.com/Connection-PHP-application-to-Spark-Sql-thrift-server-td21925.html
)


Until I find a solution to this problem, there is a suggestion to make a
little java application that connects to spark sql thrift server and provide
an API to the PHP applications to executes the required queries.

From my little experience, opening a connection and closing it for each
query is not recommended (I am talking from my experience in working with
CRUD applications the deals with some kind of database).

1- Is the same recommendation applied to working with spark sql thrift
server ?
2- If yes, Is there any problem in having one connection connected for a
long time with the spark sql thrift server?






Re: Problem connecting to HBase

2015-03-15 Thread Ted Yu
 org.apache.hbase % hbase % 0.98.9-hadoop2 % provided,

There is no module in HBase 0.98.9 called hbase. But this would not be the
root cause of the error.

Most likely hbase-site.xml was not picked up, meaning this is a classpath
issue.
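A quick way to check that (just a sketch): log what the driver actually resolves before creating the connection.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration

val conf = HBaseConfiguration.create()
// If hbase-site.xml was picked up, these should print your cluster's values
// rather than the defaults ("/hbase" and "localhost").
println("zookeeper.znode.parent = " + conf.get("zookeeper.znode.parent"))
println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"))
```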

On Sun, Mar 15, 2015 at 10:04 AM, HARIPRIYA AYYALASOMAYAJULA 
aharipriy...@gmail.com wrote:

 Hello all,

 Thank you for your responses. I did try to include the
 zookeeper.znode.parent property in the hbase-site.xml. It still continues
 to give the same error.

 I am using Spark 1.2.0 and hbase 0.98.9.

 Could you please suggest what else could be done?


 On Fri, Mar 13, 2015 at 10:25 PM, Ted Yu yuzhih...@gmail.com wrote:

 In HBaseTest.scala:
 val conf = HBaseConfiguration.create()
 You can add some log (for zookeeper.znode.parent, e.g.) to see if the
 values from hbase-site.xml are picked up correctly.

 Please use pastebin next time you want to post errors.

 Which Spark release are you using ?
 I assume it contains SPARK-1297

 Cheers

 On Fri, Mar 13, 2015 at 7:47 PM, HARIPRIYA AYYALASOMAYAJULA 
 aharipriy...@gmail.com wrote:


 Hello,

 I am running a HBase test case. I am using the example from the
 following:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

 I created a very small HBase table with 5 rows and 2 columns.
 I have attached a screenshot of the error log. I believe it is a problem
 where the driver program is unable to establish connection to the hbase.

 The following is my simple.sbt:

 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies ++= Seq(

  "org.apache.spark" %% "spark-core" % "1.2.0",

  "org.apache.hbase" % "hbase" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-client" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-server" % "0.98.9-hadoop2" % "provided",

  "org.apache.hbase" % "hbase-common" % "0.98.9-hadoop2" % "provided"
 )

 I am using a 23 node cluster, did copy hbase-site.xml into /spark/conf
 folder
 and set spark.executor.extraClassPath pointing to the /hbase/ folder in
 the spark-defaults.conf

 Also, while submitting the spark job I am including the required jars :

 spark-submit --class HBaseTest --master yarn-cluster
 --driver-class-path
  
 /opt/hbase/0.98.9/lib/hbase-server-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-protocol-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-hadoop2-compat-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-client-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/hbase-common-0.98.9-hadoop2.jar:/opt/hbase/0.98.9/lib/htrace-core-2.04.jar
  /home/priya/usingHBase/Spark/target/scala-2.10/simple-project_2.10-1.0.jar
 /Priya/sparkhbase-test1

 It would be great if you could point where I am going wrong, and what
 could be done to correct it.

 Thank you for your time.
 --
 Regards,
 Haripriya Ayyalasomayajula
 Graduate Student
 Department of Computer Science
 University of Houston
 Contact : 650-796-7112







 --
 Regards,
 Haripriya Ayyalasomayajula
 Graduate Student
 Department of Computer Science
 University of Houston
 Contact : 650-796-7112



Re: Spark 1.2 – How to change Default (Random) port ….

2015-03-15 Thread Shailesh Birari
Hi SM,

Apologies for the delayed response. 
No, the issue is with Spark 1.2.0; there is a bug in Spark 1.2.0.
Spark recently had its 1.3.0 release, so it might be fixed there.
I am not planning to test it soon, maybe after some time.
You can try it.

Regards,
  Shailesh






Re: Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread bit1...@163.com
Thanks Cheng for the great explanation!



bit1...@163.com
 
From: Cheng Lian
Date: 2015-03-16 00:53
To: bit1...@163.com; Wang, Daoyuan; user
Subject: Re: Explanation on the Hive in the Spark assembly
Spark SQL supports most commonly used features of HiveQL. However, different
HiveQL statements are executed in different manners:

1. DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and commands (e.g. SET
key = value, ADD FILE, ADD JAR, etc.): in most cases, Spark SQL simply delegates
these statements to Hive, as they don’t need to issue any distributed jobs and
don’t rely on the computation engine (Spark, MR, or Tez).

2. SELECT queries, CREATE TABLE ... AS SELECT ... statements, and insertions:
these statements are executed using Spark as the execution engine.

The Hive classes packaged in the assembly jar are used to provide entry points
to Hive features, for example:

- the HiveQL parser
- talking to the Hive metastore to execute DDL statements
- accessing UDFs/UDAFs/UDTFs

As for the differences between Hive on Spark and Spark SQL’s Hive support,
please refer to this article by Reynold:
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
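
For example, a minimal sketch of the two cases above through a HiveContext (the
table name and query are made up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("HiveQLExecutionDemo"))
  val hiveContext = new HiveContext(sc)

  // DDL: delegated to Hive; no distributed job is issued
  hiveContext.sql("CREATE TABLE IF NOT EXISTS demo_src (key INT, value STRING)")

  // Query: planned and executed with Spark as the engine
  hiveContext.sql("SELECT key, count(*) FROM demo_src GROUP BY key")
    .collect().foreach(println)
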
Cheng
On 3/14/15 10:53 AM, bit1...@163.com wrote:
Thanks Daoyuan. 
What do you mean by running some native commands? I never thought that Hive would
run without a computing engine like Hadoop MR or Spark. Thanks.



bit1...@163.com
 
From: Wang, Daoyuan
Date: 2015-03-13 16:39
To: bit1...@163.com; user
Subject: RE: Explanation on the Hive in the Spark assembly
Hi bit1129,
 
1. The Hive in the Spark assembly has most of Hive's dependencies on Hadoop removed
to avoid conflicts.
2. This Hive is used to run some native commands, which do not rely on Spark
or MapReduce.
 
Thanks,
Daoyuan
 
From: bit1...@163.com [mailto:bit1...@163.com] 
Sent: Friday, March 13, 2015 4:24 PM
To: user
Subject: Explanation on the Hive in the Spark assembly
 
Hi, sparkers,
 
I am kind of confused about the Hive in the Spark assembly. I think the Hive in the
Spark assembly is not the same thing as Hive on Spark (Hive on Spark is meant
to run Hive using the Spark execution engine).
So, my questions are:
1. What is the difference between the Hive in the Spark assembly and Hive on
Hadoop?
2. Does the Hive in the Spark assembly use the Spark execution engine or the Hadoop
MR engine?
Thanks.
 
 


bit1...@163.com
​


Re: RE: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks, Jerry
Got it, I will do it that way. I just wanted to make sure whether there is an option
to directly specify the Tachyon version.




fightf...@163.com
 
From: Shao, Saisai
Date: 2015-03-16 11:10
To: fightf...@163.com
CC: user
Subject: RE: Building spark over specified tachyon
I think you could change the pom file under Spark project to update the Tachyon 
related dependency version and rebuild it again (in case API is compatible, and 
behavior is the same).
 
I’m not sure is there any command you can use to compile against Tachyon 
version.
 
Thanks
Jerry
 
From: fightf...@163.com [mailto:fightf...@163.com] 
Sent: Monday, March 16, 2015 11:01 AM
To: user
Subject: Building spark over specified tachyon
 
Hi, all
Noting that the current spark releases are built-in with tachyon 0.5.0 ,
if we want to recompile spark with maven and targeting on specific tachyon 
version (let's say the most recent 0.6.0 release),
how should that be done? What maven compile command should be like ?
 
Thanks,
Sun. 
 


fightf...@163.com


Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data?

On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote:

 Hi,

 I am trying to run  LogisticRegressionWithSGD on RDD of LabeledPoints
 loaded using loadLibSVMFile:

 val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
 "s3n://logistic-regression/epsilon_normalized")

 val model = LogisticRegressionWithSGD.train(logistic, 100)

 It gives an input validation error after about 10 minutes:

 org.apache.spark.SparkException: Input validation failed.
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162)
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192)

 From reading this bug report (
 https://issues.apache.org/jira/browse/SPARK-2575) since I am loading
 LibSVM format file there should be only 0/1 in the dataset and should not
 be facing the issue in the bug report. Is there something else I'm missing
 here?

 Thanks!



Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Sparkers,



I am not able to run spark-sql on Spark. Please find the following error:

 Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient


Regards,
Sandeep.v


Re: Slides of my talk in LA: 'Spark or Hadoop: is it an either-or proposition?'

2015-03-15 Thread Slim Baltagi
Hi

The video recording of this talk, titled 'Spark or Hadoop: is it an either-or
proposition?', given at the Los Angeles Spark Users Group on March 12, 2015, is now
available on YouTube at this link: http://goo.gl/0iJZ4n

Thanks

Slim Baltagi
http://www.SparkBigData.com 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Slides-of-my-talk-in-LA-Spark-or-Hadoop-is-it-an-either-or-proposition-tp22061p22069.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread Ted Yu
Can you provide more information?
Such as:
Version of Spark you're using
Command line 

Thanks



 On Mar 15, 2015, at 9:51 PM, sandeep vura sandeepv...@gmail.com wrote:
 
 Hi Sparkers,
 
 
 
 I couldn't able to run spark-sql on spark.Please find the following error 
 
  Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
 
 
 Regards,
 Sandeep.v

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Building spark over specified tachyon

2015-03-15 Thread Shao, Saisai
I think you could change the pom file under the Spark project to update the
Tachyon-related dependency version and rebuild it (assuming the API is compatible
and the behavior is the same).

I'm not sure whether there is any command you can use to compile against a specific
Tachyon version.

Thanks
Jerry

From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Monday, March 16, 2015 11:01 AM
To: user
Subject: Building spark over specified tachyon

Hi, all
Noting that the current spark releases are built-in with tachyon 0.5.0 ,
if we want to recompile spark with maven and targeting on specific tachyon 
version (let's say the most recent 0.6.0 release),
how should that be done? What maven compile command should be like ?

Thanks,
Sun.


fightf...@163.commailto:fightf...@163.com


k-means hang without error/warning

2015-03-15 Thread Xi Shen
Hi,

I am running k-means using Spark in local mode. My data set is about 30k
records, and I set the k = 1000.

The algorithm starts and finishes 13 jobs according to the UI monitor, then
it stops working.

The last log I saw was:

[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned
broadcast *16*

There are many similar log lines repeated, but it seems to always stop at the 16th.

If I lower the *k* value, the algorithm terminates. So I
just want to know what's wrong with *k=1000*.
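
For context, a minimal sketch of the kind of invocation being described (the input
path, parsing, and iteration count below are placeholders, not details from this
report):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // Placeholder input: one space-separated point per line
  val points = sc.textFile("data/points.txt")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  // ~30k records with k = 1000 as described; maxIterations (20) is a placeholder
  val model = KMeans.train(points, 1000, 20)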


Thanks,
David


Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rohit U
I checked the labels across the entire dataset and it looks like it has -1
and 1 (not the 0 and 1 I originally expected). I will try replacing the -1
with 0 and run it again.
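
For what it's worth, a minimal sketch of that remapping (reusing the variable names
from the snippet below; the name binaryLabeled is made up):

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.regression.LabeledPoint

  // Map the -1/+1 labels in the loaded data to the 0/1 labels the trainer expects
  val binaryLabeled = logistic.map { lp =>
    LabeledPoint(if (lp.label == -1.0) 0.0 else lp.label, lp.features)
  }
  val model = LogisticRegressionWithSGD.train(binaryLabeled, 100)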

On Mon, Mar 16, 2015 at 12:51 AM, Rishi Yadav ri...@infoobjects.com wrote:

 Can you share some sample data?

 On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote:

 Hi,

 I am trying to run  LogisticRegressionWithSGD on RDD of LabeledPoints
 loaded using loadLibSVMFile:

 val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
 "s3n://logistic-regression/epsilon_normalized")

 val model = LogisticRegressionWithSGD.train(logistic, 100)

 It gives an input validation error after about 10 minutes:

 org.apache.spark.SparkException: Input validation failed.
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162)
 at
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157)
 at
 org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192)

 From reading this bug report (
 https://issues.apache.org/jira/browse/SPARK-2575) since I am loading
 LibSVM format file there should be only 0/1 in the dataset and should not
 be facing the issue in the bug report. Is there something else I'm missing
 here?

 Thanks!





Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted,

I am using Spark 1.2.1 and Hive 0.13.1; you can check my configuration
files attached below.


ERROR IN SPARK

n: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:101)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:622)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
        ... 9 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
        ... 14 more
Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
        at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:622)
        at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
        at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
        at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
        at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475)
        at

Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Hi, all
Noting that the current Spark releases are built with Tachyon 0.5.0:
if we want to recompile Spark with Maven, targeting a specific Tachyon
version (let's say the most recent 0.6.0 release),
how should that be done? What should the Maven compile command look like?

Thanks,
Sun. 



fightf...@163.com


Re: Trouble launching application that reads files

2015-03-15 Thread robert.tunney
I figured out how to use local files with file://, but not with either the
persistent or the ephemeral HDFS.
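
For reference, a minimal sketch of the two URI forms being contrasted, assuming an
existing SparkContext sc (the paths and the namenode host/port are placeholders, not
values from this setup):

  // Local file on the node's filesystem
  val localLines = sc.textFile("file:///home/ec2-user/data/input.txt")

  // File on the cluster's HDFS (persistent or ephemeral), addressed by namenode URI
  val hdfsLines = sc.textFile("hdfs://namenode-host:9000/data/input.txt")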



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-launching-application-that-reads-files-tp22065p22068.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks haoyuan.




fightf...@163.com
 
From: Haoyuan Li
Date: 2015-03-16 12:59
To: fightf...@163.com
CC: Shao, Saisai; user
Subject: Re: RE: Building spark over specified tachyon
Here is a patch: https://github.com/apache/spark/pull/4867

On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com fightf...@163.com wrote:
Thanks, Jerry
I got that way. Just to make sure whether there can be some option to directly 
specifying tachyon version.




fightf...@163.com
 
From: Shao, Saisai
Date: 2015-03-16 11:10
To: fightf...@163.com
CC: user
Subject: RE: Building spark over specified tachyon
I think you could change the pom file under Spark project to update the Tachyon 
related dependency version and rebuild it again (in case API is compatible, and 
behavior is the same).
 
I’m not sure is there any command you can use to compile against Tachyon 
version.
 
Thanks
Jerry
 
From: fightf...@163.com [mailto:fightf...@163.com] 
Sent: Monday, March 16, 2015 11:01 AM
To: user
Subject: Building spark over specified tachyon
 
Hi, all
Noting that the current spark releases are built-in with tachyon 0.5.0 ,
if we want to recompile spark with maven and targeting on specific tachyon 
version (let's say the most recent 0.6.0 release),
how should that be done? What maven compile command should be like ?
 
Thanks,
Sun. 
 


fightf...@163.com



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: Streaming linear regression example question

2015-03-15 Thread Jeremy Freeman
Hi Margus, thanks for reporting this. I’ve been able to reproduce it, and there
does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, which can
hopefully be included in 1.3.1.

In the meantime, you can get the desired result using transform:

 model.trainOn(trainingData)
 
 testingData.transform { rdd =>
   val latest = model.latestModel()
   rdd.map(lp => (lp.label, latest.predict(lp.features)))
 }.print()

-
jeremyfreeman.net
@thefreemanlab

On Mar 15, 2015, at 2:56 PM, Margus Roo mar...@roo.ee wrote:

 Hi again
 
 I tried the same
 examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
 from 1.3.0, and when the testing file content is:
   (0.0,[3.0,4.0,3.0])
   (0.0,[4.0,4.0,4.0])
   (4.0,[5.0,5.0,5.0])
   (5.0,[5.0,6.0,6.0])
   (6.0,[7.0,4.0,7.0])
   (7.0,[8.0,6.0,8.0])
   (8.0,[44.0,9.0,9.0])
   (9.0,[14.0,30.0,10.0])
 
 and the answer:
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (4.0,0.0)
 (5.0,0.0)
 (6.0,0.0)
 (7.0,0.0)
 (8.0,0.0)
 (9.0,0.0)
 
 What is wrong?
 I can see that the model's weights change when I put new data into the
 training directory.
 Margus (margusja) Roo
 http://margus.roo.ee
 skype: margusja
 +372 51 480
 On 14/03/15 09:05, Margus Roo wrote:
 Hi
 
 I try to understand example provided in 
 https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - Streaming 
 linear regression
 
 Code:
 import org.apache.spark._
 import org.apache.spark.streaming._
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.dstream.DStream
 
 object StreamingLinReg {
 
   def main(args: Array[String]) {
   
 val conf = new 
 SparkConf().setAppName("StreamLinReg").setMaster("local[2]")
 val ssc = new StreamingContext(conf, Seconds(10))
   
 
 val trainingData = 
 ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/training/").map(LabeledPoint.parse).cache()
 
 val testData = 
 ssc.textFileStream("/Users/margusja/Documents/workspace/sparcdemo/testing/").map(LabeledPoint.parse)
 
 val numFeatures = 3
 val model = new 
 StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures))
 
 model.trainOn(trainingData)
 model.predictOnValues(testData.map(lp => (lp.label, 
 lp.features))).print()
 
 ssc.start()
 ssc.awaitTermination()
 
   }
 
 }
 
 Compiled code and run it
 Put file contains
   (1.0,[2.0,2.0,2.0])
   (2.0,[3.0,3.0,3.0])
   (3.0,[4.0,4.0,4.0])
   (4.0,[5.0,5.0,5.0])
   (5.0,[6.0,6.0,6.0])
   (6.0,[7.0,7.0,7.0])
   (7.0,[8.0,8.0,8.0])
   (8.0,[9.0,9.0,9.0])
   (9.0,[10.0,10.0,10.0])
 in to training directory.
 
 I can see that the model's weights change:
 15/03/14 08:53:40 INFO StreamingLinearRegressionWithSGD: Current model: 
 weights, [7.333,7.333,7.333]
 
 Now I can put whatever I want into the testing directory, but I cannot understand
 the answer.
 For example, I can put the same file I used for training into the testing
 directory. The file content is
   (1.0,[2.0,2.0,2.0])
   (2.0,[3.0,3.0,3.0])
   (3.0,[4.0,4.0,4.0])
   (4.0,[5.0,5.0,5.0])
   (5.0,[6.0,6.0,6.0])
   (6.0,[7.0,7.0,7.0])
   (7.0,[8.0,8.0,8.0])
   (8.0,[9.0,9.0,9.0])
   (9.0,[10.0,10.0,10.0])
 
 And answer will be
 (1.0,0.0)
 (2.0,0.0)
 (3.0,0.0)
 (4.0,0.0)
 (5.0,0.0)
 (6.0,0.0)
 (7.0,0.0)
 (8.0,0.0)
 (9.0,0.0)
 
 And in case my file content is
   (0.0,[2.0,2.0,2.0])
   (0.0,[3.0,3.0,3.0])
   (0.0,[4.0,4.0,4.0])
   (0.0,[5.0,5.0,5.0])
   (0.0,[6.0,6.0,6.0])
   (0.0,[7.0,7.0,7.0])
   (0.0,[8.0,8.0,8.0])
   (0.0,[9.0,9.0,9.0])
   (0.0,[10.0,10.0,10.0])
 
 the answer will be:
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 (0.0,0.0)
 
 I expect to get the label predicted by the model.
 -- 
 Margus (margusja) Roo
 http://margus.roo.ee
 skype: margusja
 +372 51 480
 



Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted,

Did you find any solution?

Thanks
Sandeep

On Mon, Mar 16, 2015 at 10:44 AM, sandeep vura sandeepv...@gmail.com
wrote:

 Hi Ted,

 I am using Spark -1.2.1 and hive -0.13.1 you can check my configuration
 files attached below.

 
 ERROR IN SPARK
 
 n: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
         at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:101)
         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:622)
         at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
         at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
         at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
         at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
         at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
         at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
         at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
         ... 9 more
 Caused by: java.lang.reflect.InvocationTargetException
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
         at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
         at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
         ... 14 more
 Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
 NestedThrowables:
 java.lang.reflect.InvocationTargetException
         at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
         at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
         at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
         at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.lang.reflect.Method.invoke(Method.java:622)
         at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
         at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
         at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
         at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
         at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
         at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
         at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
         at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
         at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
         at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
         at