RDD decouple store implementations

2014-11-27 Thread Guy Doulberg
Hi guys

I am playing with Spark, and I was wondering if there is a way to share an RDD 
across multiple store implementations in a decoupled way,

i.e. assuming I have an RDD that comes from a stream in Spark Streaming, I want to 
be able to store the same stream in two different S3 folders using two 
different formatters; let's say one formatter writes JSON files and the other 
writes CSV files. Is that possible?

If it is possible, is it also possible to change the implementation of the CSV 
formatter without stopping the JSON writer?
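
Something like the following sketch is what I have in mind (the record format, bucket 
and paths are made up; the two map functions stand in for the pluggable formatters):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoFormatWriter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwoFormatWriter")
    val ssc  = new StreamingContext(conf, Seconds(60))

    // pretend the stream delivers "id,value" lines
    val records = ssc.socketTextStream("localhost", 9999)
      .map(_.split(",")).map(a => (a(0), a(1)))

    records.foreachRDD { (rdd, time) =>
      // same RDD, two independent "formatters" and two output locations
      rdd.map { case (id, v) => s"""{"id":"$id","value":"$v"}""" }
         .saveAsTextFile(s"s3n://my-bucket/json/${time.milliseconds}")
      rdd.map { case (id, v) => s"$id,$v" }
         .saveAsTextFile(s"s3n://my-bucket/csv/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Swapping the CSV formatter in this sketch would mean redeploying the whole job with a 
different map function, which is why I was hoping for something more decoupled.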



Thanks, Guy Doulberg


Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-27 Thread Sean Owen
Ah of course. Great explanation. So I suppose you should see desired
results with lambda = 0, although you don't generally want to set this
to 0.
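
For reference, the full objective being minimized in the explicit-feedback case is
roughly the following (MLlib may scale the lambda term, e.g. by the number of
ratings per user/item, so take the exact weighting with a grain of salt):

\min_{X,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)

With lambda = 0 this reduces to the pure squared loss over the training set, which
is why the training RMSE should then be non-increasing.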

On Wed, Nov 26, 2014 at 7:53 PM, Xiangrui Meng men...@gmail.com wrote:
 The training RMSE may increase due to regularization. Squared loss
 only represents part of the global loss. If you watch the sum of the
 squared loss and the regularization, it should be non-increasing.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-27 Thread vdiwakar.malladi
Hi,

I set up a Maven environment on a Linux machine and was able to build from the pom file
in the Spark home directory. Each module now has a corresponding target
directory containing its jar files.

What do I need to do in order to include all of these libraries on the classpath?
Earlier, I used a single assembly jar file for the classpath
(built without the hive profile, by running the pom file available in the assembly
folder). But now I only see jar files generated in the individual module
folders. Could you please advise?

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-generate-assembly-jar-which-includes-jdbc-thrift-server-tp19887p19963.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.1.1 released but not available on maven repositories

2014-11-27 Thread Luis Ángel Vicente Sánchez
I have just read on the website that Spark 1.1.1 has been released, but when
I upgraded my project to use 1.1.1 I discovered that the artefacts are not
on Maven Central yet.

[info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...

 [warn] module not found: org.apache.spark#spark-streaming-kafka_2.10;1.1.1

 [warn]  local: tried

 [warn]
 /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.10/1.1.1/ivys/ivy.xml

 [warn]  public: tried

 [warn]
 https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom

 [warn]  sonatype snapshots: tried

 [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom

 [info] Resolving org.apache.spark#spark-core_2.10;1.1.1 ...

 [warn] module not found: org.apache.spark#spark-core_2.10;1.1.1

 [warn]  local: tried

 [warn]
 /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-core_2.10/1.1.1/ivys/ivy.xml

 [warn]  public: tried

 [warn]
 https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom

 [warn]  sonatype snapshots: tried

 [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom

 [info] Resolving org.apache.spark#spark-streaming_2.10;1.1.1 ...

 [warn] module not found: org.apache.spark#spark-streaming_2.10;1.1.1

 [warn]  local: tried

 [warn]
 /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming_2.10/1.1.1/ivys/ivy.xml

 [warn]  public: tried

 [warn]
 https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom

 [warn]  sonatype snapshots: tried

 [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
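
For reference, the dependency declarations my build is trying to resolve are simply
(sbt, Scala 2.10):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.1.1",
  "org.apache.spark" %% "spark-streaming"       % "1.1.1",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.1"
)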




Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-27 Thread Kostas Kloudas
Thanks a lot for your time guys and your quick replies!

 On Nov 26, 2014, at 7:53 PM, Xiangrui Meng men...@gmail.com wrote:
 
 The training RMSE may increase due to regularization. Squared loss
 only represents part of the global loss. If you watch the sum of the
 squared loss and the regularization, it should be non-increasing.
 -Xiangrui
 
 On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen so...@cloudera.com wrote:
 I also modified the example to try 1, 5, 9, ... iterations as you did,
 and also ran with the same default parameters. I used the
 sample_movielens_data.txt file. Is that what you're using?
 
 My result is:
 
 Iteration 1 Test RMSE = 1.426079653593016 Train RMSE = 1.5013155094216357
 Iteration 5 Test RMSE = 1.405598012724468 Train RMSE = 1.4847078708333596
 Iteration 9 Test RMSE = 1.4055990901261632 Train RMSE = 1.484713206769993
 Iteration 13 Test RMSE = 1.4055990999738366 Train RMSE = 1.4847132332994588
 Iteration 17 Test RMSE = 1.40559910003368 Train RMSE = 1.48471323345531
 Iteration 21 Test RMSE = 1.4055991000342158 Train RMSE = 1.4847132334567061
 Iteration 25 Test RMSE = 1.4055991000342174 Train RMSE = 1.4847132334567108
 
 Train error is higher than test error, consistently, which could be
 underfitting. A higher rank=50 gets a reasonable result:
 
 Iteration 1 Test RMSE = 1.5981883186995312 Train RMSE = 1.4841671360432005
 Iteration 5 Test RMSE = 1.5745145659678204 Train RMSE = 1.4672341345080382
 Iteration 9 Test RMSE = 1.5745147110505406 Train RMSE = 1.4672385714907996
 Iteration 13 Test RMSE = 1.5745147108258577 Train RMSE = 1.4672385929631868
 Iteration 17 Test RMSE = 1.5745147108246424 Train RMSE = 1.4672385930428344
 Iteration 21 Test RMSE = 1.5745147108246367 Train RMSE = 1.4672385930431973
 Iteration 25 Test RMSE = 1.5745147108246367 Train RMSE = 1.467238593043199
 
 I'm not sure what the difference is. I looked at your modifications
 and they seem very similar. Is it the data you're using?
 
 
 On Wed, Nov 26, 2014 at 3:34 PM, Kostas Kloudas kklou...@gmail.com wrote:
 For the training I am using the code in the MovieLensALS example with 
 trainImplicit set to false,
 and for the training RMSE I use:
 
 val rmseTr = computeRmse(model, training, params.implicitPrefs)
 
 The computeRmse() method is provided in the MovieLensALS class.
 
 
 Thanks a lot,
 Kostas
 
 
 On Nov 26, 2014, at 2:41 PM, Sean Owen so...@cloudera.com wrote:
 
 How are you computing RMSE?
 and how are you training the model -- not with trainImplicit right?
 I wonder if you are somehow optimizing something besides RMSE.
 
 On Wed, Nov 26, 2014 at 2:36 PM, Kostas Kloudas kklou...@gmail.com wrote:
 Once again, the error even with the training dataset increases. The 
 results
 are:
 
 Running 1 iterations
 For 1 iter.: Test RMSE  = 1.2447121194304893  Training RMSE =
 1.2394166987104076 (34.751317636 s).
 Running 5 iterations
 For 5 iter.: Test RMSE  = 1.3253957117600659  Training RMSE =
 1.3206317416138509 (37.69311802304 s).
 Running 9 iterations
 For 9 iter.: Test RMSE  = 1.3255293380139364  Training RMSE =
 1.3207661218210436 (41.046175661 s).
 Running 13 iterations
 For 13 iter.: Test RMSE  = 1.3255295352665748  Training RMSE =
 1.3207663201865092 (47.763619515 s).
 Running 17 iterations
 For 17 iter.: Test RMSE  = 1.32552953555787  Training RMSE =
 1.3207663204794406 (59.68236110305 s).
 Running 21 iterations
 For 21 iter.: Test RMSE  = 1.3255295355583026  Training RMSE =
 1.3207663204798756 (57.210578232 s).
 Running 25 iterations
 For 25 iter.: Test RMSE  = 1.325529535558303  Training RMSE =
 1.3207663204798765 (65.785485882 s).
 
 Thanks a lot,
 Kostas
 
 On Nov 26, 2014, at 12:04 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 
 copying user group - I keep replying directly vs reply all :)
 
 On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 
 ALS will be guaranteed to decrease the squared error (therefore RMSE) in
 each iteration, on the training set.
 
 This does not hold for the test set / cross validation. You would expect
 the test set RMSE to stabilise as iterations increase, since the 
 algorithm
 converges - but not necessarily to decrease.
 
 On Wed, Nov 26, 2014 at 1:57 PM, Kostas Kloudas kklou...@gmail.com
 wrote:
 
 Hi all,
 
  I am getting familiar with MLlib, and one thing I noticed is that running
  the MovieLensALS example on the MovieLens dataset with an increasing number
  of iterations does not decrease the RMSE.
 
  The results for a 0.6 training / 0.4 test split are below. With the training
  fraction set to 0.8, the results are almost identical. Shouldn't we normally
  see a decreasing error, especially going from 1 to 5 iterations?
 
 Running 1 iterations
 Test RMSE for 1 iter. = 1.2452964343277886 (52.75712592704 s).
 Running 5 iterations
 Test RMSE for 5 iter. = 1.3258973764470259 (61.183927666 s).
 Running 9 iterations
 Test RMSE for 9 iter. = 1.3260308117704385 (61.8494887581 s).
 Running 13 iterations
 Test RMSE for 13 iter. = 

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Gerard Maas
Hi TD,

We also struggled with this error for a long while.  The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: if the DStream storage level is set to MEMORY_ONLY_SER and memory runs
out, then you will get a 'Cannot compute split: Missing block ...'.

What I don't know ATM is whether the new data is dropped or the LRU policy
removes data in the system in favor of the incoming data.
In any case, the DStream processing still thinks the data is there at the
moment the job is scheduled to run, and the job fails to run.

In our case, changing storage to MEMORY_AND_DISK_SER solved the problem
and our streaming job can get through tough times without issues.

Regularly checking 'scheduling delay' and 'total delay' on the Streaming
tab in the UI is a must.  (And soon we will have that on the metrics report
as well!! :-) )
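
For completeness, setting the storage level is a one-liner; a minimal sketch assuming
a receiver-based Kafka stream (the ZooKeeper quorum, group and topic names are made
up, and ssc is your StreamingContext):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// pass the storage level explicitly so received blocks can spill to disk
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)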

-kr, Gerard.



On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi TD,

 I am using Spark Streaming to consume data from Kafka and do some
 aggregation and ingest the results into RDS. I do use foreachRDD in the
 program. I am planning to use Spark streaming in our production pipeline
 and it performs well in generating the results. Unfortunately, we plan to
 have a production pipeline running 24/7, and the Spark Streaming job usually fails after
 8-20 hours due to the exception 'cannot compute split'. In other cases, the
 Kafka receiver fails and the program runs without producing any
 result.

 In my pipeline, the batch size is 1 minute and the data volume per minute
 from Kafka is 3G. I have been struggling with this issue for more than a
 month. It will be great if you can provide some solutions for this. Thanks!

 Bill


 On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

  Can you elaborate on the usage pattern that led to 'cannot compute
  split'? Are you using the RDDs generated by DStream, outside the
 DStream logic? Something like running interactive Spark jobs
 (independent of the Spark Streaming ones) on RDDs generated by
 DStreams? If that is the case, what is happening is that Spark
 Streaming is not aware that some of the RDDs (and the raw input data
 that it will need) will be used by Spark jobs unrelated to Spark
 Streaming. Hence Spark Streaming will actively clear off the raw data,
 leading to failures in the unrelated Spark jobs using that data.

 In case this is your use case, the cleanest way to solve this, is by
 asking Spark Streaming remember stuff for longer, by using
 streamingContext.remember(duration). This will ensure that Spark
 Streaming will keep around all the stuff for at least that duration.
 Hope this helps.
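
  A minimal sketch of that call (the batch interval and retention period are just
  examples; sparkConf is assumed to be defined):

  import org.apache.spark.streaming.{Minutes, StreamingContext}

  val ssc = new StreamingContext(sparkConf, Minutes(1))
  // keep generated RDDs (and the input blocks behind them) around for at
  // least 10 minutes so unrelated Spark jobs can still use them
  ssc.remember(Minutes(10))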

 TD

 On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Just add one more point. If Spark streaming knows when the RDD will not
 be
  used any more, I believe Spark will not try to retrieve data it will
 not use
  any more. However, in practice, I often encounter the error of cannot
  compute split. Based on my understanding, this is  because Spark
 cleared
  out data that will be used again. In my case, the data volume is much
  smaller (30M/s, the batch size is 60 seconds) than the memory (20G each
  executor). If Spark will only keep RDD that are in use, I expect that
 this
  error may not happen.
 
  Bill
 
  On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das 
 tathagata.das1...@gmail.com
  wrote:
 
  Let me further clarify Lalit's point on when RDDs generated by
  DStreams are destroyed, and hopefully that will answer your original
  questions.
 
  1.  How spark (streaming) guarantees that all the actions are taken on
  each input rdd/batch.
   This isn't hard! By the time you call streamingContext.start(), you
  have already set up the output operations (foreachRDD, saveAs***Files,
  etc.) that you want to do with the DStream. There are RDD actions
   inside the DStream output operations that need to be done every batch
   interval. So all the system does is this - after every batch
  interval, put all the output operations (that will call RDD actions)
  in a job queue, and then keep executing stuff in the queue. If there
  is any failure in running the jobs, the streaming context will stop.
 
   2.  How does Spark determine that the life-cycle of an RDD is
   complete? Is there any chance that an RDD will be cleaned out of RAM
   before all actions are taken on it?
   Spark Streaming knows when all the processing related to batch T
   has been completed. It also keeps track of how much of the
   previous RDDs it needs to remember and keep around in the cache,
   based on what DStream operations have been done. For example, if you
   are using a 1-minute window, the system knows that it needs to keep
   around at least the last 1 minute of data in memory. Accordingly, it
   cleans up the input data (actively unpersisted) and cached RDDs
   (simply dereferenced from DStream metadata, and then 

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-27 Thread jatinpreet
Hi,

I have been running into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category. But the predicted category is mostly different from
the prediction done by MLlib.

I am fetching iterators of the pi, theta and testData to do my calculations.
pi and theta are in log space while my testData vector is not; could that
be a problem? I didn't see an explicit conversion in MLlib either.

For example, for two categories and 5 features, I am doing the following
operation,

[1, 2] + [ [1 2 3 4 5], [6 7 8 9 10] ] * [1, 2, 3, 4, 5]
These are simple element-wise matrix multiplication and addition operators.

Following is the code,

Iterator<Tuple2<Object, Object>> piIterator = piValue.iterator();
Iterator<Tuple2<Tuple2<Object, Object>, Object>> thetaIterator = thetaValue.iterator();
Iterator<Tuple2<Object, Object>> testDataIterator = null;

double[] scores = new double[piValue.size()];
while (piIterator.hasNext()) {
    double score = 0.0;
    // reset the feature iterator to index 0 for each class
    testDataIterator = testData.toBreeze().iterator();

    while (testDataIterator.hasNext()) {
        Tuple2<Object, Object> testTuple = testDataIterator.next();
        Tuple2<Tuple2<Object, Object>, Object> thetaTuple = thetaIterator.next();

        // feature value (raw count) * log conditional probability
        score += ((double) testTuple._2) * ((double) thetaTuple._2);
    }

    // add the log prior for this class
    Tuple2<Object, Object> piTuple = piIterator.next();
    score += (double) piTuple._2;
    scores[(int) piTuple._1] = score;
    if (maxScore < score) {
        predictedCategory = (int) piTuple._1;
        maxScore = score;
    }
}


Where am I going wrong?

Thanks,
Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p19968.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Exception while starting thrift server

2014-11-27 Thread vdiwakar.malladi
Hi,

When I'm starting the thrift server, I'm getting the following exception. Could
anyone help me with this?

I placed hive-site.xml in the $SPARK_HOME/conf folder, with the property
hive.metastore.sasl.enabled set to 'false'.

org.apache.hive.service.ServiceException: Unable to login to kerberos with
given principal/keytab
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIService.init(SparkSQLCLIService.scala:55)
at
org.apache.spark.sql.hive.thriftserver.ReflectedCompositeService$$anonfun$initCompositeService$1.apply(SparkSQLCLIService.scala:66)

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-while-starting-thrift-server-tp19970.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[graphx] failed to submit an application with java.lang.ClassNotFoundException

2014-11-27 Thread Yifan LI
Hi,

I just tried to submit an application from the graphx examples directory, but it 
failed:

yifan2:bin yifanli$ MASTER=local[*] ./run-example graphx.PPR_hubs
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

and also,
yifan2:bin yifanli$ ./spark-submit --class 
org.apache.spark.examples.graphx.PPR_hubs 
../examples/target/scala-2.10/spark-examples-1.2.0-SNAPSHOT-hadoop1.0.4.jar
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Does anyone have any pointers on this?



Best,
Yifan LI







Best way to do a lookup in Spark

2014-11-27 Thread Ashic Mahtab
Hi,
I'm looking to do an iterative algorithm implementation with data coming in 
from Cassandra. This might be a use case for GraphX, however the ids are 
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple 
hubs and authorities HITS implementation, and the current implementation does a 
lot of db access. It's fine (one half of a full iteration is done in 25 minutes 
on 3M+ vertices), and use of Spark's cache() has achieved that. However, each 
full iteration is 50 minutes, and I would like to improve that.

A high level overview of what I'm trying to do is:

1) Vertex structure (id, in, out, aScore, hScore).
2) Load all the vertices into memory (simple enough).
3) Have a lookup vertexid -> (aScore, hScore) in memory (currently, this is 
where I need to do a lot of cassandra queries...which are very fast, but hoping 
to avoid).
4) Iterate n times in 2 statges:
In the Hub Stage:
a) Foreach vertex, get the sum of aScores for vertices it points to. Cache 
this.
b) From the cache, get the max score. Divide each score in the cache by the 
max.
c) Get rid of the cache.
d) Update the lookup (in (3)) with the new hScores.
   
In the Authority Stage:
a) Foreach vertex, get the sum of hScores for vertices that point to it. 
Cache this.
b) From the cache, get the max score. Divide each score in the cache by the 
max.
c) Get rid of the cache.
d) Update the lookup (in (3)) with the new aScores.
 
5) Update the final aScores and hScores from memory to Cassandra.

The one bit that I don't have now is the in-memory lookup (i.e. to get the 
hScores and aScores of neighbours in (4-a)). As such, I have to query cassandra 
for each vertex x times, where x is the number of neighbours. And as those 
values are used in the next iteration, I also have to update cassandra for each 
run. Is it possible to have this as an in-memory distributed lookup so that I 
can deal with the data store only at the start and end?

One option is to identify clusters and run HITS for each cluster entirely in 
memory, however if there's a simpler way I'd prefer that.

Regards,
Ashic.
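
For concreteness, the kind of thing I'm imagining for the hub stage, keeping the 
scores as an RDD keyed by vertex id and joining instead of doing per-vertex Cassandra 
lookups (just a sketch; names and types are made up, with String ids):

import org.apache.spark.rdd.RDD

def hubStep(edges: RDD[(String, String)],            // (src, dst)
            scores: RDD[(String, (Double, Double))]  // id -> (aScore, hScore)
           ): RDD[(String, (Double, Double))] = {
  // for every source vertex, sum the aScores of the vertices it points to
  val newH = edges
    .map { case (src, dst) => (dst, src) }            // key by dst to fetch its aScore
    .join(scores)                                     // (dst, (src, (a, h)))
    .map { case (_, (src, (a, _))) => (src, a) }
    .reduceByKey(_ + _)

  // normalise by the max hub score
  val maxH = newH.values.max()
  val normalised = newH.mapValues(_ / maxH)

  // merge the new hScores back into the score table
  scores.leftOuterJoin(normalised).mapValues {
    case ((a, h), maybeH) => (a, maybeH.getOrElse(h))
  }
}

The authority stage would be the mirror image, and Cassandra would only be touched 
when loading the initial scores and writing the final ones.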
  

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-27 Thread Sean Owen
No, the feature vector is not converted. It contains count n_i of how
often each term t_i occurs (or a TF-IDF transformation of those). You
are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
maximized.

In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...

So your n_1 counts (or TF-IDF values) are used as-is and this is where
the dot product comes from.

Your bug is probably something lower-level and simple. I'd debug the
Spark example and print exactly its values for the log priors and
conditional probabilities, and the matrix operations, and yours too,
and see where the difference is.
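
To make the arithmetic concrete, here is a minimal sketch of that computation in
Breeze (the numbers are made up, and this mirrors the pi + theta * testData
calculation rather than being MLlib's internal code):

import breeze.linalg.{DenseMatrix, DenseVector, argmax}

val brzPi    = DenseVector(math.log(0.6), math.log(0.4))   // log class priors
val brzTheta = DenseMatrix(                                 // log P(t_i | c), 2 x 5
  (-1.0, -1.2, -2.0, -1.8, -0.5),
  (-0.9, -1.5, -1.1, -2.2, -1.3))
val x = DenseVector(1.0, 2.0, 3.0, 4.0, 5.0)                // raw counts, NOT in log space

val scores    = brzPi + brzTheta * x     // per-class log score
val predicted = argmax(scores)           // index of the winning class

If your Java translation disagrees with this on the same inputs, the bug is probably
in the iteration/indexing rather than in the log-space math.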

On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet jatinpr...@gmail.com wrote:
 Hi,

 I have been running through some troubles while converting the code to Java.
 I have done the matrix operations as directed and tried to find the maximum
 score for each category. But the predicted category is mostly different from
 the prediction done by MLlib.

 I am fetching iterators of the pi, theta and testData to do my calculations.
 pi and theta are in  log space while my testData vector is not, could that
 be a problem because I didn't see explicit conversion in Mllib also?

 For example, for two categories and 5 features, I am doing the following
 operation,

  [1, 2] + [ [1 2 3 4 5], [6 7 8 9 10] ] * [1, 2, 3, 4, 5]
 These are simple element-wise matrix multiplication and addition operators.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
Hi Hao,

I'm using inner join as Broadcast join didn't work for left joins (thanks
for the links for the latest improvements).

And I'm using HiveContext, and it worked in a previous build (10/12) when
joining 15 dimension tables.

Jianshi

On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Are all of your join keys the same? And I guess the join types are all
 “Left” joins; https://github.com/apache/spark/pull/3362 probably is what
 you need.



 And, SparkSQL doesn’t support the multiway-join (and multiway-broadcast
 join) currently,  https://github.com/apache/spark/pull/3270 should be
 another optimization for this.





 *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
 *Sent:* Wednesday, November 26, 2014 4:36 PM
 *To:* user
 *Subject:* Auto BroadcastJoin optimization failed in latest Spark



 Hi,



 I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails
 optimizing auto broadcast join in my query. I have a query that joins a
 huge fact table with 15 tiny dimension tables.



 I'm currently using an older version of Spark which was built on Oct. 12.



 Anyone else has met similar situation?



 --

 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Mesos killing Spark Driver

2014-11-27 Thread Gerard Maas
Hi,

We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
We see with some regularity that the Spark Streaming driver gets killed by
Mesos and then restarted on some other node by Marathon.

I've no clue why Mesos is killing the driver and looking at both the Mesos
and Spark logs didn't make me any wiser.

On the Spark Streaming driver logs, I find this entry of Mesos signing
off my driver:

Shutting down
Sending SIGTERM to process tree at pid 17845
Killing the following process trees:
[
-+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
 \-+- 17846 sh ./run-mesos.sh application-ts.conf
   \--- 17847 java -cp core-compute-job.jar
-Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
]
Command terminated with signal Terminated (pid: 17845)


Has anybody seen something similar? Any hints on where to start digging?

-kr, Gerard.


Percentile

2014-11-27 Thread Franco Barrientos
Hi folks!,

 

Does anyone know how I can calculate, for each element of a variable in an RDD,
its percentile? I tried to calculate it through Spark SQL with subqueries, but I
think that is impossible in Spark SQL. Any idea will be welcome.
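
For example, in plain RDD terms I was hoping for something along these lines, unless
there is a better way (just a sketch, run in the spark-shell where sc is available;
ties are not handled):

val values = sc.parallelize(Seq(10.0, 3.5, 7.2, 1.0, 9.9))

val n = values.count()
val withPercentile = values
  .sortBy(identity)      // ascending order
  .zipWithIndex()        // (value, 0-based rank)
  .map { case (v, rank) => (v, 100.0 * rank / math.max(n - 1, 1L)) }

withPercentile.collect().foreach { case (v, p) => println(s"$v -> $p%") }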

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com

 



Using Breeze in the Scala Shell

2014-11-27 Thread Dean Jones
Hi,

I'm trying to use the breeze library in the spark scala shell, but I'm
running into the same problem documented here:
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html

As I'm using the shell, I don't have a pom.xml, so the solution
suggested in that thread doesn't work for me. I've tried the
following:

- adding commons-math3 using the --jars option
- adding both breeze and commons-math3 using the --jar option
- using the spark.executor.extraClassPath option on the cmd line as
follows: --conf spark.executor.extraClassPath=commons-math3-3.2.jar

None of these are working for me. Any thoughts on how I can get this working?

Thanks,

Dean.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Breeze in the Scala Shell

2014-11-27 Thread Debasish Das
I have used breeze fine with scala shell:

scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.jar:/Users/v606014/.m2/repository/com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar:/Users/v606014/.m2/repository/org/jblas/jblas/1.2.3/jblas-1.2.3.jar:/Users/v606014/.m2/repository/org/scalanlp/breeze_2.10/0.10/breeze_2.10-0.10.jar:/Users/v606014/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/Users/v606014/.m2/repository/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.jar org.apache.spark.mllib.optimization.QuadraticMinimizer 100 1 1.0 0.99

For spark-shell, my assumption is that the -cp option should work fine.

On Thu, Nov 27, 2014 at 9:15 AM, Dean Jones dean.m.jo...@gmail.com wrote:

 Hi,

 I'm trying to use the breeze library in the spark scala shell, but I'm
 running into the same problem documented here:

 http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html

 As I'm using the shell, I don't have a pom.xml, so the solution
 suggested in that thread doesn't work for me. I've tried the
 following:

 - adding commons-math3 using the --jars option
 - adding both breeze and commons-math3 using the --jar option
 - using the spark.executor.extraClassPath option on the cmd line as
 follows: --conf spark.executor.extraClassPath=commons-math3-3.2.jar

 None of these are working for me. Any thoughts on how I can get this
 working?

 Thanks,

 Dean.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




ALS failure with size Integer.MAX_VALUE

2014-11-27 Thread Bharath Ravi Kumar
We're training a recommender with ALS in MLlib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 billion (~30GB of data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data blocks = number of item data blocks}. The number of user/item blocks
was varied between 50 and 1200. Irrespective of the number of blocks (e.g. at
1200 blocks each), there are at least a couple of tasks that end up shuffle
reading > 9.7G each in the aggregate stage (ALS.scala:337) and failing with
the following exception:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108)
at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124)
at
org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332)
at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204)




As for the data, on the user side, the degree of a node in the connectivity
graph is relatively small. However, on the item side, 3.8K out of the 4.5K
items are connected to 10^5 users each on average, with 100 items being
connected to nearly 10^8 users. The rest of the items are connected to fewer
than 10^5 users. With such a skew in the connectivity graph, I'm unsure whether
additional memory or variation in the block sizes would help (considering
my limited understanding of the implementation in mllib). Any suggestions to
address the problem?


The test is being run on a standalone cluster of 3 hosts, each with 100G
RAM and 24 cores dedicated to the application. The additional configs I set
specific to the shuffle and task failure reduction are as follows:

spark.core.connection.ack.wait.timeout=600
spark.shuffle.consolidateFiles=true
spark.shuffle.manager=SORT


The job execution summary is as follows:

Active Stages:

Stage id 2,  aggregate at ALS.scala:337, duration 55 min, Tasks 1197/1200
(3 failed), Shuffle Read :  141.6 GB

Completed Stages (5):

Stage Id | Description                                     | Duration | Tasks: Succeeded/Total | Input   | Shuffle Read | Shuffle Write
6        | org.apache.spark.rdd.RDD.flatMap(RDD.scala:277) | 12 min   | 1200/1200              | 29.9 GB | 1668.4 MB    | 186.8 GB
5        | mapPartitionsWithIndex at ALS.scala:250
3        | map at ALS.scala:231
0        | aggregate at ALS.scala:337
1        | map at ALS.scala:228


Thanks,
Bharath


Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
Gerard,

That is a good observation. However, the strange thing I see is that if I use
MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes 10
seconds to process the data of each batch, and the batch interval is one minute.
It fails after 10 hours with the 'cannot compute split' error.

Bill

On Thu, Nov 27, 2014 at 3:31 AM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi TD,

 We also struggled with this error for a long while.  The recurring
 scenario is when the job takes longer to compute than the job interval and
 a backlog starts to pile up.
  Hint: if the DStream storage level is set to MEMORY_ONLY_SER and memory runs
  out, then you will get a 'Cannot compute split: Missing block ...'.

 What I don't know ATM is whether the new data is dropped or the LRU policy
  removes data in the system in favor of the incoming data.
 In any case, the DStream processing still thinks the data is there at the
 moment the job is scheduled to run and fails to run.

  In our case, changing storage to MEMORY_AND_DISK_SER solved the problem
  and our streaming job can get through tough times without issues.

 Regularly checking 'scheduling delay' and 'total delay' on the Streaming
 tab in the UI is a must.  (And soon we will have that on the metrics report
 as well!! :-) )

 -kr, Gerard.



 On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi TD,

 I am using Spark Streaming to consume data from Kafka and do some
 aggregation and ingest the results into RDS. I do use foreachRDD in the
 program. I am planning to use Spark streaming in our production pipeline
 and it performs well in generating the results. Unfortunately, we plan to
 have a production pipeline 24/7 and Spark streaming job usually fails after
 8-20 hours due to the exception cannot compute split. In other cases, the
 Kafka receiver has failure and the program runs without producing any
 result.

 In my pipeline, the batch size is 1 minute and the data volume per minute
 from Kafka is 3G. I have been struggling with this issue for more than a
 month. It will be great if you can provide some solutions for this. Thanks!

 Bill


 On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 Can you elaborate on the usage pattern that lead to cannot compute
 split ? Are you using the RDDs generated by DStream, outside the
 DStream logic? Something like running interactive Spark jobs
 (independent of the Spark Streaming ones) on RDDs generated by
 DStreams? If that is the case, what is happening is that Spark
 Streaming is not aware that some of the RDDs (and the raw input data
 that it will need) will be used by Spark jobs unrelated to Spark
 Streaming. Hence Spark Streaming will actively clear off the raw data,
 leading to failures in the unrelated Spark jobs using that data.

 In case this is your use case, the cleanest way to solve this, is by
 asking Spark Streaming remember stuff for longer, by using
 streamingContext.remember(duration). This will ensure that Spark
 Streaming will keep around all the stuff for at least that duration.
 Hope this helps.

 TD

 On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Just add one more point. If Spark streaming knows when the RDD will
 not be
  used any more, I believe Spark will not try to retrieve data it will
 not use
  any more. However, in practice, I often encounter the error of cannot
  compute split. Based on my understanding, this is  because Spark
 cleared
  out data that will be used again. In my case, the data volume is much
  smaller (30M/s, the batch size is 60 seconds) than the memory (20G each
  executor). If Spark will only keep RDD that are in use, I expect that
 this
  error may not happen.
 
  Bill
 
  On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das 
 tathagata.das1...@gmail.com
  wrote:
 
  Let me further clarify Lalit's point on when RDDs generated by
  DStreams are destroyed, and hopefully that will answer your original
  questions.
 
  1.  How spark (streaming) guarantees that all the actions are taken on
  each input rdd/batch.
  This isn't hard! By the time you call streamingContext.start(), you
  have already set up the output operations (foreachRDD, saveAs***Files,
  etc.) that you want to do with the DStream. There are RDD actions
  inside the DStream output operations that need to be done every batch
  interval. So all the system does is this - after every batch
  interval, put all the output operations (that will call RDD actions)
  in a job queue, and then keep executing stuff in the queue. If there
  is any failure in running the jobs, the streaming context will stop.
 
  2.  How does spark determines that the life-cycle of a rdd is
  complete. Is there any chance that a RDD will be cleaned out of ram
  before all actions are taken on them?
  Spark Streaming knows when all the processing related to batch T
  has been completed. And also it keeps track of how much time of the
  previous RDDs does 

Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-27 Thread Kelly, Jonathan
Yeah, only a few hours after I sent my message I saw some correspondence on 
this other thread: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html,
 which is the exact same issue.  Glad to find that this should be fixed in 
1.2.0!  I'll give that a try later.

Thanks a lot,
Jonathan

From: Yin Huai huaiyin@gmail.com
Date: Thursday, November 27, 2014 at 4:37 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded 
from a JSON file using schema auto-detection

Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive 
table. Hive does not have the notion of containsNull for array values. So, 
for a Hive table, the containsNull will always be true for an array, and we 
should ignore this field for Hive. This issue has been fixed by 
https://issues.apache.org/jira/browse/SPARK-4245, which will be released with 
1.2.

Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan 
jonat...@amazon.com wrote:
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the
schema auto-determined by SchemaRDD.jsonFile() will have element: integer
(containsNull = true), and then
SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course
makes sense but doesn't really help).
2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json",
StructType(Seq(StructField("values", ArrayType(IntegerType, true),
true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto()
work, though as I mentioned before, this is less than ideal.

Why don't saveAsTable/insertInto work when the containsNull properties
don't match?  I can understand how inserting data with containsNull=true
into a column where containsNull=false might fail, but I think the other
way around (which is the case here) should work.

~ Jonathan


On 11/26/14, 5:23 PM, Kelly, Jonathan 
jonat...@amazon.com wrote:

I've noticed some strange behavior when I try to use
SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
that contains elements with nested arrays.  For example, with a file
test.json that contains the single line:

   {"values":[1,2,3]}

and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it.  Here's
the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
   at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
   at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
   at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
   at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
   at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
   at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
   at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of
these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)


scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Tathagata Das
If it regularly fails after 8 hours, then could you get me the log4j logs?
To limit the size, set the default log level to WARN and the log level for
all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.
On Nov 27, 2014 11:01 AM, Bill Jay bill.jaypeter...@gmail.com wrote:

 Gerard,

 That is a good observation. However, the strange thing I meet is if I use
 MEMORY_AND_DISK_SER, the job even fails earlier. In my case, it takes 10
 seconds to process my data of every batch, which is one minute. It fails
 after 10 hours with the cannot compute split error.

 Bill

 On Thu, Nov 27, 2014 at 3:31 AM, Gerard Maas gerard.m...@gmail.com
 wrote:

 Hi TD,

 We also struggled with this error for a long while.  The recurring
 scenario is when the job takes longer to compute than the job interval and
 a backlog starts to pile up.
  Hint: if the DStream storage level is set to MEMORY_ONLY_SER and memory runs
  out, then you will get a 'Cannot compute split: Missing block ...'.

 What I don't know ATM is whether the new data is dropped or the LRU
  policy removes data in the system in favor of the incoming data.
 In any case, the DStream processing still thinks the data is there at the
 moment the job is scheduled to run and fails to run.

  In our case, changing storage to MEMORY_AND_DISK_SER solved the
  problem and our streaming job can get through tough times without issues.

 Regularly checking 'scheduling delay' and 'total delay' on the Streaming
 tab in the UI is a must.  (And soon we will have that on the metrics report
 as well!! :-) )

 -kr, Gerard.



 On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi TD,

 I am using Spark Streaming to consume data from Kafka and do some
 aggregation and ingest the results into RDS. I do use foreachRDD in the
 program. I am planning to use Spark streaming in our production pipeline
 and it performs well in generating the results. Unfortunately, we plan to
 have a production pipeline 24/7 and Spark streaming job usually fails after
 8-20 hours due to the exception cannot compute split. In other cases, the
 Kafka receiver has failure and the program runs without producing any
 result.

 In my pipeline, the batch size is 1 minute and the data volume per
 minute from Kafka is 3G. I have been struggling with this issue for more
 than a month. It will be great if you can provide some solutions for this.
 Thanks!

 Bill


 On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 Can you elaborate on the usage pattern that lead to cannot compute
 split ? Are you using the RDDs generated by DStream, outside the
 DStream logic? Something like running interactive Spark jobs
 (independent of the Spark Streaming ones) on RDDs generated by
 DStreams? If that is the case, what is happening is that Spark
 Streaming is not aware that some of the RDDs (and the raw input data
 that it will need) will be used by Spark jobs unrelated to Spark
 Streaming. Hence Spark Streaming will actively clear off the raw data,
 leading to failures in the unrelated Spark jobs using that data.

 In case this is your use case, the cleanest way to solve this, is by
 asking Spark Streaming remember stuff for longer, by using
 streamingContext.remember(duration). This will ensure that Spark
 Streaming will keep around all the stuff for at least that duration.
 Hope this helps.

 TD

 On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Just add one more point. If Spark streaming knows when the RDD will
 not be
  used any more, I believe Spark will not try to retrieve data it will
 not use
  any more. However, in practice, I often encounter the error of cannot
  compute split. Based on my understanding, this is  because Spark
 cleared
  out data that will be used again. In my case, the data volume is much
  smaller (30M/s, the batch size is 60 seconds) than the memory (20G
 each
  executor). If Spark will only keep RDD that are in use, I expect that
 this
  error may not happen.
 
  Bill
 
  On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das 
 tathagata.das1...@gmail.com
  wrote:
 
  Let me further clarify Lalit's point on when RDDs generated by
  DStreams are destroyed, and hopefully that will answer your original
  questions.
 
  1.  How spark (streaming) guarantees that all the actions are taken
 on
  each input rdd/batch.
  This isn't hard! By the time you call streamingContext.start(), you
  have already set up the output operations (foreachRDD,
 saveAs***Files,
  etc.) that you want to do with the DStream. There are RDD actions
  inside the DStream output operations that need to be done every batch
  interval. So all the system does is this - after every batch
  interval, put all the output operations (that will call RDD actions)
  in a job queue, and then keep executing stuff in the queue. If there
  is any failure in running the jobs, the streaming context will stop.
 
  2.  How does spark determines 

Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-27 Thread Harihar Nahak
Thanks Ankur, it's really helpful. I have a few queries on optimization
techniques. For the current run I used the RandomVertexCut partition strategy.

But what partition strategy should be used if:
1. The no. of edges in the edge list file is very large, like 50,000,000, and
there are many parallel edges between the same pairs of vertices
2. The no. of unique vertices is very large, suppose 10,000,000, in the above
edge list file
3. The no. of unique vertices is small, suppose less than 100,000, in the above
edge list file
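
For context, I'm currently loading and partitioning roughly like this (a sketch; the
path is made up), and wondering which PartitionStrategy to pass for each of the cases
above:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// load ~50M edges from an edge list file
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edgeList.txt")

// currently RandomVertexCut; EdgePartition2D is another candidate since it
// bounds how many partitions each vertex can be replicated to
val partitioned = graph.partitionBy(PartitionStrategy.RandomVertexCut)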





On 27 November 2014 at 20:23, ankurdave [via Apache Spark User List] 
ml-node+s1001560n1995...@n3.nabble.com wrote:

 At 2014-11-24 19:02:08 -0800, Harihar Nahak [hidden email] wrote:

  According to documentation GraphX runs 10x faster than normal Spark. So
 I
  run Page Rank algorithm in both the applications:
  [...]
  Local Mode (Machine : 8 Core; 16 GB memory; 2.80 Ghz Intel i7; Executor
  Memory: 4Gb, No. of Partition: 50; No. of Iterations: 2);   ==
 
  *Spark Page Rank took - 21.29 mins
  GraphX Page Rank took - 42.01 mins *
 
  Cluster Mode (ubantu 12.4; spark 1.1/hadoop 2.4 cluster ; 3 workers , 1
  driver , 8 cores, 30 gb memory) (Executor memory 4gb; No. of edge
 partitions
  : 50, random vertex cut ; no. of iteration : 2) =
 
  *Spark Page Rank took - 10.54 mins
  GraphX Page Rank took - 7.54 mins *
 
  Could you please help me determine when to use Spark vs. GraphX? If
  GraphX takes the same amount of time as Spark, then it's better to use Spark,
  because Spark has a variety of operators to deal with any type of RDD.

 If you have a problem that's naturally expressible as a graph computation,
 it makes sense to use GraphX in my opinion. In addition to the
 optimizations that GraphX incorporates which you would otherwise have to
 implement manually, GraphX's programming model is likely a better fit. But
 even if you start off by using pure Spark, you'll still have the
 flexibility to use GraphX for other parts of the problem since it's part of
 the same system.

 To address the benchmark results you got:

 1. GraphX takes more time than Spark to load the graph, because it has to
 index it, but subsequent iterations should be faster. We benchmarked with
 20 iterations to show this effect, but you only used 2 iterations, which
 doesn't give much time to amortize the loading cost.

 2. The benchmarks in the GraphX OSDI paper are against a naive
 implementation of PageRank in Spark, while the version you benchmarked
 against has some of the same optimizations as GraphX does. I believe we
 found that the optimized Spark PageRank was only 3x slower than GraphX.

 3. When running those benchmarks, we used an experimental version of Spark
 with in-memory shuffle, which disproportionately benefits GraphX since its
 shuffle files are smaller due to specialized compression.

 4. We haven't optimized GraphX for local mode, so it's not surprising that
 it's slower there.

 Ankur

 -
  To unsubscribe, e-mail: [hidden email]
  For additional commands, e-mail: [hidden email]



 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p19956.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Is Spark? or GraphX runs fast? a performance
 comparison on Page Rank, click here




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019




-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p19986.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Harihar Nahak
When new data comes in on a stream, Spark uses the stream classes to
convert it into an RDD and, as you mention, this is followed by transformations
and finally actions. As far as I have experienced, all RDDs remain in memory
until the user destroys them or the application ends.


On 26 November 2014 at 20:05, Mukesh Jha [via Apache Spark User List] 
ml-node+s1001560n19835...@n3.nabble.com wrote:

 Any pointers guys?

  On Tue, Nov 25, 2014 at 5:32 PM, Mukesh Jha [hidden email] wrote:

 Hey Experts,

 I wanted to understand in detail about the lifecycle of rdd(s) in a
 streaming app.

 From my current understanding
 - rdd gets created out of the realtime input stream.
 - Transform(s) functions are applied in a lazy fashion on the RDD to
 transform into another rdd(s).
 - Actions are taken on the final transformed rdds to get the data out of
 the system.

 Also rdd(s) are stored in the clusters RAM (disc if configured so) and
 are cleaned in LRU fashion.

 So I have the following questions on the same.
 - How spark (streaming) guarantees that all the actions are taken on each
 input rdd/batch.
 - How does spark determines that the life-cycle of a rdd is complete. Is
 there any chance that a RDD will be cleaned out of ram before all actions
 are taken on them?

  Thanks in advance for all your help. Also, I'm relatively new to Scala and
  Spark, so pardon me in case these are naive questions/assumptions.

 --
 Thanks  Regards,

  *[hidden email]*




 --


 Thanks  Regards,

  *[hidden email]*


 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Lifecycle-of-RDD-in-spark-streaming-tp19749p19835.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Apache Spark User List, click here




-- 
Regards,
Harihar Nahak
BigData Developer
Wynyard
Email:hna...@wynyardgroup.com | Extn: 8019




-
--Harihar
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Lifecycle-of-RDD-in-spark-streaming-tp19749p19987.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

SVD Plus Plus in GraphX

2014-11-27 Thread Deep Pradhan
Hi,
I was just going through two pieces of code in GraphX, namely SVDPlusPlus and
TriangleCount. In the first I see an RDD as an input to run, i.e. run(edges:
RDD[Edge[Double]], ...), and in the other I see run(VD: ..., ED: ...).
Can anyone explain the difference between the two? In fact, SVDPlusPlus
is the only GraphX code in Spark 1.0.0 in which I have seen an RDD as an input.
Could anyone please explain?

Thank you


RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Cheng, Hao
Hi Jianshi,
I couldn’t reproduce that with the latest master, and I can always get the 
BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there 
any external tables in your case?

In general probably couple of things you can try first (with HiveContext):

1)  ANALYZE TABLE xxx COMPUTE STATISTICS NOSCAN; (apply that to all of the 
tables);

2)  SET spark.sql.autoBroadcastJoinThreshold=xxx; (Set the threshold as a 
greater value, it is 1024*1024*10 by default, just make sure the maximum 
dimension tables size (in bytes) is less than this)

3)  Always put the main table(the biggest table) in the left-most among the 
inner joins;

DESC EXTENDED tablename; -- this will print the detailed information for the 
table statistics, including its size (the field “totalSize”)
EXPLAIN EXTENDED query; -- this will print the detailed physical plan.

Let me know if you still have problem.

Hao

From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Thursday, November 27, 2014 10:24 PM
To: Cheng, Hao
Cc: user
Subject: Re: Auto BroadcastJoin optimization failed in latest Spark

Hi Hao,

I'm using inner join as Broadcast join didn't work for left joins (thanks for 
the links for the latest improvements).

And I'm using HiveContext and it worked in a previous build (10/12) when joining 
15 dimension tables.

Jianshi

On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao 
hao.ch...@intel.com wrote:
Are all of your join keys the same? and I guess the join type are all “Left” 
join, https://github.com/apache/spark/pull/3362 probably is what you need.

And, SparkSQL doesn’t support the multiway-join (and multiway-broadcast join) 
currently,  https://github.com/apache/spark/pull/3270 should be another 
optimization for this.


From: Jianshi Huang 
[mailto:jianshi.hu...@gmail.com]
Sent: Wednesday, November 26, 2014 4:36 PM
To: user
Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi,

I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails 
optimizing auto broadcast join in my query. I have a query that joins a huge 
fact table with 15 tiny dimension tables.

I'm currently using an older version of Spark which was built on Oct. 12.

Anyone else has met similar situation?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/



--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Creating a SchemaRDD from an existing API

2014-11-27 Thread Niranda Perera
Hi,

I am evaluating Spark for an analytic component where we do batch
processing of data using SQL.

So, I am particularly interested in Spark SQL and in creating a SchemaRDD
from an existing API [1].

This API exposes elements in a database as datasources. Using the methods
allowed by this data source, we can access and edit data.

So, I want to create a custom SchemaRDD using the methods and provisions of
this API. I tried going through the Spark documentation and the Java docs, but
unfortunately I was unable to reach a conclusion on whether this is
actually possible.

I would like to ask the Spark Devs,
1. As of the current Spark release, can we make a custom SchemaRDD?
2. What is the extension point for a custom SchemaRDD, or are there
particular interfaces?
3. Could you please point me the specific docs regarding this matter?

Your help in this regard is highly appreciated.
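
To make question 2 concrete, the extension point I have been looking at is the
external data sources API (the org.apache.spark.sql.sources package) that seems to
be arriving around Spark 1.2; the sketch below is only my understanding of its shape
(written against the later trait-style API, with a dummy schema and data), not
something I have verified against a release:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}

class XAnalyticsRelation(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // columns exposed by the external datasource API
  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType, nullable = false)))

  // pull rows out of the external API and expose them as an RDD[Row]
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
}

Is something along these lines the intended way to plug in an external datasource,
or is there a different/better extension point for a custom SchemaRDD?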

Cheers

[1]
https://github.com/wso2-dev/carbon-analytics/tree/master/components/xanalytics

-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44


Re: read both local path and HDFS path

2014-11-27 Thread Prannoy
Hi,

The configuration you provide is just to access the HDFS when you give an
HDFS path. When you provide a HDFS path with the HDFS nameservice, like in
your case hmaster155:9000 it goes inside the HDFS to look for the file. For
accessing local file just give the local path of the file. Go to the file
in the local and do a pwd. This will give you the full path of the file.
Just give that path as your local path for the file and you will do good.
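
For example (the local path is illustrative; note that with file:// the file must
exist at that path on every worker node):

// read explicitly from the local filesystem
val localRdd = sc.textFile("file:///home/user/ad-cpc/2014-11-28/part-00000")

// read explicitly from HDFS, regardless of the configured default filesystem
val hdfsRdd = sc.textFile("hdfs://hmaster155:9000/ad-cpc/2014-11-28/")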

Thanks.

On Fri, Nov 28, 2014 at 8:57 AM, tuyuri [via Apache Spark User List] 
ml-node+s1001560n19990...@n3.nabble.com wrote:


 I have set up a Spark cluster configured with HDFS, and I know that by default
 file paths will be read by Spark from HDFS. For example:

 /ad-cpc/2014-11-28/ will be read by Spark as:
 hdfs://hmaster155:9000/ad-cpc/2014-11-28/

 Sometimes I wonder how I can force Spark to read a local file without
 reconfiguring my cluster (to not use HDFS).

 Please help me!

 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/read-both-local-path-and-HDFS-path-tp19990.html
  To start a new topic under Apache Spark User List, email
 ml-node+s1001560n1...@n3.nabble.com
 To unsubscribe from Apache Spark User List, click here





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/read-both-local-path-and-HDFS-path-tp19990p19995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Unable to compile spark 1.1.0 on windows 8.1

2014-11-27 Thread Ishwardeep Singh
Hi,

I am trying to compile Spark 1.1.0 on Windows 8.1, but I get the following
exception.

[info] Compiling 3 Scala sources to
D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26:
object sbt is not a member of package com.typesafe
[error] import com.typesafe.sbt.pom.{PomBuild, SbtPomKeys}
[error] ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:53: not
found: type PomBuild
[error] object SparkBuild extends PomBuild {
[error]   ^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:121:
not found: value SbtPomKeys
[error] otherResolvers = SbtPomKeys.mvnLocalRepository(dotM2 =
Seq(Resolver.file(dotM2, dotM2))),
[error]^
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:165:
value projectDefinitions is not a member of AnyRef
[error] super.projectDefinitions(baseDirectory).map { x =
[error]   ^
[error] four errors found
[error] (plugins/compile:compile) Compilation failed

I have also set up Scala 2.10.

I need help to resolve this issue.

Regards,
Ishwardeep 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-compile-spark-1-1-0-on-windows-8-1-tp19996.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org