RDD decouple store implementations
Hi guys, I am playing with Spark, and I was wondering if there is a way to share an RDD across multiple implementations in a decoupled way. For example, assuming I have an RDD that comes from a stream in Spark Streaming, I want to be able to store the same stream in two different S3 folders using two different formatters, say one formatter that writes a JSON file and another that writes a CSV file. Is that possible? And if it is, is it also possible to change the implementation of the CSV formatter without stopping the JSON file writer? Thanks, Guy Doulberg
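One way to get this decoupling (a sketch only — Event, Formatter and the paths below are names invented here, not a Spark API) is to treat each formatter as an independent function from record to String, and attach each one to the stream separately with its own foreachRDD, so either writer can be changed without touching the other:

```scala
// Sketch of decoupled formatters. In Spark Streaming each formatter would be
// attached as its own output operation, e.g.:
//   stream.foreachRDD(rdd => rdd.map(JsonFormatter.format).saveAsTextFile("s3n://bucket/json/..."))
//   stream.foreachRDD(rdd => rdd.map(CsvFormatter.format).saveAsTextFile("s3n://bucket/csv/..."))

case class Event(id: Int, name: String)

// Serializable so the closures can ship to executors.
trait Formatter extends Serializable {
  def format(e: Event): String
}

object JsonFormatter extends Formatter {
  def format(e: Event): String = s"""{"id":${e.id},"name":"${e.name}"}"""
}

object CsvFormatter extends Formatter {
  def format(e: Event): String = s"${e.id},${e.name}"
}
```

Because the two output operations are independent, swapping the CSV implementation only requires redeploying the application that registers it; if the two writers run as two separate streaming applications consuming the same source, the JSON writer keeps running untouched.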
Re: RMSE in MovieLensALS increases or stays stable as iterations increase.
Ah of course. Great explanation. So I suppose you should see desired results with lambda = 0, although you don't generally want to set this to 0. On Wed, Nov 26, 2014 at 7:53 PM, Xiangrui Meng men...@gmail.com wrote: The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
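Xiangrui's point can be checked numerically: what ALS minimizes is the squared error plus the L2 penalty on the factor matrices, so with lambda > 0 the squared loss alone may rise between iterations while the total objective still falls. A minimal sketch of that quantity (names and shapes are illustrative, not MLlib's internals):

```scala
object RegularizedLoss {
  // Sum of squared errors over (prediction, rating) pairs.
  def squaredError(preds: Seq[(Double, Double)]): Double =
    preds.map { case (p, r) => (p - r) * (p - r) }.sum

  // L2 penalty over all user/item factor vectors.
  def l2Penalty(factors: Seq[Array[Double]]): Double =
    factors.map(_.map(x => x * x).sum).sum

  // The quantity that should be non-increasing across ALS iterations.
  def objective(preds: Seq[(Double, Double)],
                factors: Seq[Array[Double]],
                lambda: Double): Double =
    squaredError(preds) + lambda * l2Penalty(factors)
}
```

With lambda = 0 the objective reduces to the squared error, which is why training RMSE should then decrease monotonically.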
Re: Unable to generate assembly jar which includes jdbc-thrift server
Hi, I set up a Maven environment on a Linux machine and was able to build the pom file in the Spark home directory. Each module was refreshed with a corresponding target directory containing jar files. What do I need to do to include all the libraries on the classpath? Earlier, I used a single assembly jar file on the classpath (without the hive profile, built by running the pom file available in the assembly folder). But now I see jar files generated in the individual module folders. Could you please advise? Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-generate-assembly-jar-which-includes-jdbc-thrift-server-tp19887p19963.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
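If the goal is to get back a single assembly jar that also includes the thrift server, building from the Spark source root with the hive profiles should regenerate it (profile names are from the Spark 1.1/1.2-era build docs; check them against your version's pom):

```shell
# From the Spark source root; -Phive-thriftserver pulls the JDBC/Thrift server
# into the assembly. The single jar then lands under assembly/target/scala-2.10/
# and can go on the classpath instead of the per-module jars.
mvn -Phive -Phive-thriftserver -DskipTests clean package
```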
Spark 1.1.1 released but not available on maven repositories
I have just read on the website that Spark 1.1.1 has been released, but when I upgraded my project to use 1.1.1 I discovered that the artefacts are not on Maven yet.
[info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
[warn] module not found: org.apache.spark#spark-streaming-kafka_2.10;1.1.1
[warn] local: tried
[warn] /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.10/1.1.1/ivys/ivy.xml
[warn] public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
[warn] sonatype snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming-kafka_2.10/1.1.1/spark-streaming-kafka_2.10-1.1.1.pom
[info] Resolving org.apache.spark#spark-core_2.10;1.1.1 ...
[warn] module not found: org.apache.spark#spark-core_2.10;1.1.1
[warn] local: tried
[warn] /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-core_2.10/1.1.1/ivys/ivy.xml
[warn] public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
[warn] sonatype snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-core_2.10/1.1.1/spark-core_2.10-1.1.1.pom
[info] Resolving org.apache.spark#spark-streaming_2.10;1.1.1 ...
[warn] module not found: org.apache.spark#spark-streaming_2.10;1.1.1
[warn] local: tried
[warn] /Users/luis.vicente/.ivy2/local/org.apache.spark/spark-streaming_2.10/1.1.1/ivys/ivy.xml
[warn] public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
[warn] sonatype snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/org/apache/spark/spark-streaming_2.10/1.1.1/spark-streaming_2.10-1.1.1.pom
Re: RMSE in MovieLensALS increases or stays stable as iterations increase.
Thanks a lot for your time guys and your quick replies! On Nov 26, 2014, at 7:53 PM, Xiangrui Meng men...@gmail.com wrote: The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the regularization, it should be non-increasing. -Xiangrui On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen so...@cloudera.com wrote: I also modified the example to try 1, 5, 9, ... iterations as you did, and also ran with the same default parameters. I used the sample_movielens_data.txt file. Is that what you're using? My result is: Iteration 1 Test RMSE = 1.426079653593016 Train RMSE = 1.5013155094216357 Iteration 5 Test RMSE = 1.405598012724468 Train RMSE = 1.4847078708333596 Iteration 9 Test RMSE = 1.4055990901261632 Train RMSE = 1.484713206769993 Iteration 13 Test RMSE = 1.4055990999738366 Train RMSE = 1.4847132332994588 Iteration 17 Test RMSE = 1.40559910003368 Train RMSE = 1.48471323345531 Iteration 21 Test RMSE = 1.4055991000342158 Train RMSE = 1.4847132334567061 Iteration 25 Test RMSE = 1.4055991000342174 Train RMSE = 1.4847132334567108 Train error is higher than test error, consistently, which could be underfitting. A higher rank=50 gets a reasonable result: Iteration 1 Test RMSE = 1.5981883186995312 Train RMSE = 1.4841671360432005 Iteration 5 Test RMSE = 1.5745145659678204 Train RMSE = 1.4672341345080382 Iteration 9 Test RMSE = 1.5745147110505406 Train RMSE = 1.4672385714907996 Iteration 13 Test RMSE = 1.5745147108258577 Train RMSE = 1.4672385929631868 Iteration 17 Test RMSE = 1.5745147108246424 Train RMSE = 1.4672385930428344 Iteration 21 Test RMSE = 1.5745147108246367 Train RMSE = 1.4672385930431973 Iteration 25 Test RMSE = 1.5745147108246367 Train RMSE = 1.467238593043199 I'm not sure what the difference is. I looked at your modifications and they seem very similar. Is it the data you're using? 
On Wed, Nov 26, 2014 at 3:34 PM, Kostas Kloudas kklou...@gmail.com wrote: For the training I am using the code in the MovieLensALS example with trainImplicit set to false and for the training RMSE I use the val rmseTr = computeRmse(model, training, params.implicitPrefs). The computeRmse() method is provided in the MovieLensALS class. Thanks a lot, Kostas On Nov 26, 2014, at 2:41 PM, Sean Owen so...@cloudera.com wrote: How are you computing RMSE? and how are you training the model -- not with trainImplicit right? I wonder if you are somehow optimizing something besides RMSE. On Wed, Nov 26, 2014 at 2:36 PM, Kostas Kloudas kklou...@gmail.com wrote: Once again, the error even with the training dataset increases. The results are: Running 1 iterations For 1 iter.: Test RMSE = 1.2447121194304893 Training RMSE = 1.2394166987104076 (34.751317636 s). Running 5 iterations For 5 iter.: Test RMSE = 1.3253957117600659 Training RMSE = 1.3206317416138509 (37.69311802304 s). Running 9 iterations For 9 iter.: Test RMSE = 1.3255293380139364 Training RMSE = 1.3207661218210436 (41.046175661 s). Running 13 iterations For 13 iter.: Test RMSE = 1.3255295352665748 Training RMSE = 1.3207663201865092 (47.763619515 s). Running 17 iterations For 17 iter.: Test RMSE = 1.32552953555787 Training RMSE = 1.3207663204794406 (59.68236110305 s). Running 21 iterations For 21 iter.: Test RMSE = 1.3255295355583026 Training RMSE = 1.3207663204798756 (57.210578232 s). Running 25 iterations For 25 iter.: Test RMSE = 1.325529535558303 Training RMSE = 1.3207663204798765 (65.785485882 s). Thanks a lot, Kostas On Nov 26, 2014, at 12:04 PM, Nick Pentreath nick.pentre...@gmail.com wrote: copying user group - I keep replying directly vs reply all :) On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com wrote: ALS will be guaranteed to decrease the squared error (therefore RMSE) in each iteration, on the training set. This does not hold for the test set / cross validation. 
You would expect the test set RMSE to stabilise as iterations increase, since the algorithm converges - but not necessarily to decrease. On Wed, Nov 26, 2014 at 1:57 PM, Kostas Kloudas kklou...@gmail.com wrote: Hi all, I am getting familiarized with Mllib and a thing I noticed is that running the MovieLensALS example on the movieLens dataset for increasing number of iterations does not decrease the rmse. The results for 0.6% training set and 0.4% test are below. For training set to 0.8%, the results are almost identical. Shouldn’t it be normal to see a decreasing error? Especially going from 1 to 5 iterations. Running 1 iterations Test RMSE for 1 iter. = 1.2452964343277886 (52.75712592704 s). Running 5 iterations Test RMSE for 5 iter. = 1.3258973764470259 (61.183927666 s). Running 9 iterations Test RMSE for 9 iter. = 1.3260308117704385 (61.8494887581 s). Running 13 iterations Test RMSE for 13 iter. =
Re: Lifecycle of RDD in spark-streaming
Hi TD, We also struggled with this error for a long while. The recurring scenario is when the job takes longer to compute than the job interval and a backlog starts to pile up. Hint: check whether the DStream storage level is set to MEMORY_ONLY_SER; if memory runs out, you will get a 'Cannot compute split: Missing block ...'. What I don't know ATM is whether the new data is dropped or the LRU policy removes data already in the system in favor of the incoming data. In any case, the DStream processing still thinks the data is there at the moment the job is scheduled to run, and the job fails to run. In our case, changing storage to MEMORY_AND_DISK_SER solved the problem and our streaming job can get through tough times without issues. Regularly checking 'scheduling delay' and 'total delay' on the Streaming tab in the UI is a must. (And soon we will have that on the metrics report as well!! :-) ) -kr, Gerard. On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi TD, I am using Spark Streaming to consume data from Kafka, do some aggregation, and ingest the results into RDS. I do use foreachRDD in the program. I am planning to use Spark Streaming in our production pipeline and it performs well in generating the results. Unfortunately, we plan to have a production pipeline running 24/7, and the Spark Streaming job usually fails after 8-20 hours due to the exception "cannot compute split". In other cases, the Kafka receiver fails and the program runs without producing any result. In my pipeline, the batch size is 1 minute and the data volume per minute from Kafka is 3G. I have been struggling with this issue for more than a month. It will be great if you can provide some solutions for this. Thanks! Bill On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you elaborate on the usage pattern that led to "cannot compute split"? Are you using the RDDs generated by DStream, outside the DStream logic?
Something like running interactive Spark jobs (independent of the Spark Streaming ones) on RDDs generated by DStreams? If that is the case, what is happening is that Spark Streaming is not aware that some of the RDDs (and the raw input data that they will need) will be used by Spark jobs unrelated to Spark Streaming. Hence Spark Streaming will actively clear out the raw data, leading to failures in the unrelated Spark jobs using that data. In case this is your use case, the cleanest way to solve this is by asking Spark Streaming to remember stuff for longer, by using streamingContext.remember(duration). This will ensure that Spark Streaming will keep around all the stuff for at least that duration. Hope this helps. TD On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Just to add one more point: if Spark Streaming knows when an RDD will not be used any more, I believe Spark will not try to retrieve data it will not use any more. However, in practice, I often encounter the "cannot compute split" error. Based on my understanding, this is because Spark cleared out data that will be used again. In my case, the data volume is much smaller (30M/s, with a batch size of 60 seconds) than the memory (20G per executor). If Spark only kept RDDs that are in use, I would expect this error not to happen. Bill On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Let me further clarify Lalit's point on when RDDs generated by DStreams are destroyed, and hopefully that will answer your original questions. 1. How does Spark (Streaming) guarantee that all the actions are taken on each input RDD/batch? This isn't hard! By the time you call streamingContext.start(), you have already set up the output operations (foreachRDD, saveAs***Files, etc.) that you want to do with the DStream. There are RDD actions inside the DStream output operations that need to be done every batch interval.
So all the system does is this: after every batch interval, put all the output operations (which will call RDD actions) in a job queue, and then keep executing stuff in the queue. If there is any failure in running the jobs, the streaming context will stop. 2. How does Spark determine that the life-cycle of an RDD is complete? Is there any chance that an RDD will be cleaned out of RAM before all actions are taken on it? Spark Streaming knows when all the processing related to batch T has been completed. It also keeps track of how much of the previous RDDs it needs to remember and keep around in the cache, based on what DStream operations have been done. For example, if you are using a window of 1 minute, the system knows that it needs to keep around at least the last 1 minute of data in memory. Accordingly, it cleans up the input data (actively unpersisted), and cached RDD (simply dereferenced from DStream metadata, and then
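For reference, the two remedies discussed in this thread look roughly like this against the Spark Streaming API (a configuration sketch, not runnable standalone; the Kafka parameters are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("streaming-lifecycle")
val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches

// Keep generated RDDs (and their input blocks) for at least 5 minutes,
// so Spark jobs outside the DStream logic can still use them.
ssc.remember(Minutes(5))

// Receive with MEMORY_AND_DISK_SER so a backlog spills to disk instead of
// losing blocks ("Cannot compute split: Missing block ...").
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "consumer-group", Map("topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)
```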
Re: Accessing posterior probability of Naive Baye's prediction
Hi, I have been running into some trouble while converting the code to Java. I have done the matrix operations as directed and tried to find the maximum score for each category. But the predicted category is mostly different from the prediction done by MLlib. I am fetching iterators of the pi, theta and testData to do my calculations. pi and theta are in log space while my testData vector is not; could that be a problem? I didn't see an explicit conversion in MLlib either. For example, for two categories and 5 features, I am doing the following operation:

[1, 2] + [[1 2 3 4 5], [6 7 8 9 10]] * [1, 2, 3, 4, 5]

These are simple element-wise matrix multiplication and addition operators. Following is the code:

Iterator<Tuple2<Object, Object>> piIterator = piValue.iterator();
Iterator<Tuple2<Tuple2<Object, Object>, Object>> thetaIterator = thetaValue.iterator();
Iterator<Tuple2<Object, Object>> testDataIterator = null;
double[] scores = new double[piValue.size()];
while (piIterator.hasNext()) {
    double score = 0.0;
    // reset to index 0
    testDataIterator = testData.toBreeze().iterator();
    while (testDataIterator.hasNext()) {
        Tuple2<Object, Object> testTuple = testDataIterator.next();
        Tuple2<Tuple2<Object, Object>, Object> thetaTuple = thetaIterator.next();
        score += ((double) testTuple._2 * (double) thetaTuple._2);
    }
    Tuple2<Object, Object> piTuple = piIterator.next();
    score += (double) piTuple._2;
    scores[(int) piTuple._1] = score;
    if (maxScore < score) {
        predictedCategory = (int) piTuple._1;
        maxScore = score;
    }
}

Where am I going wrong? Thanks, Jatin - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828p19968.html
Exception while starting thrift server
Hi, When I start the thrift server, I get the following exception. Could anyone help me with this? I placed hive-site.xml in the $SPARK_HOME/conf folder and set the property hive.metastore.sasl.enabled to 'false'.

org.apache.hive.service.ServiceException: Unable to login to kerberos with given principal/keytab
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIService.init(SparkSQLCLIService.scala:55)
    at org.apache.spark.sql.hive.thriftserver.ReflectedCompositeService$$anonfun$initCompositeService$1.apply(SparkSQLCLIService.scala:66)

Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-while-starting-thrift-server-tp19970.html
[graphx] failed to submit an application with java.lang.ClassNotFoundException
Hi, I just tried to submit an application from the graphx examples directory, but it failed:

yifan2:bin yifanli$ MASTER=local[*] ./run-example graphx.PPR_hubs
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

and also:

yifan2:bin yifanli$ ./spark-submit --class org.apache.spark.examples.graphx.PPR_hubs ../examples/target/scala-2.10/spark-examples-1.2.0-SNAPSHOT-hadoop1.0.4.jar
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Does anyone have any pointers on this? Best, Yifan LI
Best way to do a lookup in Spark
Hi, I'm looking to do an iterative algorithm implementation with data coming in from Cassandra. This might be a use case for GraphX; however, the ids are non-integral, and I would like to avoid a mapping (for now). I'm doing a simple hubs and authorities HITS implementation, and the current implementation does a lot of db access. It's fine (one half of a full iteration is done in 25 minutes on 3M+ vertices), and use of Spark's cache() has achieved that. However, each full iteration is 50 minutes, and I would like to improve that. A high level overview of what I'm trying to do is:
1) Vertex structure (id, in, out, aScore, hScore).
2) Load all the vertices into memory (simple enough).
3) Have a lookup vertexid -> (aScore, hScore) in memory (currently, this is where I need to do a lot of cassandra queries... which are very fast, but hoping to avoid).
4) Iterate n times in 2 stages:
In the Hub Stage:
a) For each vertex, get the sum of aScores for vertices it points to. Cache this.
b) From the cache, get the max score. Divide each score in the cache by the max.
c) Get rid of the cache.
d) Update the lookup (in (3)) with the new hScores.
In the Authority Stage:
a) For each vertex, get the sum of hScores for vertices that point to it. Cache this.
b) From the cache, get the max score. Divide each score in the cache by the max.
c) Get rid of the cache.
d) Update the lookup (in (3)) with the new aScores.
5) Update the final aScores and hScores from memory to Cassandra.
The one bit that I don't have now is the in-memory lookup (i.e. to get the hScores and aScores of neighbours in (4-a)). As such, I have to query cassandra for each vertex x times where x is the number of neighbours. And as those values are used in the next iteration, I also have to update cassandra for each run. Is it possible to have this as an in-memory distributed lookup so that I can deal with the data store at the start and end?
One option is to identify clusters and run HITS for each cluster entirely in memory; however, if there's a simpler way, I'd prefer that. Regards, Ashic.
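One way to avoid the per-neighbour Cassandra reads is to keep the vertexid -> score lookup as a plain map rebuilt once per stage. A sketch of one hub step in plain Scala (Vertex and the field names follow the structure in (1); in Spark the map could be a broadcast variable, or a join against a scores RDD if it does not fit in memory — this is an illustration, not a drop-in implementation):

```scala
object HitsSketch {
  case class Vertex(id: String, out: Seq[String], in: Seq[String])

  // Hub stage: sum the aScores of out-neighbours from the in-memory lookup,
  // then normalize by the max, mirroring steps 4a-4d without any db access.
  def hubStep(vs: Seq[Vertex], aScores: Map[String, Double]): Map[String, Double] = {
    val raw = vs.map(v => v.id -> v.out.map(o => aScores.getOrElse(o, 0.0)).sum).toMap
    val max = raw.values.max
    if (max == 0.0) raw else raw.map { case (id, s) => id -> s / max }
  }
}
```

The authority stage is the mirror image over `in`; with both lookups kept in memory between iterations, Cassandra only needs to be touched at load (2) and final save (5).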
Re: Accessing posterior probability of Naive Baye's prediction
No, the feature vector is not converted. It contains the count n_i of how often each term t_i occurs (or a TF-IDF transformation of those). You are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is maximized. In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ... So your n_i counts (or TF-IDF values) are used as-is, and this is where the dot product comes from. Your bug is probably something lower-level and simple. I'd debug the Spark example and print exactly its values for the log priors and conditional probabilities, and the matrix operations, and yours too, and see where the difference is. On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I have been running into some trouble while converting the code to Java. I have done the matrix operations as directed and tried to find the maximum score for each category. But the predicted category is mostly different from the prediction done by MLlib. I am fetching iterators of the pi, theta and testData to do my calculations. pi and theta are in log space while my testData vector is not; could that be a problem? I didn't see an explicit conversion in MLlib either. For example, for two categories and 5 features, I am doing the following operation: [1, 2] + [[1 2 3 4 5], [6 7 8 9 10]] * [1, 2, 3, 4, 5] These are simple element-wise matrix multiplication and addition operators.
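The scoring rule Sean describes is small enough to spell out. A minimal sketch (the array layout is illustrative, not MLlib's internal representation; pi and theta are assumed to already be in log space, and the feature counts are used unconverted):

```scala
object NaiveBayesScore {
  // Returns the class index maximizing log P(c) + sum_i n_i * log P(t_i | c).
  def predict(logPi: Array[Double],
              logTheta: Array[Array[Double]],
              features: Array[Double]): Int = {
    val scores = logPi.indices.map { c =>
      // dot product of the raw counts with the log conditional probabilities
      logPi(c) + logTheta(c).zip(features).map { case (lt, n) => lt * n }.sum
    }
    scores.indexOf(scores.max)
  }
}
```

Comparing the per-class scores from a sketch like this against MLlib's pi/theta values is one way to localize where the Java port diverges.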
Re: Auto BroadcastJoin optimization failed in latest Spark
Hi Hao, I'm using an inner join, as broadcast join didn't work for left joins (thanks for the links to the latest improvements). And I'm using HiveContext, and it worked in a previous build (10/12) when joining 15 dimension tables. Jianshi On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao hao.ch...@intel.com wrote: Are all of your join keys the same? And I guess the join types are all "Left" joins; https://github.com/apache/spark/pull/3362 probably is what you need. Also, SparkSQL doesn't support multiway-join (and multiway-broadcast join) currently; https://github.com/apache/spark/pull/3270 should be another optimization for this. From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Wednesday, November 26, 2014 4:36 PM To: user Subject: Auto BroadcastJoin optimization failed in latest Spark Hi, I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails to optimize auto broadcast join in my query. I have a query that joins a huge fact table with 15 tiny dimension tables. I'm currently using an older version of Spark which was built on Oct. 12. Has anyone else met a similar situation? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Mesos killing Spark Driver
Hi, We are currently running our Spark + Spark Streaming jobs on Mesos, submitting our jobs through Marathon. We see with some regularity that the Spark Streaming driver gets killed by Mesos and then restarted on some other node by Marathon. I've no clue why Mesos is killing the driver, and looking at both the Mesos and Spark logs didn't make me any wiser. In the Spark Streaming driver logs, I find this entry of Mesos signing off my driver:

Shutting down
Sending SIGTERM to process tree at pid 17845
Killing the following process trees:
[ -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
   \-+- 17846 sh ./run-mesos.sh application-ts.conf
     \--- 17847 java -cp core-compute-job.jar -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326 ]
Command terminated with signal Terminated (pid: 17845)

Has anybody seen something similar? Any hints on where to start digging? -kr, Gerard.
Percentile
Hi folks! Does anyone know how I can calculate, for each element of a variable in an RDD, its percentile? I tried to calculate it through Spark SQL with subqueries, but I think that is impossible in Spark SQL. Any idea will be welcome. Thanks in advance, Franco Barrientos Data Scientist Málaga #115, Of. 1003, Las Condes. Santiago, Chile. (+562)-29699649 (+569)-76347893 franco.barrien...@exalitica.com http://www.exalitica.com/
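Without subqueries, the percentile of every element can be computed by sorting and ranking. A sketch of the idea in plain Scala, assuming distinct values; with an RDD the same shape would be sortBy(identity) followed by zipWithIndex() and dividing the index by count - 1:

```scala
object Percentiles {
  // Maps each value to its percentile rank in [0, 1]: rank / (n - 1),
  // where rank is the position in sorted order (assumes distinct values).
  def percentileRanks(xs: Seq[Double]): Map[Double, Double] = {
    val n = xs.size
    xs.sorted.zipWithIndex.map { case (v, i) => v -> i.toDouble / (n - 1) }.toMap
  }
}
```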
Using Breeze in the Scala Shell
Hi, I'm trying to use the breeze library in the spark scala shell, but I'm running into the same problem documented here: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html As I'm using the shell, I don't have a pom.xml, so the solution suggested in that thread doesn't work for me. I've tried the following:
- adding commons-math3 using the --jars option
- adding both breeze and commons-math3 using the --jars option
- using the spark.executor.extraClassPath option on the cmd line as follows: --conf spark.executor.extraClassPath=commons-math3-3.2.jar
None of these are working for me. Any thoughts on how I can get this working? Thanks, Dean.
Re: Using Breeze in the Scala Shell
I have used breeze fine with the scala shell:

scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.jar:/Users/v606014/.m2/repository/com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar:/Users/v606014/.m2/repository/org/jblas/jblas/1.2.3/jblas-1.2.3.jar:/Users/v606014/.m2/repository/org/scalanlp/breeze_2.10/0.10/breeze_2.10-0.10.jar:/Users/v606014/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/Users/v606014/.m2/repository/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.jar org.apache.spark.mllib.optimization.QuadraticMinimizer 100 1 1.0 0.99

For spark-shell, my assumption is that the spark-shell -cp option should work fine. On Thu, Nov 27, 2014 at 9:15 AM, Dean Jones dean.m.jo...@gmail.com wrote: Hi, I'm trying to use the breeze library in the spark scala shell, but I'm running into the same problem documented here: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html As I'm using the shell, I don't have a pom.xml, so the solution suggested in that thread doesn't work for me. I've tried the following: - adding commons-math3 using the --jars option - adding both breeze and commons-math3 using the --jar option - using the spark.executor.extraClassPath option on the cmd line as follows: --conf spark.executor.extraClassPath=commons-math3-3.2.jar None of these are working for me. Any thoughts on how I can get this working? Thanks, Dean.
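For spark-shell (as opposed to the plain scala REPL above), the equivalent would be to list both jars in a single --jars argument, which ships them to the driver and the executors together (paths below are placeholders):

```shell
# Both breeze and its commons-math3 dependency in one comma-separated --jars
# list; spark.executor.extraClassPath alone does not upload the jar to executors.
./bin/spark-shell --jars /path/to/breeze_2.10-0.10.jar,/path/to/commons-math3-3.2.jar
```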
ALS failure with size Integer.MAX_VALUE
We're training a recommender with ALS in MLlib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 billion (~30GB of data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data blocks = number of item data blocks}. The number of user/item blocks was varied between 50 and 1200. Irrespective of the block count (e.g. at 1200 blocks each), there are at least a couple of tasks that end up shuffle-reading 9.7G each in the aggregate stage (ALS.scala:337) and failing with the following exception:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108)
    at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124)
    at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332)
    at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204)

As for the data, on the user side, the degree of a node in the connectivity graph is relatively small. However, on the item side, 3.8K out of the 4.5K items are connected to 10^5 users each on average, with 100 items being connected to nearly 10^8 users. The rest of the items are connected to fewer than 10^5 users. With such a skew in the connectivity graph, I'm unsure if additional memory or variation in the block sizes would help (considering my limited understanding of the implementation in MLlib). Any suggestion to address the problem? The test is being run on a standalone cluster of 3 hosts, each with 100G RAM and 24 cores dedicated to the application.
The additional configs I made specific to the shuffle and task failure reduction are as follows:
spark.core.connection.ack.wait.timeout=600
spark.shuffle.consolidateFiles=true
spark.shuffle.manager=SORT
The job execution summary is as follows:
Active Stages:
Stage Id 2 | aggregate at ALS.scala:337 | duration 55 min | Tasks 1197/1200 (3 failed) | Shuffle Read: 141.6 GB
Completed Stages (5):
Stage Id | Description | Duration | Tasks: Succeeded/Total | Input | Shuffle Read | Shuffle Write
6 | org.apache.spark.rdd.RDD.flatMap(RDD.scala:277) | 12 min | 1200/1200 | 29.9 GB | 1668.4 MB | 186.8 GB
5 | mapPartitionsWithIndex at ALS.scala:250
3 | map at ALS.scala:231
0 | aggregate at ALS.scala:337
1 | map at ALS.scala:228
Thanks, Bharath
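For context on the failure: a single shuffle/disk block is read through a memory-mapped buffer, which is capped at Integer.MAX_VALUE (~2GB) bytes, so any one block over 2GB throws exactly this. The usual mitigation is more, smaller blocks. A sketch against MLlib 1.1's ALS builder (the number is illustrative, and with 100 items connected to ~10^8 users the item-side skew may still concentrate data in a few blocks):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings: RDD[Rating] with the 1.2B training records
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.01)
  .setBlocks(2400) // more blocks => smaller per-block shuffle reads
  .run(ratings)
```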
Re: Lifecycle of RDD in spark-streaming
Gerard, That is a good observation. However, the strange thing is that if I use MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes 10 seconds to process each batch of data, and the batch interval is one minute. It fails after 10 hours with the "cannot compute split" error. Bill On Thu, Nov 27, 2014 at 3:31 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi TD, We also struggled with this error for a long while. The recurring scenario is when the job takes longer to compute than the job interval and a backlog starts to pile up. Hint: check whether the DStream storage level is set to MEMORY_ONLY_SER; if memory runs out, you will get a 'Cannot compute split: Missing block ...'. What I don't know ATM is whether the new data is dropped or the LRU policy removes data already in the system in favor of the incoming data. In any case, the DStream processing still thinks the data is there at the moment the job is scheduled to run, and the job fails to run. In our case, changing storage to MEMORY_AND_DISK_SER solved the problem and our streaming job can get through tough times without issues. Regularly checking 'scheduling delay' and 'total delay' on the Streaming tab in the UI is a must. (And soon we will have that on the metrics report as well!! :-) ) -kr, Gerard. On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi TD, I am using Spark Streaming to consume data from Kafka, do some aggregation, and ingest the results into RDS. I do use foreachRDD in the program. I am planning to use Spark Streaming in our production pipeline and it performs well in generating the results. Unfortunately, we plan to have a production pipeline running 24/7, and the Spark Streaming job usually fails after 8-20 hours due to the exception "cannot compute split". In other cases, the Kafka receiver fails and the program runs without producing any result. In my pipeline, the batch size is 1 minute and the data volume per minute from Kafka is 3G.
I have been struggling with this issue for more than a month. It will be great if you can provide some solutions for this. Thanks! Bill

On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you elaborate on the usage pattern that leads to "cannot compute split"? Are you using the RDDs generated by a DStream outside the DStream logic? Something like running interactive Spark jobs (independent of the Spark Streaming ones) on RDDs generated by DStreams? If that is the case, what is happening is that Spark Streaming is not aware that some of the RDDs (and the raw input data that they will need) will be used by Spark jobs unrelated to Spark Streaming. Hence Spark Streaming will actively clear off the raw data, leading to failures in the unrelated Spark jobs using that data. If this is your use case, the cleanest way to solve it is to ask Spark Streaming to remember the data for longer, by using streamingContext.remember(duration). This will ensure that Spark Streaming keeps everything around for at least that duration. Hope this helps. TD

On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Just to add one more point: if Spark Streaming knows when an RDD will not be used any more, I believe Spark will not try to retrieve data it will not use again. However, in practice, I often encounter the "cannot compute split" error. Based on my understanding, this is because Spark cleared out data that would be used again. In my case, the data volume is much smaller (30M/s, with a batch size of 60 seconds) than the memory (20G per executor).
How does Spark (Streaming) guarantee that all the actions are taken on each input RDD/batch? This isn't hard! By the time you call streamingContext.start(), you have already set up the output operations (foreachRDD, saveAs***Files, etc.) that you want to apply to the DStream. There are RDD actions inside the DStream output operations that need to be run every batch interval. So all the system does is this: after every batch interval, it puts all the output operations (which will call RDD actions) in a job queue, and then keeps executing the jobs in the queue. If there is any failure in running the jobs, the streaming context will stop. 2. How does Spark determine that the life-cycle of an RDD is complete? Is there any chance that an RDD will be cleaned out of RAM before all actions are taken on it? Spark Streaming knows when all the processing related to batch T has been completed. And it also keeps track of how much time of the previous RDDs does
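The two fixes discussed in this thread — Gerard's MEMORY_AND_DISK_SER storage level and TD's streamingContext.remember(duration) — can be sketched together as follows (a minimal outline; the host, port, and remember duration are made-up values, and the 60-second batch interval follows Bill's setup):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))  // 1-minute batches, as in Bill's pipeline

// Gerard's fix: spill serialized blocks to disk under memory pressure
// instead of dropping them (which causes "Cannot compute split: Missing block").
val stream = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)

// TD's fix for RDDs used outside the DStream logic: keep generated RDDs
// (and the input data they need) around for at least this duration.
ssc.remember(Minutes(5))
```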
Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection
Yeah, only a few hours after I sent my message I saw some correspondence on this other thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html, which is the exact same issue. Glad to find that this should be fixed in 1.2.0! I'll give that a try later. Thanks a lot, Jonathan

From: Yin Huai huaiyin@gmail.com Date: Thursday, November 27, 2014 at 4:37 PM To: Jonathan Kelly jonat...@amazon.com Cc: user@spark.apache.org Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

Hello Jonathan, There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of containsNull for array values, so for a Hive table containsNull will always be true for an array, and we should ignore this field for Hive. This issue has been fixed by https://issues.apache.org/jira/browse/SPARK-4245, which will be released with 1.2. Thanks, Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan jonat...@amazon.com wrote: After playing around with this a little more, I discovered that: 1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by SchemaRDD.jsonFile() will have element: integer (containsNull = true), and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course makes sense but doesn't really help). 2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json", StructType(Seq(StructField("values", ArrayType(IntegerType, true), true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto() work, though as I mentioned before, this is less than ideal. Why don't saveAsTable/insertInto work when the containsNull properties don't match?
I can understand how inserting data with containsNull=true into a column where containsNull=false might fail, but I think the other way around (which is the case here) should work. ~ Jonathan

On 11/26/14, 5:23 PM, Kelly, Jonathan jonat...@amazon.com wrote: I've noticed some strange behavior when I try to use SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file that contains elements with nested arrays. For example, with a file test.json that contains the single line: {"values":[1,2,3]} and with code like the following:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val test = sqlContext.jsonFile("test.json")
scala> test.saveAsTable("test")

it creates the table but fails when inserting the data into it. Here's the exception:

scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
    at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I'm guessing that this is due to the slight difference in the schemas of these tables:

scala> test.printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = false)

scala> sqlContext.table("test").printSchema
root
 |-- values: array (nullable = true)
 |    |-- element: integer (containsNull = true)

If I reload the file using the schema that was created for the Hive table
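Until the SPARK-4245 fix ships in 1.2, the workaround Jonathan describes — forcing containsNull = true by supplying the schema yourself — looks roughly like this (a sketch against the 1.1-era API; the file and table names follow the example in this thread):

```scala
import org.apache.spark.sql.catalyst.types.{ArrayType, IntegerType, StructField, StructType}
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Force containsNull = true on the array element type so the SchemaRDD's
// schema matches what Hive reports for the table.
val schema = StructType(Seq(
  StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))

val test = sqlContext.jsonFile("test.json", schema)
test.saveAsTable("test")
```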
Re: Lifecycle of RDD in spark-streaming
If it regularly fails after 8 hours, could you get me the log4j logs? To limit the size, set the default log level to WARN and the log level for all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.

On Nov 27, 2014 11:01 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Gerard, That is a good observation. However, the strange thing is that if I use MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes 10 seconds to process each batch of data, and the batch interval is one minute. It fails after 10 hours with the "cannot compute split" error. Bill

On Thu, Nov 27, 2014 at 3:31 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi TD, We also struggled with this error for a long while. The recurring scenario is when the job takes longer to compute than the job interval and a backlog starts to pile up. Hint: if the DStream storage level is set to MEMORY_ONLY_SER and memory runs out, then you will get a 'Cannot compute split: Missing block ...'. What I don't know ATM is whether the new data is dropped or the LRU policy removes data in the system in favor of the incoming data. In any case, the DStream processing still thinks the data is there at the moment the job is scheduled to run, and fails to run. In our case, changing storage to MEMORY_AND_DISK_SER solved the problem and our streaming job can get through tough times without issues. Regularly checking 'scheduling delay' and 'total delay' on the Streaming tab in the UI is a must. (And soon we will have that on the metrics report as well! :-) ) -kr, Gerard.

On Thu, Nov 27, 2014 at 8:14 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi TD, I am using Spark Streaming to consume data from Kafka, do some aggregation, and ingest the results into RDS. I do use foreachRDD in the program. I am planning to use Spark Streaming in our production pipeline, and it performs well in generating the results.
Unfortunately, we plan to run the production pipeline 24/7, and the Spark Streaming job usually fails after 8-20 hours due to the "cannot compute split" exception. In other cases, the Kafka receiver fails and the program runs without producing any result. In my pipeline, the batch size is 1 minute and the data volume per minute from Kafka is 3G. I have been struggling with this issue for more than a month. It will be great if you can provide some solutions for this. Thanks! Bill

On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you elaborate on the usage pattern that leads to "cannot compute split"? Are you using the RDDs generated by a DStream outside the DStream logic? Something like running interactive Spark jobs (independent of the Spark Streaming ones) on RDDs generated by DStreams? If that is the case, what is happening is that Spark Streaming is not aware that some of the RDDs (and the raw input data that they will need) will be used by Spark jobs unrelated to Spark Streaming. Hence Spark Streaming will actively clear off the raw data, leading to failures in the unrelated Spark jobs using that data. If this is your use case, the cleanest way to solve it is to ask Spark Streaming to remember the data for longer, by using streamingContext.remember(duration). This will ensure that Spark Streaming keeps everything around for at least that duration. Hope this helps. TD

On Wed, Nov 26, 2014 at 5:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Just to add one more point: if Spark Streaming knows when an RDD will not be used any more, I believe Spark will not try to retrieve data it will not use again. However, in practice, I often encounter the "cannot compute split" error. Based on my understanding, this is because Spark cleared out data that would be used again. In my case, the data volume is much smaller (30M/s, with a batch size of 60 seconds) than the memory (20G per executor).
If Spark only keeps RDDs that are in use, I expect that this error should not happen. Bill

On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Let me further clarify Lalit's point on when RDDs generated by DStreams are destroyed, and hopefully that will answer your original questions. 1. How does Spark (Streaming) guarantee that all the actions are taken on each input RDD/batch? This isn't hard! By the time you call streamingContext.start(), you have already set up the output operations (foreachRDD, saveAs***Files, etc.) that you want to apply to the DStream. There are RDD actions inside the DStream output operations that need to be run every batch interval. So all the system does is this: after every batch interval, it puts all the output operations (which will call RDD actions) in a job queue, and then keeps executing the jobs in the queue. If there is any failure in running the jobs, the streaming context will stop. 2. How does Spark determine
Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank
Thanks Ankur, it's really helpful. I have a few queries on optimization techniques. For the current run I used the RandomVertexCut partition strategy, but what partition strategy should be used if: 1. the number of edges in the edge-list file is very large, like 50,000,000, with many parallel edges between the same pairs of vertices; 2. the number of unique vertices in the above edge-list file is very large, say 10,000,000; 3. the number of unique vertices in the above edge-list file is small, say less than 100,000?

On 27 November 2014 at 20:23, ankurdave [via Apache Spark User List] ml-node+s1001560n1995...@n3.nabble.com wrote: At 2014-11-24 19:02:08 -0800, Harihar Nahak [hidden email] wrote: According to the documentation, GraphX runs 10x faster than normal Spark. So I ran the PageRank algorithm in both applications: [...]

Local Mode (Machine: 8 cores; 16 GB memory; 2.80 GHz Intel i7; executor memory: 4 GB; no. of partitions: 50; no. of iterations: 2):
Spark PageRank took 21.29 min; GraphX PageRank took 42.01 min

Cluster Mode (Ubuntu 12.04; Spark 1.1/Hadoop 2.4 cluster; 3 workers, 1 driver, 8 cores, 30 GB memory; executor memory 4 GB; no. of edge partitions: 50, random vertex cut; no. of iterations: 2):
Spark PageRank took 10.54 min; GraphX PageRank took 7.54 min

Could you please help me determine when to use Spark and when to use GraphX? If GraphX takes the same amount of time as Spark, then it's better to use Spark, because Spark has a variety of operators to deal with any type of RDD.

If you have a problem that's naturally expressible as a graph computation, it makes sense to use GraphX in my opinion. In addition to the optimizations that GraphX incorporates, which you would otherwise have to implement manually, GraphX's programming model is likely a better fit. But even if you start off by using pure Spark, you'll still have the flexibility to use GraphX for other parts of the problem, since it's part of the same system. To address the benchmark results you got: 1.
GraphX takes more time than Spark to load the graph, because it has to index it, but subsequent iterations should be faster. We benchmarked with 20 iterations to show this effect, but you only used 2 iterations, which doesn't give much time to amortize the loading cost. 2. The benchmarks in the GraphX OSDI paper are against a naive implementation of PageRank in Spark, while the version you benchmarked against has some of the same optimizations as GraphX does. I believe we found that the optimized Spark PageRank was only 3x slower than GraphX. 3. When running those benchmarks, we used an experimental version of Spark with in-memory shuffle, which disproportionately benefits GraphX since its shuffle files are smaller due to specialized compression. 4. We haven't optimized GraphX for local mode, so it's not surprising that it's slower there. Ankur
-- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 --Harihar

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-or-GraphX-runs-fast-a-performance-comparison-on-Page-Rank-tp19710p19986.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
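For anyone reproducing the comparison in this thread, the GraphX side of the benchmark is invoked roughly like this (a sketch; the edge-list path is a placeholder, and the 20 iterations follow Ankur's note about amortizing the loading cost):

```scala
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// GraphX version: load the edge list, choose a partition strategy,
// then run a fixed number of PageRank iterations.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges")
  .partitionBy(PartitionStrategy.RandomVertexCut)

// staticPageRank runs exactly numIter iterations; use more iterations
// (e.g. 20) so the one-time graph indexing cost is amortized.
val ranks = graph.staticPageRank(20).vertices
```

The "Spark" side of the comparison is a plain RDD implementation of PageRank (e.g. the SparkPageRank example shipped with Spark), which avoids the indexing cost but lacks GraphX's join and compression optimizations.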
Re: Lifecycle of RDD in spark-streaming
When new data comes in on a stream, Spark uses the stream classes to convert it into RDDs, and as you mention, that is followed by transformations and finally actions. As far as I have experienced, all RDDs remain in memory until the user destroys them or for as long as the application is alive.

On 26 November 2014 at 20:05, Mukesh Jha [via Apache Spark User List] ml-node+s1001560n19835...@n3.nabble.com wrote: Any pointers guys? On Tue, Nov 25, 2014 at 5:32 PM, Mukesh Jha [hidden email] wrote: Hey Experts, I wanted to understand in detail the lifecycle of RDD(s) in a streaming app. From my current understanding: - An RDD gets created out of the realtime input stream. - Transform functions are applied in a lazy fashion on the RDD to transform it into other RDD(s). - Actions are taken on the final transformed RDDs to get the data out of the system. Also, RDD(s) are stored in the cluster's RAM (disk if configured so) and are cleaned in LRU fashion. So I have the following questions on the same: - How does Spark (Streaming) guarantee that all the actions are taken on each input RDD/batch? - How does Spark determine that the life-cycle of an RDD is complete? Is there any chance that an RDD will be cleaned out of RAM before all actions are taken on it? Thanks in advance for all your help. Also, I'm relatively new to Scala/Spark, so pardon me in case these are naive questions/assumptions.
-- Thanks & Regards, *[hidden email]*

-- Regards, Harihar Nahak BigData Developer Wynyard Email: hna...@wynyardgroup.com | Extn: 8019 --Harihar

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Lifecycle-of-RDD-in-spark-streaming-tp19749p19987.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
SVD Plus Plus in GraphX
Hi, I was just going through two pieces of code in GraphX, namely SVDPlusPlus and TriangleCount. In the first I see an RDD as an input to run, i.e., run(edges: RDD[Edge[Double]], ...), and in the other I see run(VD: ..., ED: ...). Can anyone explain the difference between these two? In fact, SVDPlusPlus is the only GraphX code in Spark 1.0.0 in which I have seen an RDD as an input. Thank you
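To sketch the difference being asked about (based on my reading of the 1.0-era GraphX sources, so treat the exact signatures as assumptions): SVDPlusPlus builds its own bipartite user/item graph internally, so its entry point takes the raw edges plus a configuration object, whereas TriangleCount — like most algorithms in graphx.lib — assumes you already have a constructed Graph, and is therefore generic in the vertex and edge attribute types VD and ED:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// SVDPlusPlus constructs the graph itself from the ratings edges:
//   SVDPlusPlus.run(edges: RDD[Edge[Double]], conf: SVDPlusPlus.Conf)
// so its input is a plain edge RDD, e.g.:
val ratings: RDD[Edge[Double]] = sc.parallelize(Seq(Edge(1L, 101L, 4.0), Edge(2L, 101L, 2.0)))

// TriangleCount operates on an existing graph, so it is parameterized by
// the graph's attribute types rather than taking an RDD:
//   TriangleCount.run[VD, ED](graph: Graph[VD, ED]): Graph[Int, ED]
val graph: Graph[Int, Double] = Graph.fromEdges(ratings, defaultValue = 0)
```

In other words, the VD/ED in TriangleCount.run are type parameters describing the graph you pass in, not a second kind of data input.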
RE: Auto BroadcastJoin optimization failed in latest Spark
Hi Jianshi, I couldn't reproduce that with the latest master, and I always get the BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there any external tables in your case? In general, there are a couple of things you can try first (with HiveContext): 1) ANALYZE TABLE xxx COMPUTE STATISTICS noscan; (apply that to all of the tables); 2) SET spark.sql.autoBroadcastJoinThreshold=xxx; (set the threshold to a greater value; it is 1024*1024*10 by default, just make sure the maximum dimension table size in bytes is less than this); 3) always put the main table (the biggest table) left-most among the inner joins. DESC EXTENDED tablename; will print the detailed table statistics, including the size (the field "totalSize"). EXPLAIN EXTENDED query; will print the detailed physical plan. Let me know if you still have problems. Hao

From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, November 27, 2014 10:24 PM To: Cheng, Hao Cc: user Subject: Re: Auto BroadcastJoin optimization failed in latest Spark

Hi Hao, I'm using inner joins, as broadcast join didn't work for left joins (thanks for the links to the latest improvements). And I'm using HiveContext, and it worked in a previous build (10/12) when joining 15 dimension tables. Jianshi

On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao hao.ch...@intel.com wrote: Are all of your join keys the same? And I guess the join types are all "left" joins; https://github.com/apache/spark/pull/3362 is probably what you need. Also, SparkSQL doesn't currently support multiway joins (or multiway broadcast joins); https://github.com/apache/spark/pull/3270 should be another optimization for this.
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Wednesday, November 26, 2014 4:36 PM To: user Subject: Auto BroadcastJoin optimization failed in latest Spark

Hi, I've confirmed that the latest Spark with either Hive 0.12 or 0.13.1 fails to optimize the auto broadcast join in my query. I have a query that joins a huge fact table with 15 tiny dimension tables. I'm currently using an older version of Spark which was built on Oct. 12. Has anyone else met a similar situation? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
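Hao's checklist above, expressed as the actual statements one would run through a HiveContext (a sketch; the table names and threshold value are placeholders):

```scala
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// 1) Refresh table statistics so the planner knows the dimension tables are small.
hc.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS noscan")

// 2) Raise the auto-broadcast threshold (in bytes); the default is 1024*1024*10.
hc.sql("SET spark.sql.autoBroadcastJoinThreshold=" + (1024 * 1024 * 50))

// 3) Keep the big fact table left-most among the inner joins, then check the
//    physical plan for BroadcastHashJoin operators.
hc.sql("EXPLAIN EXTENDED SELECT * FROM fact f JOIN dim_table d ON f.k = d.k")
  .collect().foreach(println)
```

DESC EXTENDED dim_table can then be used to confirm the "totalSize" statistic is below the threshold.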
Creating a SchemaRDD from an existing API
Hi, I am evaluating Spark for an analytics component where we do batch processing of data using SQL. So, I am particularly interested in Spark SQL and in creating a SchemaRDD from an existing API [1]. This API exposes elements in a database as datasources. Using the methods allowed by this data source, we can access and edit data. So, I want to create a custom SchemaRDD using the methods and provisions of this API. I tried going through the Spark documentation and the Java docs, but unfortunately I was unable to come to a final conclusion on whether this is actually possible. I would like to ask the Spark devs: 1. As of the current Spark release, can we make a custom SchemaRDD? 2. What is the extension point for a custom SchemaRDD? Or are there particular interfaces? 3. Could you please point me to the specific docs regarding this matter? Your help in this regard is highly appreciated. Cheers [1] https://github.com/wso2-dev/carbon-analytics/tree/master/components/xanalytics -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 https://twitter.com/N1R44
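For what it's worth, in the 1.1-era API the usual route to a "custom" SchemaRDD is to wrap the external datasource in an RDD[Row] and call applySchema — a sketch, with the parallelized rows standing in for whatever the external API actually returns:

```scala
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Hypothetical: rows fetched from the external datasource API, wrapped as an RDD[Row].
val rows = sc.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))

// Describe the columns explicitly.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// applySchema turns a plain RDD[Row] into a SchemaRDD that can be registered and queried.
val users = sqlContext.applySchema(rows, schema)
users.registerTempTable("users")
sqlContext.sql("SELECT name FROM users WHERE id = 1")
```

This doesn't subclass SchemaRDD itself, but it lets any datasource that can produce rows participate in Spark SQL queries.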
Re: read both local path and HDFS path
Hi, The configuration you provide is just for accessing HDFS when you give an HDFS path. When you provide an HDFS path with the HDFS nameservice, like hmaster155:9000 in your case, Spark goes into HDFS to look for the file. For accessing a local file, just give the local path of the file: go to the file's directory on the local machine and run pwd; this will give you the full path of the file. Give that path as your local path for the file and you should be fine. Thanks.

On Fri, Nov 28, 2014 at 8:57 AM, tuyuri [via Apache Spark User List] ml-node+s1001560n19990...@n3.nabble.com wrote: I have set up a Spark cluster configured with HDFS, and I know that by default file paths will be read by Spark from HDFS. For example, /ad-cpc/2014-11-28/ will be read as hdfs://hmaster155:9000/ad-cpc/2014-11-28/. Sometimes I wonder how I can force Spark to read a file from the local filesystem without reconfiguring my cluster (to not use HDFS). Please help me!
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/read-both-local-path-and-HDFS-path-tp19990p19995.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
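One detail worth adding to the answer above: when HDFS is the default filesystem, an explicit URI scheme on the path forces the filesystem choice per-read, with no cluster reconfiguration needed. A sketch (the paths are placeholders; note that with a local path, every worker node must be able to see the file at that same path):

```scala
// A bare path resolves against the default filesystem (HDFS here):
val fromDefault = sc.textFile("/ad-cpc/2014-11-28/")

// Fully qualified HDFS URI — equivalent to the above in this cluster:
val fromHdfs = sc.textFile("hdfs://hmaster155:9000/ad-cpc/2014-11-28/")

// An explicit file:// URI forces a local filesystem read instead:
val fromLocal = sc.textFile("file:///home/user/data/2014-11-28/")
```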
Unable to compile spark 1.1.0 on windows 8.1
Hi, I am trying to compile Spark 1.1.0 on Windows 8.1, but I get the following exception:

[info] Compiling 3 Scala sources to D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26: object sbt is not a member of package com.typesafe
[error] import com.typesafe.sbt.pom.{PomBuild, SbtPomKeys}
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:53: not found: type PomBuild
[error] object SparkBuild extends PomBuild {
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:121: not found: value SbtPomKeys
[error] otherResolvers = SbtPomKeys.mvnLocalRepository(dotM2 = Seq(Resolver.file(dotM2, dotM2))),
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:165: value projectDefinitions is not a member of AnyRef
[error] super.projectDefinitions(baseDirectory).map { x =
[error] four errors found
[error] (plugins/compile:compile) Compilation failed

I have also set up Scala 2.10. I need help to resolve this issue. Regards, Ishwardeep

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-compile-spark-1-1-0-on-windows-8-1-tp19996.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org