Re: Running Spark on Kubernetes (GKE) - failing on spark-submit
The configuration of '…file.upload.path' is wrong: it means a distributed-filesystem path where Spark temporarily stores your archives/resources/jars before distributing them to the drivers/executors. For your case, you don't need to set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang wrote:

Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is failing. Note: I have Spark (bitnami spark chart) installed on GKE using helm install.

Here is what is done:

1. created a docker image using Dockerfile

Dockerfile:
```
FROM python:3.7-slim
RUN apt-get update && \
    apt-get install -y default-jre && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample
COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
CMD ["pyspark"]
```

Simple pyspark application:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))
df.show(5, False)
```

Spark-submit command:
```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error I get:
```
23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
```
    at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
    at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
    at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
    at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
    at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:89)
    at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
    at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
    at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
    at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
    at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
    ... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
    at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
    ...
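Since the Dockerfile already copies the script into the image, one way to act on the advice above is to drop the upload-path setting and point spark-submit at the in-image copy. This is a sketch, not from the thread: the `local://` scheme tells Spark the file is already present inside the container image, and the paths are taken from the question.

```shell
# Hypothetical variant of the original command: no file.upload.path needed,
# because local:// refers to the copy baked into pyspark-example:0.1.
spark-submit \
  --master k8s://https://34.74.22.140:7077 \
  --deploy-mode cluster \
  --name pyspark-example \
  --conf spark.kubernetes.container.image=pyspark-example:0.1 \
  local:///myexample/StructuredStream-on-gke.py
```

The upload path only matters when spark-submit must stage a local file into a distributed filesystem (e.g. an HDFS or GCS path) for the driver and executors to fetch; `/myexample` was being treated as a path on the submitting machine, hence the `Mkdirs failed` error.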
Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS
Hi AlexG:

Files (blocks, more specifically) have 3 copies on HDFS by default. So 3.8 * 3 = 11.4 TB.

--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:

> I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
> with 16.73 TB storage, using distcp. The dataset is a collection of tar files
> of about 1.7 TB each. Nothing else was stored in the HDFS, but after completing
> the download, the namenode page says that 11.59 TB are in use. When I use
> hdfs du -h -s, I see that the dataset only takes up 3.8 TB as expected. I
> navigated through the entire HDFS hierarchy from /, and don't see where the
> missing space is. Any ideas what is going on and how to rectify it?
>
> I'm using the spark-ec2 script to launch, with the command
>
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
>
> and am not modifying any configuration files for Hadoop.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
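The reply's arithmetic in one line; the small gap between 11.4 TB and the reported 11.59 TB is whatever else HDFS is accounting for, not part of this calculation:

```python
logical_tb = 3.8    # size reported by `hdfs du -h -s` (pre-replication)
replication = 3     # HDFS default value of dfs.replication
raw_tb = round(logical_tb * replication, 2)
print(raw_tb)       # 11.4 -- roughly the 11.59 TB the namenode page shows
```

This is also why `du` and the namenode disagree: `du` reports logical file sizes, while the namenode counts raw block storage across all replicas.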
Re: An interesting and serious problem I encountered
Hi, I believe SizeOf.jar may calculate the wrong size for you. Spark has a util called SizeEstimator, located in org.apache.spark.util.SizeEstimator, and someone extracted it out in https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala — you can try that out in the Scala REPL.

The size for Array[Int](43) is 192 bytes (12 bytes object header + 4 bytes length field + (43 * 4 rounded to 176 bytes)).

And the size for (1, Array[Int](43)) is 240 bytes:
- Tuple2 object: 12 bytes object size + 4 bytes field _1 + 4 bytes field _2, rounded to 24 bytes
- 1, boxed: java.lang.Number is 12 bytes, rounded to 16 bytes; java.lang.Integer is those 16 bytes + 4 bytes int, rounded to 24 bytes (Integer extends Number. I thought Scala's Tuple2 would specialize Int and this should be 4 bytes, but it seems not)
- Array: 192 bytes

So 24 + 24 + 192 = 240 bytes. This is my calculation based on the Spark SizeEstimator. However, I am not sure what an Integer occupies on a 64-bit JVM with compressedOops on. It should be 12 + 4 = 16 bytes; if so, that means the SizeEstimator gives the wrong result. @Sean what do you think?

--
Ye Xianjin
Sent with Sparrow

On Friday, February 13, 2015 at 2:26 PM, Landmark wrote:

Hi folks,

My Spark cluster has 8 machines, each of which has 377 GB physical memory, and thus the total maximum memory that can be used for Spark is more than 2400 GB. In my program, I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with 43 elements. Therefore, the memory cost of this raw dataset is [(1+43) * 10^9 * 4] / (1024 * 1024 * 1024) = 164 GB.

Since I have to use this dataset repeatedly, I have to cache it in memory. Some key parameter settings are: spark.storage.fraction=0.6, spark.driver.memory=30GB, spark.executor.memory=310GB.
But it failed on running a simple countByKey(), and the error message is "java.lang.OutOfMemoryError: Java heap space". Does this mean a Spark cluster with 2400+ GB of memory cannot keep 164 GB of raw data in memory?

The code of my program is as follows:

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf());
  val rdd = sc.parallelize(0 until 1000000000, 25600).map(i => (i, new Array[Int](43))).cache();
  println("The number of keys is " + rdd.countByKey());
  //some other operations following here ...
}

To figure out the issue, I evaluated the memory cost of the key-value pairs using SizeOf.jar. The code is as follows:

val arr = new Array[Int](43);
println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)));

val tuple = (1, arr.clone);
println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)));

The output is:

192.0b
992.0b

*Hard to believe, but it is true!! This result means that, to store a key-value pair, Tuple2 needs more than 5 times the memory of the simplest method with an array. Even though it may take 5+ times the memory, its size is less than 1000 GB, which is still much less than the total memory of my cluster, i.e., 2400+ GB. I really do not understand why this happened.*

BTW, if the number of pairs is 1 million, it works well. If the arr contains only 1 integer, Tuple2 needs around 10 times the memory to store a pair.

So I have some questions:
1. Why does Spark choose such a poor data structure, Tuple2, for key-value pairs? Is there any better data structure for storing (key, value) pairs with less memory cost?
2. Given a dataset of size M, in general how many times that memory does Spark need to handle it?

Best,
Landmark

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
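The byte counts being debated follow from two JVM layout rules assumed in the reply (64-bit JVM with compressed oops): a 12-byte object header and rounding object sizes up to a multiple of 8. A quick check of the reply's arithmetic, with the layout constants as assumed in the email rather than measured:

```python
def align8(n):
    """JVM object sizes are padded up to a multiple of 8 bytes."""
    return (n + 7) // 8 * 8

array_bytes = align8(12 + 4 + 43 * 4)     # header + length field + 43 ints
tuple2_bytes = align8(12 + 4 + 4)         # header + _1 ref + _2 ref
number_bytes = align8(12)                 # java.lang.Number part
integer_bytes = align8(number_bytes + 4)  # plus the boxed int field
total = tuple2_bytes + integer_bytes + array_bytes
print(array_bytes, integer_bytes, total)  # 192 24 240
```

Note this reproduces the SizeEstimator-style numbers (192 and 240), not SizeOf.jar's 992 — which is exactly the discrepancy the thread is about.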
Re: Can't run Spark java code from command line
There is no binding issue here. Spark picks the right IP, 10.211.55.3, for you. The printed message is just an indication. However, I have no idea why spark-shell hangs or stops.

Sent from my iPhone

On Jan 14, 2015, at 5:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

It's just a binding issue with the hostnames in your /etc/hosts file. You can set SPARK_LOCAL_IP and SPARK_MASTER_IP in your conf/spark-env.sh file and restart your cluster. (In that case the spark://myworkstation:7077 will change to the IP address that you provided, e.g. spark://10.211.55.3.)

Thanks
Best Regards

On Tue, Jan 13, 2015 at 11:15 PM, jeremy p athomewithagroove...@gmail.com wrote:

Hello all,

I wrote some Java code that uses Spark, but for some reason I can't run it from the command line. I am running Spark on a single node (my workstation). The program stops running after this line is executed:

SparkContext sparkContext = new SparkContext("spark://myworkstation:7077", "sparkbase");

When that line is executed, this is printed to the screen:

15/01/12 15:56:19 WARN util.Utils: Your hostname, myworkstation resolves to a loopback address: 127.0.1.1; using 10.211.55.3 instead (on interface eth0)
15/01/12 15:56:19 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/01/12 15:56:19 INFO spark.SecurityManager: Changing view acls to: myusername
15/01/12 15:56:19 INFO spark.SecurityManager: Changing modify acls to: myusername
15/01/12 15:56:19 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myusername); users with modify permissions: Set(myusername)

After it writes this to the screen, the program stops executing without reporting an exception. What's odd is that when I run this code from Eclipse, the same lines are printed to the screen, but the program keeps executing. Don't know if it matters, but I'm using the maven assembly plugin, which includes the dependencies in the JAR.
Here are the versions I'm using:

Cloudera: 2.5.0-cdh5.2.1
Hadoop: 2.5.0-cdh5.2.1
HBase: 0.98.6-cdh5.2.1
Java: 1.7.0_65
Ubuntu: 14.04.1 LTS
Spark: 1.2
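Akhil's suggestion from the thread, as a concrete config sketch. The IP is the one Spark picked in the log output above; substitute your own, and note the restart commands are the usual standalone-mode scripts, an assumption beyond what the thread spells out:

```shell
# conf/spark-env.sh on the workstation acting as master and worker
export SPARK_LOCAL_IP=10.211.55.3
export SPARK_MASTER_IP=10.211.55.3

# Then restart the standalone cluster:
#   sbin/stop-all.sh && sbin/start-all.sh
# and connect with spark://10.211.55.3:7077 instead of spark://myworkstation:7077
```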
Re: Is it safe to use Scala 2.11 for Spark build?
$$withChannelRetries$1(Locks.scala:78)
    at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
    at xsbt.boot.Using$.withResource(Using.scala:10)
    at xsbt.boot.Using$.apply(Using.scala:9)
    at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
    at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
    at xsbt.boot.Locks$.apply0(Locks.scala:31)
    at xsbt.boot.Locks$.apply(Locks.scala:28)
    at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
    at sbt.IvySbt.withIvy(Ivy.scala:119)
    at sbt.IvySbt.withIvy(Ivy.scala:116)
    at sbt.IvySbt$Module.withModule(Ivy.scala:147)
    at sbt.IvyActions$.updateEither(IvyActions.scala:156)
    at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1282)
    at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1279)
    at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1309)
    at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1307)
    at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
    at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1312)
    at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1306)
    at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
    at sbt.Classpaths$.cachedUpdate(Defaults.scala:1324)
    at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1264)
    at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1242)
    at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
    at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
    at sbt.std.Transform$$anon$4.work(System.scala:63)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
    at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
    at sbt.Execute.work(Execute.scala:235)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
    at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
    at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found

--
Ye Xianjin
Sent with Sparrow

On Tuesday, November 18, 2014 at 3:27 PM, Prashant Sharma wrote:

It is safe in the sense that we would help you with a fix if you run into issues. I have used it, but since I worked on the patch my opinion may be biased. I am using Scala 2.11 for day-to-day development. You should check out the build instructions here: https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md

Prashant Sharma

On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Any notable issues with using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my Spark application and use a Spark dist built with 2.10? I'm looking forward to migrating to 2.11 for some quasiquote features. Couldn't make it run on 2.10...

Cheers,
--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
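The two unresolved dependencies above are the classic symptom of building against 2.11 without switching the build over first. The Spark 1.x build docs of that era handled this by rewriting the POMs and enabling a dedicated profile; quoted from memory, so treat the exact script name and flags as assumptions to check against the building-spark.md linked below:

```shell
# Rewrite module artifact IDs from _2.10 to _2.11, then build with the
# scala-2.11 profile (which swaps quasiquotes and skips kafka_2.11, which
# did not exist for Kafka 0.8.0)
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
```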
Re: groupBy gives non deterministic results
Great. And you should ask questions on the user@spark.apache.org mailing list; I believe many people don't subscribe to the incubator mailing list now.

--
Ye Xianjin
Sent with Sparrow

On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote:

Hi,

I am using Spark 1.0.0. The bug is fixed by 1.0.1.

Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: groupBy gives non deterministic results
| Do the two mailing lists share messages?

I don't think so. I didn't receive this message from the user list. I am not in Databricks, so I can't answer your other questions. Maybe Davies Liu (dav...@databricks.com) can answer you?

--
Ye Xianjin
Sent with Sparrow

On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

Hi, Xianjin

I checked user@spark.apache.org, and found my post there: http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser

I am using Nabble to send this mail, which indicates that the mail will be sent from my email address to the u...@spark.incubator.apache.org mailing list. Do the two mailing lists share messages? Do we have a Nabble interface for the user@spark.apache.org mailing list?

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: groupBy gives non deterministic results
Well, that's weird. I don't see this thread in my mailbox as sent to the user list. Maybe because I also subscribe to the incubator mailing list? I do see mails sent to the incubator mailing list that no one replies to; I thought it was because people don't subscribe to the incubator list now.

--
Ye Xianjin
Sent with Sparrow

On Thursday, September 11, 2014 at 12:12 AM, Davies Liu wrote:

I think the mails to spark.incubator.apache.org will be forwarded to spark.apache.org. Here is the header of the first mail:

from: redocpot julien19890...@gmail.com
to: u...@spark.incubator.apache.org
date: Mon, Sep 8, 2014 at 7:29 AM
subject: groupBy gives non deterministic results
mailing list: user.spark.apache.org
mailed-by: spark.apache.org

I only subscribe to spark.apache.org, and I do see all the mails from him.

On Wed, Sep 10, 2014 at 6:29 AM, Ye Xianjin advance...@gmail.com wrote:

| Do the two mailing lists share messages?

I don't think so. I didn't receive this message from the user list. I am not in Databricks, so I can't answer your other questions. Maybe Davies Liu (dav...@databricks.com) can answer you?

--
Ye Xianjin
Sent with Sparrow

On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

Hi, Xianjin

I checked user@spark.apache.org, and found my post there: http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser

I am using Nabble to send this mail, which indicates that the mail will be sent from my email address to the u...@spark.incubator.apache.org mailing list. Do the two mailing lists share messages?
Do we have a Nabble interface for the user@spark.apache.org mailing list?

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: groupBy gives non deterministic results
Can you provide a small sample or test data that reproduces this problem? And what's your env setup, single node or cluster?

Sent from my iPhone

On Sep 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote:

Hi,

I have a key-value RDD called rdd below. After a groupBy, I tried to count rows. But the result is not unique, somehow non-deterministic. Here is the test code:

val step1 = ligneReceipt_cleTable.persist
val step2 = step1.groupByKey

val s1size = step1.count
val s2size = step2.count

val t = step2 // rdd after groupBy

val t1 = t.count
val t2 = t.count
val t3 = t.count
val t4 = t.count
val t5 = t.count
val t6 = t.count
val t7 = t.count
val t8 = t.count

println("s1size = " + s1size)
println("s2size = " + s2size)
println("1 = " + t1)
println("2 = " + t2)
println("3 = " + t3)
println("4 = " + t4)
println("5 = " + t5)
println("6 = " + t6)
println("7 = " + t7)
println("8 = " + t8)

Here are the results:

s1size = 5338864
s2size = 5268001
1 = 5268002
2 = 5268001
3 = 5268001
4 = 5268002
5 = 5268001
6 = 5268002
7 = 5268002
8 = 5268001

Even if the difference is just one row, that's annoying. Any idea?

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: distcp on ec2 standalone spark cluster
What did you see in the log? Was there anything related to mapreduce? Can you log into your HDFS (data) node, use jps to list all Java processes, and confirm whether there is a TaskTracker process (or NodeManager) running alongside the DataNode process?

--
Ye Xianjin
Sent with Sparrow

On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Tomer,

Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too.

Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:

~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
    at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea?

Thanks!
Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:

If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:

Hi,

I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp?

Thanks,
Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001
ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
    at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
Well, this means you didn't start a compute cluster. Most likely the wrong value of mapreduce.jobtracker.address caused the slave node to fail to start the node manager. (I am not familiar with the ec2 script, so I don't know whether the slave node has a node manager installed or not.) Can you check the hadoop daemon log on the slave node to see whether the nodemanager started but failed, or there is no nodemanager to start? The log file location defaults to /var/log/hadoop-xxx, if my memory is correct.

Sent from my iPhone

On Sep 9, 2014, at 0:08, Tomer Benyamini tomer@gmail.com wrote:

No tasktracker or nodemanager. This is what I see:

On the master:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.namenode.NameNode

On the data node (slave):
org.apache.hadoop.hdfs.server.datanode.DataNode

On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote:

What did you see in the log? Was there anything related to mapreduce? Can you log into your HDFS (data) node, use jps to list all Java processes, and confirm whether there is a TaskTracker process (or NodeManager) running alongside the DataNode process?

--
Ye Xianjin
Sent with Sparrow

On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Tomer,

Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too.

Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:

~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
    at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea?

Thanks!

Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:

If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:

Hi,

I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp?

Thanks,
Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001
ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
    at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
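The jps check suggested earlier in this thread, spelled out. Process names vary between MRv1 and YARN deployments, so this is a sketch of what to look for rather than guaranteed output:

```shell
# On the master: expect NameNode and SecondaryNameNode, plus a JobTracker
# (MRv1) or ResourceManager (YARN) if MapReduce is actually up
jps

# On each slave: expect DataNode plus a TaskTracker (MRv1) or NodeManager
# (YARN). DataNode alone, as in Tomer's output, means no compute daemons
# are running, so distcp has nowhere to run its map tasks.
jps
```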
Re: Too many open files
Oops, the last reply didn't go to the user list. Mail app's fault.

Shuffling happens in the cluster, so you need to change all the nodes in the cluster.

Sent from my iPhone

On Aug 30, 2014, at 3:10, Sudha Krishna skrishna...@gmail.com wrote:

Hi,

Thanks for your response. Do you know if I need to change this limit on all the cluster nodes or just the master?

Thanks

On Aug 29, 2014 11:43 AM, Ye Xianjin advance...@gmail.com wrote:

1024 for the open-file limit is most likely too small for Linux machines in production. Try to set it to 65536 or unlimited if you can. The "Too many open files" error occurs because there are a lot of shuffle files (if wrong, please correct me).

Sent from my iPhone

On Aug 30, 2014, at 2:06, SK skrishna...@gmail.com wrote:

Hi,

I am having the same problem reported by Michael. I am trying to open 30 files. ulimit -n shows the limit is 1024. So I am not sure why the program is failing with "Too many open files" error. The total size of all the 30 files is 230 GB. I am running the job on a cluster with 10 nodes, each having 16 GB. The error appears to be happening at the distinct() stage.

Here is my program. In the following code, are all the 10 nodes trying to open all of the 30 files, or are the files distributed among the 10 nodes?

val baseFile = "/mapr/mapr_dir/files_2013apr*"
val x = sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
  (fields(11), fields(6))
}.distinct().countByKey()
val xrdd = sc.parallelize(x.toSeq)
xrdd.saveAsTextFile(...)

Instead of using the glob *, I guess I can try using a for loop to read the files one by one if that helps, but not sure if there is a more efficient solution.
The following is the error transcript:

Job aborted due to stage failure: Task 1.0:201 failed 4 times, most recent failure: Exception failure in TID 902 on host 192.168.13.11: java.io.FileNotFoundException: /tmp/spark-local-20140829131200-0bb7/08/shuffle_0_201_999 (Too many open files)
    java.io.FileOutputStream.open(Native Method)
    java.io.FileOutputStream.&lt;init&gt;(FileOutputStream.java:221)
    org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
    org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
    org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
    org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    org.apache.spark.util.collection.AppendOnlyMap$$anon$1.foreach(AppendOnlyMap.scala:159)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    org.apache.spark.scheduler.Task.run(Task.scala:51)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
Driver stacktrace:

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-tp1464p13144.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
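As the replies above note, the limit has to be raised on every node that runs executors, not just the master. A sketch of the usual Linux mechanism (the value and the user name are illustrative):

```
# Check the current per-process open-file soft limit
ulimit -n

# Raise it for the current shell session (cannot exceed the hard limit)
ulimit -n 65536

# To make it persistent for the user running Spark on each node, add to
# /etc/security/limits.conf (then re-login or restart the workers):
spark  soft  nofile  65536
spark  hard  nofile  65536
```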
Re: defaultMinPartitions in textFile
Well, I think you missed this line of code in SparkContext.scala, lines 1242-1243 (master):

    /** Default min number of partitions for Hadoop RDDs when not given by user */
    def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

So defaultMinPartitions will be 2 unless defaultParallelism is less than 2...

--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote:

Hi, I started to use Spark on YARN recently and found a problem while tuning my program. When SparkContext is initialized as sc and is ready to read a text file from HDFS, the textFile(path, defaultMinPartitions) method is called. I traced the second parameter down through the Spark source code and finally found this in CoarseGrainedSchedulerBackend.scala:

    conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))

I do not specify the property "spark.default.parallelism" anywhere, so getInt will return the larger of totalCoreCount and 2. When I submit the application using spark-submit with --num-executors 2 --executor-cores 6, I suppose totalCoreCount will be 2*6 = 12, so defaultMinPartitions will be 12. But when I print the value of defaultMinPartitions in my program, I still get 2. How does this happen, or where did I make a mistake?
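The capping behaviour quoted above can be sketched outside Spark. Plain Python stands in here for the two Scala expressions, using the 2 x 6 = 12 cores from the question:

```python
# Mirrors CoarseGrainedSchedulerBackend: with spark.default.parallelism unset,
# defaultParallelism falls back to max(totalCoreCount, 2).
def default_parallelism(total_core_count, configured=None):
    return configured if configured is not None else max(total_core_count, 2)

# Mirrors SparkContext.defaultMinPartitions: the result is capped at 2,
# no matter how large defaultParallelism is.
def default_min_partitions(parallelism):
    return min(parallelism, 2)

parallelism = default_parallelism(2 * 6)       # 12 cores -> defaultParallelism = 12
partitions = default_min_partitions(parallelism)
print(parallelism, partitions)  # 12 2
```

So textFile() sees a default of 2 partitions even on a 12-core cluster: the min() cap, not the core count, wins, which matches what the question observes.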
Re: Where to set proxy in order to run ./install-dev.sh for SparkR
You can try setting your HTTP_PROXY environment variable:

    export HTTP_PROXY=host:port

But I don't use Maven. If the env variable doesn't work, please search Google for "maven proxy"; I am sure there will be a lot of related results.

Sent from my iPhone

On Jul 2, 2014, at 19:04, Stuti Awasthi stutiawas...@hcl.com wrote:

Hi, I wanted to build SparkR from source, but I am running the script behind a proxy. Where shall I set the proxy host and port in order to build the source? The issue is that it is not able to download dependencies from Maven. Thanks, Stuti Awasthi

::DISCLAIMER:: The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free, as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e-mail and its contents (with or without referred errors) shall therefore not attach any liability to the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this message without the prior written consent of an authorized representative of HCL is strictly prohibited. If you have received this email in error, please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
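As the reply hints, the environment variable may not be enough: Maven conventionally takes its proxy settings from ~/.m2/settings.xml. A minimal sketch (the id, host, and port are placeholders for the actual corporate proxy):

```
<!-- ~/.m2/settings.xml -->
<settings>
  <proxies>
    <proxy>
      <id>corp-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
```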
Re: Set comparison
If you want a string with quotes, you have to escape them with '\'. That's exactly what you did in the modified version.

Sent from my iPhone

On Jun 17, 2014, at 5:43, SK skrishna...@gmail.com wrote:

In line 1, I have expected_res as a set of strings with quotes, so I thought it would include the quotes during comparison. Anyway, I modified it to expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and that seems to work. Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
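The pitfall in this thread can be sketched in Python standing in for the Scala Set (the element values are illustrative):

```python
# A set of plain strings: the elements are ID1, ID2, ID3 -- the surrounding
# quotes in the source code are delimiters, not part of the values.
expected_plain = {"ID1", "ID2", "ID3"}

# A set whose elements themselves contain double-quote characters, i.e. "ID1".
# In Scala this requires escaping: Set("\"ID1\"", "\"ID2\"", "\"ID3\"").
expected_quoted = {'"ID1"', '"ID2"', '"ID3"'}

value_from_data = '"ID1"'  # a value that arrives with surrounding quotes intact

print(value_from_data in expected_plain)   # False: the quotes are part of the value
print(value_from_data in expected_quoted)  # True: elements match, quotes included
```

Comparison is character-for-character on the string values, so the two sets never intersect even though they "look" the same in source code.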