Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Ye Xianjin
The configuration of ‘…file.upload.path’ is wrong. It means a distributed FS path where Spark temporarily stores your archives/resources/jars before distributing them to the drivers/executors. For your case, you don't need to set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang wrote:

Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is failing.
Note: I have Spark (bitnami spark chart) installed on GKE using helm install.

Here is what is done:

1. Created a docker image using a Dockerfile:

```
FROM python:3.7-slim
RUN apt-get update && \
    apt-get install -y default-jre && \
    apt-get install -y openjdk-11-jre-headless && \
    apt-get clean
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample
COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
CMD ["pyspark"]
```

2. Simple pyspark application:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))
df.show(5, False)
```

3. Spark-submit command:

```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster --name pyspark-example --conf spark.kubernetes.container.image=pyspark-example:0.1 --conf spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error I get:

23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
	... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
	at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
	at 

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG:

Files (blocks, more specifically) have 3 copies on HDFS by default. So 3.8 * 3 =
11.4 TB.
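
To confirm where the space goes, and to reclaim some of it if fewer replicas are acceptable, a sketch like the following (run from the Scala shell on the cluster; the dataset path is a placeholder) inspects and optionally lowers the per-file replication factor:

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: /data/mydataset is a placeholder for wherever the tar files landed.
val fs = FileSystem.get(new Configuration())
val dataset = new Path("/data/mydataset")

// Print the replication factor of each file (3 by default on HDFS).
fs.listStatus(dataset).foreach(f => println(s"${f.getPath}: replication = ${f.getReplication}"))

// Optionally lower replication to 2 to reclaim space, trading away some fault tolerance.
fs.listStatus(dataset).foreach(f => fs.setReplication(f.getPath, 2.toShort))
```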

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:

> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
> Nothing else was stored in the HDFS, but after completing the download, the
> namenode page says that 11.59 Tb are in use. When I use hdfs du -h -s, I see
> that the dataset only takes up 3.8 Tb as expected. I navigated through the
> entire HDFS hierarchy from /, and don't see where the missing space is. Any
> ideas what is going on and how to rectify it?
> 
> I'm using the spark-ec2 script to launch, with the command
> 
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
> 
> and am not modifying any configuration files for Hadoop.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> (mailto:user-unsubscr...@spark.apache.org)
> For additional commands, e-mail: user-h...@spark.apache.org 
> (mailto:user-h...@spark.apache.org)
> 
> 




Re: An interesting and serious problem I encountered

2015-02-13 Thread Ye Xianjin
Hi, 

I believe SizeOf.jar may calculate the wrong size for you.
Spark has a utility class called SizeEstimator, located in
org.apache.spark.util.SizeEstimator, and someone extracted it out into
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scala
You can try that out in the Scala REPL.

The size of Array[Int](43) is 192 bytes (12-byte object header + 4-byte length
field + 43 * 4 = 172, rounded up to 176 bytes).
The size of (1, Array[Int](43)) is 240 bytes {
   Tuple2 object: 12-byte object header + 4 bytes for field _1 + 4 bytes for field _2,
rounded up to 24 bytes
   1: java.lang.Number is 12 bytes, rounded up to 16 bytes; java.lang.Integer adds a
4-byte int on top of those 16 bytes, rounded up to 24 bytes (Integer extends Number. I
thought Scala's Tuple2 would be specialized on Int so this would be 4 bytes, but it seems not.)
   Array: 192 bytes
}

So, 24 + 24 + 192 = 240 bytes.
This is my calculation based on the Spark SizeEstimator.

However, I am not sure how much an Integer occupies on a 64-bit JVM with
compressedOops on. It should be 12 + 4 = 16 bytes, which would mean the
SizeEstimator gives the wrong result. @Sean, what do you think?
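
If you want to double-check the numbers yourself, a REPL sketch like this works (note: in newer Spark releases SizeEstimator is a public @DeveloperApi; on older ones use the extracted version linked above instead):

```
import org.apache.spark.util.SizeEstimator

val arr = new Array[Int](43)
println(SizeEstimator.estimate(arr))            // 192, per the calculation above
println(SizeEstimator.estimate((1, arr.clone))) // 240, per the calculation above
```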
-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, February 13, 2015 at 2:26 PM, Landmark wrote:

 Hi foks,
 
 My Spark cluster has 8 machines, each of which has 377GB physical memory,
 and thus the total maximum memory can be used for Spark is more than
 2400+GB. In my program, I have to deal with 1 billion of (key, value) pairs,
 where the key is an integer and the value is an integer array with 43
 elements. Therefore, the memory cost of this raw dataset is [(1+43) *
 10^9 * 4] / (1024 * 1024 * 1024) = 164GB. 
 
 Since I have to use this dataset repeatedly, I have to cache it in memory.
 Some key parameter settings are: 
 spark.storage.fraction=0.6
 spark.driver.memory=30GB
 spark.executor.memory=310GB.
 
 But it failed on running a simple countByKey() and the error message is
 "java.lang.OutOfMemoryError: Java heap space". Does this mean a Spark
 cluster of 2400+GB memory cannot keep 164GB raw data in memory? 
 
 The codes of my program is as follows:
 
 def main(args: Array[String]):Unit = {
 val sc = new SparkContext(new SparkConf());
 
 val rdd = sc.parallelize(0 until 1000000000, 25600).map(i => (i, new
 Array[Int](43))).cache();
 println("The number of keys is " + rdd.countByKey());
 
 //some other operations following here ...
 }
 
 
 
 
 To figure out the issue, I evaluated the memory cost of key-value pairs and
 computed their memory cost using SizeOf.jar. The codes are as follows:
 
 val arr = new Array[Int](43);
 println(SizeOf.humanReadable(SizeOf.deepSizeOf(arr)));
 
 val tuple = (1, arr.clone);
 println(SizeOf.humanReadable(SizeOf.deepSizeOf(tuple)));
 
 The output is:
 192.0b
 992.0b
 
 
 *Hard to believe, but it is true!! This result means, to store a key-value
 pair, Tuple2 needs more than 5+ times memory than the simplest method with
 array. Even though it may take 5+ times memory, its size is less than
 1000GB, which is still much less than the total memory size of my cluster,
 i.e., 2400+GB. I really do not understand why this happened.*
 
 BTW, if the number of pairs is 1 million, it works well. If the arr contains
 only 1 integer, to store a pair, Tuples needs around 10 times memory.
 
 So I have some questions:
 1. Why does Spark choose such a poor data structure, Tuple2, for key-value
 pairs? Is there any better data structure for storing (key, value) pairs
 with less memory cost ?
 2. Given a dataset with size of M, in general Spark how many times of memory
 to handle it?
 
 
 Best,
 Landmark
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/An-interesting-and-serious-problem-I-encountered-tp21637.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: Can't run Spark java code from command line

2015-01-13 Thread Ye Xianjin
There is no binding issue here. Spark picks the right IP, 10.211.55.3, for you.
The printed message is just an indication.
However, I have no idea why spark-shell hangs or stops.
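
If you want to rule out hostname resolution entirely, a sketch like the one below pins the master and driver addresses explicitly (the IP and app name simply mirror the quoted thread; adjust them to your setup):

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://10.211.55.3:7077")   // use the IP rather than the hostname
  .setAppName("sparkbase")
  .set("spark.driver.host", "10.211.55.3") // pin the driver to the same interface
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).count())   // quick sanity check that jobs actually run
```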

Sent from my iPhone

 On Jan 14, 2015, at 5:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 It just a binding issue with the hostnames in your /etc/hosts file. You can 
 set SPARK_LOCAL_IP and SPARK_MASTER_IP in your conf/spark-env.sh file and 
 restart your cluster. (in that case the spark://myworkstation:7077 will 
 change to the ip address that you provided eg: spark://10.211.55.3).
 
 Thanks
 Best Regards
 
 On Tue, Jan 13, 2015 at 11:15 PM, jeremy p athomewithagroove...@gmail.com 
 wrote:
 Hello all,
 
 I wrote some Java code that uses Spark, but for some reason I can't run it 
 from the command line.  I am running Spark on a single node (my 
 workstation). The program stops running after this line is executed :
 
  SparkContext sparkContext = new SparkContext("spark://myworkstation:7077", 
  "sparkbase");
 
 When that line is executed, this is printed to the screen : 
 15/01/12 15:56:19 WARN util.Utils: Your hostname, myworkstation resolves to 
 a loopback address: 127.0.1.1; using 10.211.55.3 instead (on interface eth0)
 15/01/12 15:56:19 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 15/01/12 15:56:19 INFO spark.SecurityManager: Changing view acls to: 
 myusername
 15/01/12 15:56:19 INFO spark.SecurityManager: Changing modify acls to: 
 myusername
 15/01/12 15:56:19 INFO spark.SecurityManager: SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(myusername); users with modify permissions: Set(myusername)
 
 After it writes this to the screen, the program stops executing without 
 reporting an exception.
 
 What's odd is that when I run this code from Eclipse, the same lines are 
 printed to the screen, but the program keeps executing.
 
 Don't know if it matters, but I'm using the maven assembly plugin, which 
 includes the dependencies in the JAR.
 
 Here are the versions I'm using :
 Cloudera : 2.5.0-cdh5.2.1
 Hadoop : 2.5.0-cdh5.2.1
 HBase : HBase 0.98.6-cdh5.2.1
 Java : 1.7.0_65
 Ubuntu : 14.04.1 LTS
 Spark : 1.2
 


Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
commit c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
at sbt.IvySbt.withIvy(Ivy.scala:119)
at sbt.IvySbt.withIvy(Ivy.scala:116)
at sbt.IvySbt$Module.withModule(Ivy.scala:147)
at sbt.IvyActions$.updateEither(IvyActions.scala:156)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1282)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1279)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1309)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$84.apply(Defaults.scala:1307)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1312)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1306)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1324)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1264)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1242)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: 
org.scalamacros#quasiquotes_2.11;2.0.1: not found



-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, November 18, 2014 at 3:27 PM, Prashant Sharma wrote:

 It is safe in the sense we would help you with the fix if you run into 
 issues. I have used it, but since I worked on the patch the opinion can be 
 biased. I am using scala 2.11 for day to day development. You should checkout 
 the build instructions here : 
 https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md 
 
 Prashant Sharma
 
 
 
 On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com 
 (mailto:jianshi.hu...@gmail.com) wrote:
  Any notable issues for using Scala 2.11? Is it stable now?
  
  Or can I use Scala 2.11 in my spark application and use Spark dist build 
  with 2.10 ?
  
  I'm looking forward to migrate to 2.11 for some quasiquote features. 
  Couldn't make it run in 2.10...
  
  Cheers,-- 
  Jianshi Huang
  
  LinkedIn: jianshi
  Twitter: @jshuang
   Github & Blog: http://huangjs.github.com/
 



Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I 
believe many people don't subscribe to the incubator mailing list now.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote:

 Hi, 
 
 I am using spark 1.0.0. The bug is fixed by 1.0.1.
 
 Hao
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
|  Do the two mailing lists share messages ?
I don't think so. I didn't receive this message from the user list. I am not 
at Databricks, so I can't answer your other questions. Maybe Davies Liu 
dav...@databricks.com can answer you?

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:

 Hi, Xianjin
 
 I checked user@spark.apache.org (mailto:user@spark.apache.org), and found my 
 post there:
 http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
 
 I am using nabble to send this mail, which indicates that the mail will be
 sent from my email address to the u...@spark.incubator.apache.org 
 (mailto:u...@spark.incubator.apache.org) mailing
 list.
 
 Do the two mailing lists share messages ?
 
 Do we have a nabble interface for user@spark.apache.org 
 (mailto:user@spark.apache.org) mail list ?
 
 Thank you.
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as being sent to the user 
list. Maybe it is because I also subscribe to the incubator mailing list? I do see 
mails sent to the incubator list that no one replies to. I thought it was because 
people don't subscribe to the incubator list now.

-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday, September 11, 2014 at 12:12 AM, Davies Liu wrote:

 I think the mails to spark.incubator.apache.org 
 (http://spark.incubator.apache.org) will be forwarded to
 spark.apache.org (http://spark.apache.org).
 
 Here is the header of the first mail:
 
 from: redocpot julien19890...@gmail.com (mailto:julien19890...@gmail.com)
 to: u...@spark.incubator.apache.org (mailto:u...@spark.incubator.apache.org)
 date: Mon, Sep 8, 2014 at 7:29 AM
 subject: groupBy gives non deterministic results
 mailing list: user.spark.apache.org (http://user.spark.apache.org) Filter 
 messages from this mailing list
 mailed-by: spark.apache.org (http://spark.apache.org)
 
 I only subscribe spark.apache.org (http://spark.apache.org), and I do see all 
 the mails from he.
 
 On Wed, Sep 10, 2014 at 6:29 AM, Ye Xianjin advance...@gmail.com 
 (mailto:advance...@gmail.com) wrote:
  | Do the two mailing lists share messages ?
  I don't think so. I didn't receive this message from the user list. I am
  not in databricks, so I can't answer your other questions. Maybe Davies Liu
  dav...@databricks.com (mailto:dav...@databricks.com) can answer you?
  
  --
  Ye Xianjin
  Sent with Sparrow
  
  On Wednesday, September 10, 2014 at 9:05 PM, redocpot wrote:
  
  Hi, Xianjin
  
  I checked user@spark.apache.org (mailto:user@spark.apache.org), and found 
  my post there:
  http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser
  
  I am using nabble to send this mail, which indicates that the mail will be
  sent from my email address to the u...@spark.incubator.apache.org 
  (mailto:u...@spark.incubator.apache.org) mailing
  list.
  
  Do the two mailing lists share messages ?
  
  Do we have a nabble interface for user@spark.apache.org 
  (mailto:user@spark.apache.org) mail list ?
  
  Thank you.
  
  
  
  
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13876.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com 
  (http://Nabble.com).
  
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
  (mailto:user-unsubscr...@spark.apache.org)
  For additional commands, e-mail: user-h...@spark.apache.org 
  (mailto:user-h...@spark.apache.org)
  
 
 
 




Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And 
what's your environment setup? Single node or cluster?
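
For example, a minimal harness along these lines (a sketch for spark-shell; the sizes and key skew are arbitrary) is the kind of thing that helps narrow it down:

```
// Repeatedly count the same persisted, grouped RDD; the numbers should all match.
val pairs = sc.parallelize(0L until 5000000L, 200).map(i => (i % 100000L, i))
val grouped = pairs.groupByKey().persist()
val counts = (1 to 8).map(_ => grouped.count())
println(counts.mkString(", "))
```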

Sent from my iPhone

 On Sep 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote:
 
 Hi,
 
 I have a key-value RDD called rdd below. After a groupBy, I tried to count
 rows.
 But the result is not unique, somehow non deterministic.
 
 Here is the test code:
 
  val step1 = ligneReceipt_cleTable.persist
  val step2 = step1.groupByKey
 
  val s1size = step1.count
  val s2size = step2.count
 
  val t = step2 // rdd after groupBy
 
  val t1 = t.count
  val t2 = t.count
  val t3 = t.count
  val t4 = t.count
  val t5 = t.count
  val t6 = t.count
  val t7 = t.count
  val t8 = t.count
 
   println("s1size = " + s1size)
   println("s2size = " + s2size)
   println("1 = " + t1)
   println("2 = " + t2)
   println("3 = " + t3)
   println("4 = " + t4)
   println("5 = " + t5)
   println("6 = " + t6)
   println("7 = " + t7)
   println("8 = " + t8)
 
 Here are the results:
 
 s1size = 5338864
 s2size = 5268001
 1 = 5268002
 2 = 5268001
 3 = 5268001
 4 = 5268002
 5 = 5268001
 6 = 5268002
 7 = 5268002
 8 = 5268001
 
 Even if the difference is just one row, that's annoying.  
 
 Any idea ?
 
 Thank you.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
What did you see in the log? Was there anything related to MapReduce?
Can you log into your HDFS (data) node, use jps to list all Java processes, and 
confirm whether there is a TaskTracker process (or NodeManager) running alongside 
the DataNode process?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

 Still no luck, even when running stop-all.sh followed by 
 start-all.sh.
 
 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com (mailto:nicholas.cham...@gmail.com) wrote:
  Tomer,
  
  Did you try start-all.sh? It worked for me the last 
  time I tried using
  distcp, and it worked for this guy too.
  
  Nick
  
  
  On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com 
  (mailto:tomer@gmail.com) wrote:
   
    ~/ephemeral-hdfs/sbin/start-mapred.sh does not 
    exist on spark-1.0.2;
   
    I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
    ~/ephemeral-hdfs/sbin/start-dfs.sh, but still 
    getting the same error
   when trying to run distcp:
   
   ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
   
   java.io.IOException: Cannot initialize Cluster. Please check your
    configuration for mapreduce.framework.name and the correspond server
   addresses.
   
   at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
   
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
   
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
   
   at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
   
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
   
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
   
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
   
   Any idea?
   
   Thanks!
   Tomer
   
   On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com 
   (mailto:rosenvi...@gmail.com) wrote:
If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com 
(mailto:tomer@gmail.com)
wrote:
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to
 distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
  configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at 
 org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 


   
   
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)
   
  
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
Well, this means you didn't start a compute cluster, most likely because the wrong 
value of mapreduce.jobtracker.address caused the slave node to fail to start the 
node manager. (I am not familiar with the ec2 script, so I don't know whether the 
slave node has a node manager installed or not.)
Can you check the Hadoop daemon log on the slave node to see whether the nodemanager 
started but failed, or there was no nodemanager to start? The log file location 
defaults to /var/log/hadoop-xxx if my memory is correct.

Sent from my iPhone

 On Sep 9, 2014, at 0:08, Tomer Benyamini tomer@gmail.com wrote:
 
 No tasktracker or nodemanager. This is what I see:
 
 On the master:
 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
 org.apache.hadoop.hdfs.server.namenode.NameNode
 
 On the data node (slave):
 
 org.apache.hadoop.hdfs.server.datanode.DataNode
 
 
 
 On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote:
 what did you see in the log? was there anything related to mapreduce?
 can you log into your hdfs (data) node, use jps to list all java process and
 confirm whether there is a tasktracker process (or nodemanager) running with
 datanode process
 
 --
 Ye Xianjin
 Sent with Sparrow
 
 On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:
 
 Still no luck, even when running stop-all.sh followed by start-all.sh.
 
 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 
 Tomer,
 
 Did you try start-all.sh? It worked for me the last time I tried using
 distcp, and it worked for this guy too.
 
 Nick
 
 
 On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:
 
 
 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
 
 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 Any idea?
 
 Thanks!
 Tomer
 
 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
 
 If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.
 
 On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com
 wrote:
 
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to
 distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail

Re: Too many open files

2014-08-29 Thread Ye Xianjin
Oops, the last reply didn't go to the user list. Mail app's fault.

Shuffling happens across the cluster, so you need to change all the nodes in the 
cluster.
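
Raising the OS limit (ulimit / limits.conf) on every node is the main fix; on the Spark side, the knob below also cuts the number of shuffle files (a sketch, assuming Spark 1.x, where the hash shuffle writes one file per map task per reducer; the option was removed in later releases):

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("too-many-open-files")
  .set("spark.shuffle.consolidateFiles", "true") // fewer shuffle files per executor
val sc = new SparkContext(conf)
```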



Sent from my iPhone

 On Aug 30, 2014, at 3:10, Sudha Krishna skrishna...@gmail.com wrote:
 
 Hi,
 
 Thanks for your response. Do you know if I need to change this limit on all 
 the cluster nodes or just the master?
 Thanks
 
 On Aug 29, 2014 11:43 AM, Ye Xianjin advance...@gmail.com wrote:
 1024 for the open-file limit is most likely too small for Linux 
 machines in production. Try setting it to 65536 or unlimited if you can. The "too 
 many open files" error occurs because there are a lot of shuffle files (if I'm 
 wrong, please correct me):
 
 Sent from my iPhone
 
  On Aug 30, 2014, at 2:06, SK skrishna...@gmail.com wrote:
 
  Hi,
 
  I am having the same problem reported by Michael. I am trying to open 30
  files. ulimit -n  shows the limit is 1024. So I am not sure why the program
  is failing with  Too many open files error. The total size of all the 30
  files is 230 GB.
  I am running the job on a cluster with 10 nodes, each having 16 GB. The
  error appears to be happening at the distinct() stage.
 
  Here is my program. In the following code, are all the 10 nodes trying to
  open all of the 30 files or are the files distributed among the 30 nodes?
 
  val baseFile = "/mapr/mapr_dir/files_2013apr*"
  val x = sc.textFile(baseFile).map { line =>
    val fields = line.split("\t")
    (fields(11), fields(6))
  }.distinct().countByKey()
  val xrdd = sc.parallelize(x.toSeq)
  xrdd.saveAsTextFile(...)
 
  Instead of using the glob *, I guess I can try using a for loop to read the
  files one by one if that helps, but not sure if there is a more efficient
  solution.
 
  The following is the error transcript:
 
  Job aborted due to stage failure: Task 1.0:201 failed 4 times, most recent
  failure: Exception failure in TID 902 on host 192.168.13.11:
  java.io.FileNotFoundException:
  /tmp/spark-local-20140829131200-0bb7/08/shuffle_0_201_999 (Too many open
  files)
  java.io.FileOutputStream.open(Native Method)
  java.io.FileOutputStream.init(FileOutputStream.java:221)
  org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
  org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
  org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
  org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
  scala.collection.Iterator$class.foreach(Iterator.scala:727)
  org.apache.spark.util.collection.AppendOnlyMap$$anon$1.foreach(AppendOnlyMap.scala:159)
  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
  org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  org.apache.spark.scheduler.Task.run(Task.scala:51)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:744) Driver stacktrace:
 
 
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-tp1464p13144.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 


Re: defaultMinPartitions in textFile

2014-07-21 Thread Ye Xianjin
Well, I think you missed this line of code in SparkContext.scala,
lines 1242-1243 (master):
 /** Default min number of partitions for Hadoop RDDs when not given by user */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

So defaultMinPartitions will be 2 unless defaultParallelism is less 
than 2...
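
So the value you saw is just that lower bound of 2. If you want more input partitions, pass them explicitly or raise the parallelism yourself, e.g. (a sketch, assuming an existing SparkContext `sc` as in spark-shell; the path is a placeholder):

```
import org.apache.spark.SparkConf

val rdd = sc.textFile("hdfs:///path/to/input", 12)  // ask for at least 12 input partitions

// or, before creating the context:
val conf = new SparkConf().set("spark.default.parallelism", "12")
```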


--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote:

 Hi,  
 I started to use spark on yarn recently and found a problem while 
 tuning my program.
   
 When SparkContext is initialized as sc and ready to read text file from hdfs, 
 the textFile(path, defaultMinPartitions) method is called.
 I traced down the second parameter in the spark source code and finally found 
 this:
   conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 
 2))  in  CoarseGrainedSchedulerBackend.scala
   
 I do not specify the property “spark.default.parallelism” anywhere so the 
 getInt will return value from the larger one between totalCoreCount and 2.
   
 When I submit the application using spark-submit and specify the parameter: 
 --num-executors  2   --executor-cores 6, I suppose the totalCoreCount will be 
  
 2*6 = 12, so defaultMinPartitions will be 12.
   
 But when I print the value of defaultMinPartitions in my program, I still get 
 2 in return,  How does this happen, or where do I make a mistake?
  
  
  




Re: Where to set proxy in order to run ./install-dev.sh for SparkR

2014-07-02 Thread Ye Xianjin
You can try setting your HTTP_PROXY environment variable.

export HTTP_PROXY=host:port

But I don't use Maven. If the env variable doesn't work, please search Google 
for "maven proxy"; I am sure there will be a lot of related results.

Sent from my iPhone

 On Jul 2, 2014, at 19:04, Stuti Awasthi stutiawas...@hcl.com wrote:
 
 Hi,
  
 I wanted to build SparkR from source but running the script behind the proxy. 
 Where shall I set proxy host and port in order to build the source. Issue is 
 not able to download dependencies from Maven
  
 Thanks
 Stuti Awasthi
  
 
 
 


Re: Set comparison

2014-06-16 Thread Ye Xianjin
If you want a string containing quotes, you have to escape them with '\'. That's exactly 
what you did in the modified version.
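
For reference, a small sketch of the difference:

```
// The quote characters must be escaped to become part of the string's value.
val expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"")
println(expected_res.contains("\"ID1\"")) // true
println(expected_res.contains("ID1"))     // false: the quotes are part of the value
```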

Sent from my iPhone

 On Jun 17, 2014, at 5:43, SK skrishna...@gmail.com wrote:
 
 In Line 1, I have expected_res as a set of strings with quotes. So I thought
 it would include the quotes during comparison.
 
  Anyway I modified expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and
  that seems to work.
 
 thanks.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.