Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Ye Xianjin
The configuration of ‘…file.upload.path’ is wrong. It specifies a distributed file system path where your archives/resources/jars are stored temporarily before Spark distributes them to drivers/executors. For your case, you don’t need to set this configuration. Sent from my iPhone On Feb 14, 2023, at 5:43 AM, karan alang

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Ye Xianjin
Hi AlexG: Files (blocks, more specifically) have 3 copies on HDFS by default, so 3.8 * 3 = 11.4 TB. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote: > I downloaded a 3.8 T dataset from S3 to a freshly launched sp
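
The arithmetic in this reply can be sketched as below. This is only an illustration, assuming the default HDFS replication factor of 3 mentioned in the reply; the function name is hypothetical and it does not try to explain the remaining gap between 11.4 and the reported 11.59.

```python
def hdfs_raw_usage_tb(logical_tb: float, replication: int = 3) -> float:
    """Raw disk usage when every block is stored `replication` times."""
    return logical_tb * replication


# 3.8 TB of data with three copies of each block
print(round(hdfs_raw_usage_tb(3.8), 2))
```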

Re: An interesting and serious problem I encountered

2015-02-13 Thread Ye Xianjin
. This is my calculation based on the Spark SizeEstimator. However, I am not sure how much memory an Integer occupies on a 64-bit JVM with compressed oops on. It should be 12 + 4 = 16 bytes, which would mean the SizeEstimator gives the wrong result. @Sean what do you think? -- Ye Xianjin Sent with Sparrow
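
The 12 + 4 = 16 figure can be reproduced with a rough sketch. It assumes a HotSpot-style layout (a 12-byte object header with compressed oops, 16 bytes without, and 8-byte object alignment); the helper names are hypothetical, and the sketch says nothing about what SizeEstimator itself reports.

```python
def align_up(size: int, alignment: int = 8) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def boxed_int_size(header_bytes: int) -> int:
    """Object header plus one 4-byte int field, padded to 8-byte alignment."""
    return align_up(header_bytes + 4)


print(boxed_int_size(12))  # assumed header size with compressed oops
print(boxed_int_size(16))  # assumed header size without compressed oops
```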

Re: Can't run Spark java code from command line

2015-01-13 Thread Ye Xianjin
There is no binding issue here. Spark picks the right IP, 10.211.55.3, for you. The printed message is just an indication. However, I have no idea why spark-shell hangs or stops. Sent from my iPhone On Jan 14, 2015, at 5:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It just a binding issue with the

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found [error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, November 18, 2014

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I believe many people don't subscribe to the incubator mailing list now. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote: Hi, I am using

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
| Do the two mailing lists share messages? I don't think so. I didn't receive this message from the user list. I am not at Databricks, so I can't answer your other questions. Maybe Davies Liu dav...@databricks.com can answer you? -- Ye Xianjin Sent with Sparrow (http

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as sent to the user list. Maybe because I also subscribe to the incubator mailing list? I do see mails sent to the incubator mailing list that no one replies to. I thought it was because people don't subscribe to the incubator list now. -- Ye Xianjin Sent

Re: groupBy gives non deterministic results

2014-09-09 Thread Ye Xianjin
Can you provide a small sample or test data that reproduces this problem? And what's your env setup? Single node or cluster? Sent from my iPhone On Sep 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote: Hi, I have a key-value RDD called rdd below. After a groupBy, I tried to count

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
What did you see in the log? Was there anything related to MapReduce? Can you log into your HDFS (data) node, use jps to list all Java processes, and confirm whether there is a tasktracker process (or nodemanager) running alongside the datanode process? -- Ye Xianjin Sent with Sparrow (http

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
): org.apache.hadoop.hdfs.server.datanode.DataNode On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote: what did you see in the log? was there anything related to mapreduce? can you log into your hdfs (data) node, use jps to list all java process and confirm whether there is a tasktracker

Re: Too many open files

2014-08-29 Thread Ye Xianjin
need to change this limit on all the cluster nodes or just the master? Thanks On Aug 29, 2014 11:43 AM, Ye Xianjin advance...@gmail.com wrote: A limit of 1024 open files is most likely too small for Linux machines in production. Try setting it to 65536 or unlimited if you can. The too
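
One way to check the limit discussed here on a given node is sketched below, using Python's standard `resource` module (Unix only). The 65536 threshold is the value suggested in the quoted reply; the function names are hypothetical.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def looks_too_low(soft: int, recommended: int = 65536) -> bool:
    """True if the soft limit is below the recommended value."""
    if soft == resource.RLIM_INFINITY:
        return False  # unlimited
    return soft < recommended


soft, hard = nofile_limits()
print(f"soft={soft} hard={hard} too_low={looks_too_low(soft)}")
```

The limit is per process and per machine, so a check like this would have to run on each node of interest.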

Re: defaultMinPartitions in textFile

2014-07-21 Thread Ye Xianjin
the defaultParallelism is less than 2... -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote: Hi, I started to use spark on yarn recently and found a problem while tuning my program. When SparkContext is initialized
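
For context, Spark's `SparkContext.defaultMinPartitions` is defined as the smaller of `defaultParallelism` and 2. A minimal Python mirror of that rule (the function name is only for illustration):

```python
def default_min_partitions(default_parallelism: int) -> int:
    """Mirror of Spark's rule: min(defaultParallelism, 2)."""
    return min(default_parallelism, 2)


for parallelism in (1, 2, 8, 200):
    print(parallelism, "->", default_min_partitions(parallelism))
```

So the value only drops below 2 when `defaultParallelism` itself is below 2; a larger partition count has to be requested explicitly through the `minPartitions` argument of `textFile`.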

Re: Where to set proxy in order to run ./install-dev.sh for SparkR

2014-07-02 Thread Ye Xianjin
You can try setting your HTTP_PROXY environment variable. export HTTP_PROXY=host:port But I don't use Maven. If the env variable doesn't work, please search Google for "maven proxy". I am sure there will be a lot of related results. Sent from my iPhone On Jul 2, 2014, at 19:04, Stuti Awasthi

Re: Set comparison

2014-06-16 Thread Ye Xianjin
If you want a string with quotes, you have to escape them with '\'. It's exactly what you did in the modified version. Sent from my iPhone On Jun 17, 2014, at 5:43, SK skrishna...@gmail.com wrote: In Line 1, I have expected_res as a set of strings with quotes. So I thought it would include the
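
The point about escaping can be illustrated with a small sketch. The values are hypothetical, and Python is used here although the original thread's code is not shown in this snippet; the same idea applies to any language with backslash-escaped string literals.

```python
# Without escaping, the quotes only delimit the literal and are not part of it.
plain = {"ID1", "ID2"}

# With \" the quote characters become part of each string.
quoted = {"\"ID1\"", "\"ID2\""}

print(plain == quoted)            # the two sets hold different strings
print(sorted(len(s) for s in plain))
print(sorted(len(s) for s in quoted))
```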