RE: Broadcasting a parquet file using spark and python

2015-12-07 Thread Shuai Zheng
join but I want to use broadcast join). Or is there any ticket or roadmap for this feature? Regards, Shuai From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Saturday, December 05, 2015 4:11 PM To: Shuai Zheng Cc: Jitesh chandra Mishra; user Subject: Re: Broadcasting

RE: Broadcasting a parquet file using spark and python

2015-12-04 Thread Shuai Zheng
Hi all, Sorry to re-open this thread. I have a similar issue: one big parquet file left-outer-joined with quite a few smaller parquet files. But the run is extremely slow and sometimes OOMs (with 300M , I have two questions here: 1, If I use outer join, will Spark SQL auto use
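A minimal Scala sketch of the explicit broadcast hint, assuming the Spark 1.5 DataFrame API (paths and the join column are illustrative): broadcast() marks the small side of the join instead of relying only on spark.sql.autoBroadcastJoinThreshold.

    import org.apache.spark.sql.functions.broadcast

    val big   = sqlContext.read.parquet("s3n://bucket/big.parquet")    // hypothetical paths
    val small = sqlContext.read.parquet("s3n://bucket/small.parquet")
    // left outer join with the small side explicitly broadcast
    val joined = big.join(broadcast(small), big("key") === small("key"), "left_outer")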

RE: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-24 Thread Shuai Zheng
[mailto:jonathaka...@gmail.com] Sent: Thursday, November 19, 2015 6:54 PM To: Shuai Zheng Cc: user Subject: Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node I don't know if this actually has anything to do with why your job is hanging, but since you are using EMR you

Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Shuai Zheng
Hi All, I am facing a very weird case. I have already simplified the scenario as much as possible so everyone can reproduce it. My env: AWS EMR 4.1.0, Spark 1.5. My code runs without any problem in local mode, and it has no problem when it runs on an EMR cluster with one

RE: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-11-04 Thread Shuai Zheng
of Spark. So I want to know whether there is any new setting I should set in Spark 1.5 to make it work? Regards, Shuai From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, November 04, 2015 3:22 PM To: user@spark.apache.org Subject: [Spark 1.5]: Exception in thread "broadcast-hash-j

[Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space

2015-11-04 Thread Shuai Zheng
Hi All, I have a program which runs a somewhat complex piece of business logic (a join) in Spark, and I get the below exception. I am running on Spark 1.5, with parameters: spark-submit --deploy-mode client --executor-cores=24 --driver-memory=2G --executor-memory=45G --class . Some other setup:
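A hedged sketch of one common workaround for the broadcast-hash-join OOM on the 1.5 line: disable the size-based automatic broadcast so the plan falls back to a shuffled join (this assumes an existing SQLContext named sqlContext).

    // -1 disables the auto broadcast; the join then no longer builds
    // the whole small side as an in-memory hash table
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")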

How to Take the whole file as a partition

2015-09-03 Thread Shuai Zheng
Hi All, I have 1000 files, from 500M to 1-2GB each at this moment, and I want Spark to read them as partitions at the file level, which means turning file splitting off. I know there are some solutions, but they are not very good in my case: 1, I can't use the wholeTextFiles method, because my file is
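A hedged sketch of one way to get file-level partitions, assuming the files are line-oriented text and the custom input format is compiled into the job jar (class name and input path are illustrative): subclass the Hadoop input format, turn splitting off, and load it with newAPIHadoopFile.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Each input file becomes exactly one split, hence one Spark partition.
    class WholeFileTextInputFormat extends TextInputFormat {
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    }

    val perFileRdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",                  // hypothetical input directory
      classOf[WholeFileTextInputFormat],
      classOf[LongWritable],
      classOf[Text])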

RE: How to Take the whole file as a partition

2015-09-03 Thread Shuai Zheng
, 2015 11:07 AM To: Shuai Zheng Cc: user Subject: Re: How to Take the whole file as a partition Your situation is special. It seems to me Spark may not fit well in your case. You want to process the individual files (500M~2G) as a whole, and you want good performance. You may want to write your

org.apache.hadoop.security.AccessControlException: Permission denied when access S3

2015-08-20 Thread Shuai Zheng
Hi All, I am trying to access a file on S3 through the Hadoop FileSystem API. Below is my code: Configuration hadoopConf = ctx.hadoopConfiguration(); hadoopConf.set("fs.s3n.awsAccessKeyId", this.getAwsAccessKeyId());
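For reference, a minimal Scala sketch of the same configuration, assuming the s3n (NativeS3FileSystem) connector and credentials taken from environment variables; the bucket and path are illustrative.

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val lines = sc.textFile("s3n://my-bucket/path/part-*")  // hypothetical location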

RE: [SPARK-6330] 1.4.0/1.5.0 Bug to access S3 -- AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyI

2015-06-10 Thread Shuai Zheng
to replicate my issue locally (the code doesn’t need to run on EC2, I run it directly from my local windows pc). Regards, Shuai From: Aaron Davidson [mailto:ilike...@gmail.com] Sent: Wednesday, June 10, 2015 12:28 PM To: Shuai Zheng Subject: Re: [SPARK-6330] 1.4.0/1.5.0 Bug to access

[SPARK-6330] 1.4.0/1.5.0 Bug to access S3 -- AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or

2015-06-09 Thread Shuai Zheng
Hi All, I have some code to access s3 from Spark. The code is as simple as: JavaSparkContext ctx = new JavaSparkContext(sparkConf); Configuration hadoopConf = ctx.hadoopConfiguration(); // aws.secretKey=Zqhjim3GB69hMBvfjh+7NX84p8sMF39BHfXwO3Hs

[SQL][1.3.1][JAVA]Use UDF in DataFrame join

2015-04-27 Thread Shuai Zheng
Hi All, I want to ask how to use a UDF in the join function on a DataFrame. It always gives me a "cannot resolve the column name" error. Can anyone give me an example of how to run this in Java? My code is like: edmData.join(yb_lookup,
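A hedged Scala sketch of one way to use a UDF inside a DataFrame join condition in the 1.3/1.4 API, assuming edmData and yb_lookup are the DataFrames from the question (the column name and normalization logic are illustrative): build the UDF with functions.udf and apply it to columns in the join expression.

    import org.apache.spark.sql.functions.udf

    val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

    val joined = edmData.join(
      yb_lookup,
      normalize(edmData("zipcode")) === normalize(yb_lookup("zipcode")))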

[SQL][1.3.1][JAVA] UDF in java cause Task not serializable

2015-04-27 Thread Shuai Zheng
Hi All, Basically I try to define a simple UDF and use it in a query, but it gives me "Task not serializable". public void test() { RiskGroupModelDefinition model = registeredRiskGroupMap.get(this.modelId); RiskGroupModelDefinition edm =
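A hedged sketch of the usual fix for "Task not serializable" in this situation: copy the fields the UDF needs into local variables first, so the closure captures plain serializable values rather than `this`, the enclosing non-serializable class. Names below are illustrative.

    import org.apache.spark.sql.SQLContext

    def registerUdfs(sqlContext: SQLContext, modelId: String): Unit = {
      val localModelId = modelId  // local copy: the UDF closure no longer references `this`
      sqlContext.udf.register("tagWithModel", (key: String) => s"$localModelId:$key")
    }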

RE: Slower performance when bigger memory?

2015-04-27 Thread Shuai Zheng
PM To: Dean Wampler Cc: Shuai Zheng; user@spark.apache.org Subject: Re: Slower performance when bigger memory? FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for more/smaller executors. Another observation was that one large executor results in less overall read throughput from

Slower performance when bigger memory?

2015-04-23 Thread Shuai Zheng
Hi All, I am running some benchmarks on an r3.8xlarge instance. I have a cluster with one master (no executor on it) and one slave (r3.8xlarge). My job has 1000 tasks in stage 0. The r3.8xlarge has 244G memory and 32 cores. If I create 4 executors, each with 8 cores + 50G memory, each task

RE: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Shuai Zheng
Got it. Thanks! :) From: Yin Huai [mailto:yh...@databricks.com] Sent: Thursday, April 23, 2015 2:35 PM To: Shuai Zheng Cc: user Subject: Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key Hi Shuai, You can use as to create a table alias

Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Shuai Zheng
Hi All, I use 1.3.1. When I have two DFs and join them on a key with the same name, afterwards I can't get the common key by name. Basically: select * from t1 inner join t2 on t1.col1 = t2.col1. And I am using the DataFrame API only, with SQLContext, not HiveContext. DataFrame df3 =
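A hedged sketch of the alias approach mentioned in the reply, assuming Spark 1.3.x DataFrames (df1/df2 and the column names are illustrative): give each side a table alias so the shared key can be referenced unambiguously afterwards.

    import sqlContext.implicits._

    val joined  = df1.as("t1").join(df2.as("t2"), $"t1.col1" === $"t2.col1", "inner")
    val withKey = joined.select($"t1.col1", $"t2.col2")  // disambiguate the common key via the alias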

RE: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Shuai Zheng
Below is my code to access s3n without problems (only for 1.3.1; there is a bug in 1.3.0): Configuration hadoopConf = ctx.hadoopConfiguration(); hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");

RE:RE:maven compile error

2015-04-21 Thread Shuai Zheng
I have a similar issue (I failed on the spark-core project but with the same exception as you). Then I followed the steps below (I am working on Windows): delete the Maven repository and re-download all dependencies. The issue looks like a corrupted jar that can't be opened by Maven. Other than this,

Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] when create context

2015-04-08 Thread Shuai Zheng
Hi All, In some cases I get the below exception when I run Spark in local mode (I haven't seen this on a cluster). This is weird and also affects my local unit test cases (it does not always happen, but usually once per 4-5 runs). From the stack, it looks like the error happens when creating the context,

RE: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-07 Thread Shuai Zheng
, have no time to dig it out. But if you want me to provide more information, I will be happy to do that. Regards, Shuai -Original Message- From: Bozeman, Christopher [mailto:bozem...@amazon.com] Sent: Wednesday, April 01, 2015 4:59 PM To: Shuai Zheng; 'Sean Owen' Cc: 'Akhil Das'; user

RE: [BUG]Broadcast value return empty after turn to org.apache.spark.serializer.KryoSerializer

2015-04-07 Thread Shuai Zheng
, not NULL), I am not sure why this happens (the code runs without any problem with the default Java serializer). So I think this is a bug, but I am not sure whether it is a bug in Spark or a bug in Kryo. Regards, Shuai From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, April 06, 2015 5

set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-07 Thread Shuai Zheng
Hi All, I am a bit confused about spark.storage.memoryFraction. This is used to set the memory area for RDD usage; does that mean only cached and persisted RDDs? So if my program has no cached RDDs at all (meaning I have no .cache() or .persist() call on any RDD), then I can set this
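A hedged sketch for the 1.x static memory model (the values are illustrative, not a recommendation): with no cached or persisted RDDs the storage fraction can be shrunk, but broadcast blocks also go through the block manager, so keeping a small non-zero fraction is the safer assumption.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.1")  // default is 0.6 in Spark 1.x
      .set("spark.shuffle.memoryFraction", "0.6")  // hand the reclaimed share to shuffle (default 0.2)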

RE: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Shuai Zheng
[mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, April 01, 2015 2:40 AM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: --driver-memory parameter doesn't work for spark-submmit on yarn? Once you submit the job do a ps aux | grep spark-submit and see how much is the heap space

RE: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-01 Thread Shuai Zheng
, Shuai -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 01, 2015 10:51 AM To: Shuai Zheng Cc: Akhil Das; user@spark.apache.org Subject: Re: --driver-memory parameter doesn't work for spark-submmit on yarn? I feel like I recognize that problem, and it's

--driver-memory parameter doesn't work for spark-submmit on yarn?

2015-03-31 Thread Shuai Zheng
Hi All, Below is my shell script: /home/hadoop/spark/bin/spark-submit --driver-memory=5G --executor-memory=40G --master yarn-client --class com.***.FinancialEngineExecutor /home/hadoop/lib/my.jar s3://bucket/vriscBatchConf.properties My driver will load some resources and then
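A hedged way to see which value actually took effect: print the driver JVM's max heap at startup. In yarn-client mode the driver JVM is the spark-submit process itself, so --driver-memory has to be supplied on the command line (or via spark-defaults.conf); setting spark.driver.memory programmatically after the JVM has started has no effect.

    // put this at the top of the driver's main() to verify the heap size
    println(s"Driver max heap: ${Runtime.getRuntime.maxMemory / 1024 / 1024} MB")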

When will 1.3.1 release?

2015-03-30 Thread Shuai Zheng
Hi All, I am waiting for Spark 1.3.1 to fix the bug in working with the S3 file system. Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because there is a jar compatibility issue with the AWS SDK. Regards, Shuai

What is the jvm size when start spark-submit through local mode

2015-03-20 Thread Shuai Zheng
Hi, I am curious: when I start a Spark program in local mode, which parameter decides the JVM memory size for the executor? In theory it should be: --executor-memory 20G. But I remember local mode only starts the Spark executor in the same process as the driver, so it should be:

RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-20 Thread Shuai Zheng
[mailto:ilike...@gmail.com] Sent: Tuesday, March 17, 2015 3:06 PM To: Imran Rashid Cc: Shuai Zheng; user@spark.apache.org Subject: Re: Spark will process _temporary folder on S3 is very slow and always cause failure Actually, this is the more relevant JIRA (which is resolved): https

RE: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Shuai Zheng
Below is the output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967947
max locked memory (kbytes, -l) 64
max memory

com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Shuai Zheng
Hi All, I try to run a simple sort-by on 1.2.1, and it always gives me the below two errors: 1, 15/03/20 17:48:29 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 35, ip-10-169-217-47.ec2.internal): java.io.FileNotFoundException:
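A hedged sketch of one common mitigation on the 1.2.x line, besides raising the OS limit with ulimit -n: consolidate hash-shuffle output files so each executor keeps far fewer files open at once (only relevant when the hash shuffle manager is in use).

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.consolidateFiles", "true")  // fewer simultaneous shuffle files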

[SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Shuai Zheng
Hi All, I am running Spark 1.2.1 and the AWS SDK. To make sure the AWS SDK is compatible with httpclient 4.2 (which I assume Spark uses?), I have already downgraded it to version 1.9.0. But even then, I still get an error: Exception in thread main java.lang.NoSuchMethodError:

RE: Process time series RDD after sortByKey

2015-03-16 Thread Shuai Zheng
Hi Imran, I am a bit confused here. Assume I have RDD a with 1000 partitions which has also been sorted. When creating RDD b (with 20 partitions), how can I make sure partitions 1-50 of RDD a map to the 1st partition of RDD b? I don't see any control code/logic here? Your code below:

RE: [SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.

2015-03-16 Thread Shuai Zheng
...@gmail.com] Sent: Monday, March 16, 2015 1:06 PM To: Shuai Zheng Cc: user Subject: Re: [SPARK-3638 ] java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator. From my local maven repo: $ jar tvf ~/.m2/repository/org/apache/httpcomponents/httpclient/4.2.5

sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Shuai Zheng
Hi All, I just upgraded the system to version 1.3.0, but now sqlContext.parquetFile doesn't work with s3n. I have tested the same code with 1.2.1 and it works. A simple test run in spark-shell: val parquetFile = sqlContext.parquetFile("s3n:///test/2.parq")

RE: sqlContext.parquetFile doesn't work with s3n in version 1.3.0

2015-03-16 Thread Shuai Zheng
. Regards, Shuai From: Kelly, Jonathan [mailto:jonat...@amazon.com] Sent: Monday, March 16, 2015 2:54 PM To: Shuai Zheng; user@spark.apache.org Subject: Re: sqlContext.parquetFile doesn't work with s3n in version 1.3.0 See https://issues.apache.org/jira/browse/SPARK-6351 ~ Jonathan

Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
Hi All, I am trying to run a sort on an r3.2xlarge instance on AWS. I just run it as a single-node cluster for testing. The data I sort is around 4GB and sits on S3; the output will also be on S3. I just connect spark-shell to the local cluster and run the code in the script (because I just

RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-13 Thread Shuai Zheng
? Regards, Shuai From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Friday, March 13, 2015 6:51 PM To: user@spark.apache.org Subject: Spark will process _temporary folder on S3 is very slow and always cause failure Hi All, I try to run a sorting on a r3.2xlarge instance on AWS

jar conflict with Spark default packaging

2015-03-13 Thread Shuai Zheng
Hi All, I am running Spark to work with AWS, and the latest AWS SDK version works with httpclient 3.4+. But the spark-assembly-*-.jar file packages an old httpclient version, which causes: ClassNotFoundException for org/apache/http/client/methods/HttpPatch. Even when I put the

How to pass parameter to spark-shell when choose client mode --master yarn-client

2015-03-10 Thread Shuai Zheng
Hi All, I try to pass parameters to spark-shell when I do some tests: spark-shell --driver-memory 512M --executor-memory 4G --master spark://:7077 --conf spark.sql.parquet.compression.codec=snappy --conf spark.sql.parquet.binaryAsString=true This works fine on my local PC. And

Read Parquet file from scala directly

2015-03-09 Thread Shuai Zheng
Hi All, I have a lot of parquet files, and I want to open them directly instead of loading them into an RDD in the driver (so I can optimize performance through special logic). But I have done some research online and can't find any example of accessing parquet directly from Scala. Has anyone done this

How to preserve/preset partition information when load time series data?

2015-03-09 Thread Shuai Zheng
Hi All, I have a set of time series data files in parquet format; the data for each day are stored using a naming convention, but I will not know how many files there are for one day: 20150101a.parq 20150101b.parq 20150102a.parq 20150102b.parq 20150102c.parq . 201501010a.parq .

Process time series RDD after sortByKey

2015-03-09 Thread Shuai Zheng
Hi All, I am processing some time series data. One day might have 500GB, so each hour is around 20GB of data. I need to sort the data before I start processing. Assume I can sort them successfully with dayRDD.sortByKey, but after that I might have thousands of partitions (to
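A hedged sketch, assuming (timestamp, value) pairs: sortByKey uses a range partitioner, so requesting roughly one partition per hour yields contiguous time ranges that can then be processed as units with mapPartitions. The toy data below stands in for the real dayRDD.

    // one record per second for a day, keyed by second-of-day
    val dayRDD  = sc.parallelize((0 until 24 * 3600).map(sec => (sec, s"event-$sec")))
    val byHour  = dayRDD.sortByKey(ascending = true, numPartitions = 24)
    val perHour = byHour.mapPartitions(iter => Iterator(iter.size))  // process each ~hour-sized partition as a whole
    println(perHour.collect().toSeq)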

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-24 Thread Shuai Zheng
, 2015 6:00 PM To: Shuai Zheng Cc: Shao, Saisai; user@spark.apache.org Subject: Re: Union and reduceByKey will trigger shuffle even same partition? I think you're getting tripped up by lazy evaluation and the way stage boundaries work (admittedly it's pretty confusing in this case). It is true

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
enough to put in memory), how can I access the other RDD's local partition in the mapPartitions method? Is there any way to do this in Spark? From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent: Monday, February 23, 2015 3:13 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: RE: Union and reduceByKey

Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
Hi All, I am running a simple PageRank program, but it is slow, and I have dug out part of the reason: a shuffle happens when I call a union even though both RDDs share the same partitioner. Below is my test code in spark-shell: import org.apache.spark.HashPartitioner
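A hedged sketch of how the extra shuffle can be avoided in the 1.x RDD API, assuming both inputs are pre-partitioned with the same partitioner; under that assumption sc.union can produce a partitioner-aware union, and reduceByKey given that same partitioner runs without an additional shuffle stage.

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(16)
    val a = sc.parallelize(Seq((1, 1.0), (2, 1.0))).partitionBy(part)
    val b = sc.parallelize(Seq((1, 2.0), (3, 2.0))).partitionBy(part)

    val merged = sc.union(Seq(a, b)).reduceByKey(part, _ + _)
    println(merged.partitioner)  // expect Some(part) under the assumptions above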

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
: Monday, February 23, 2015 3:13 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: RE: Union and reduceByKey will trigger shuffle even same partition? If you call reduceByKey(), internally Spark will introduce a shuffle operation, no matter whether the data is already partitioned locally; Spark

RE: How to design a long live spark application

2015-02-06 Thread Shuai Zheng
on it? How does Spark control/maintain/detect the liveness of the client SparkContext? Do I need to set up something special? Regards, Shuai From: Eugen Cepoi [mailto:cepoi.eu...@gmail.com] Sent: Thursday, February 05, 2015 5:39 PM To: Shuai Zheng Cc: Corey Nolet; Charles Feduke; user

How to design a long live spark application

2015-02-05 Thread Shuai Zheng
Hi All, I want to develop a server-side application: user submits a request -> server runs a Spark application and returns (this might take a few seconds). So I want the server to keep a long-lived context; I don't know whether this is reasonable or not. Basically I try to have a

Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Shuai Zheng
Hi All, It might sound weird, but I think Spark is perfect to use as a multi-threading library in some cases. Local mode will naturally spin up multiple threads when required. Because it is more restrictive, there is less chance of potential bugs in the code (because it is more data-oriented,
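A minimal sketch of that usage, assuming the goal is an in-process parallel engine with the web UI switched off (the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[*]")               // one worker thread per local core
      .setAppName("local-parallel")
      .set("spark.ui.enabled", "false")    // no web UI needed for library-style use
    val sc = new SparkContext(conf)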

RE: How to design a long live spark application

2015-02-05 Thread Shuai Zheng
), so it creates two different DAGs for the different methods. Can anyone confirm this? This is not something I can easily test with code. Thanks! Regards, Shuai From: Corey Nolet [mailto:cjno...@gmail.com] Sent: Thursday, February 05, 2015 11:55 AM To: Charles Feduke Cc: Shuai Zheng; user

RE: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Shuai Zheng
Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, February 05, 2015 10:53 AM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Use Spark as multi-threading library and deprecate web UI Do you mean disable the web UI? spark.ui.enabled=false Sure, it's useful with master

Re: Executor vs Mapper in Hadoop

2015-01-16 Thread Shuai Zheng
have 1 executor per machine per application. There are cases where you would make more than 1, but these are unusual. On Thu, Jan 15, 2015 at 8:16 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I try to clarify some behavior in the spark for executor. Because I am from Hadoop

RE: Determine number of running executors

2015-01-16 Thread Shuai Zheng
Hi Tobias, Can you share more information about how you do that? I also have a similar question about this. Thanks a lot, Regards, Shuai From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Wednesday, November 26, 2014 12:25 AM To: Sandy Ryza Cc: Yanbo Liang; user Subject:

Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
Hi All, I am testing Spark on an EMR cluster. The env is a one-node r3.8xlarge cluster with 32 vCores and 244G memory. But the command line I use to start up spark-shell doesn't work. For example: ~/spark/bin/spark-shell --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar

Executor vs Mapper in Hadoop

2015-01-15 Thread Shuai Zheng
Hi All, I am trying to clarify some executor behavior in Spark. Because I come from a Hadoop background, I compare it to the Mapper (or Reducer) in Hadoop. 1, Each node can have multiple executors, each running in its own process? This is the same as a mapper process. 2, I thought the

RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
behavior - because an executor runs for the whole lifecycle of the application? Although this may not have any real value in practice :) But I still need help with my first question. Thanks a lot. Regards, Shuai From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Thursday, January 15

RE: Why always spilling to disk and how to improve it?

2015-01-14 Thread Shuai Zheng
Thanks a lot! I just realized that Spark is not really an in-memory version of MapReduce :) From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, January 13, 2015 3:53 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Why always spilling to disk and how to improve

Why always spilling to disk and how to improve it?

2015-01-13 Thread Shuai Zheng
Hi All, I am experimenting with a small data set. It is only 200M, and all I am doing is a distinct count on it. But there is a lot of spilling in the log (attached at the end of the email). Basically I use 10G memory, running on a one-node EMR cluster with an r3.8xlarge instance

Re: S3 files , Spark job hungsup

2014-12-22 Thread Shuai Zheng
Is it possible that too many connections are open to read from S3 from one node? I had this issue before because I opened a few hundred files on S3 to read from one node. It just blocked itself without error until timing out later. On Monday, December 22, 2014, durga durgak...@gmail.com wrote: Hi All, I

Network file input cannot be recognized?

2014-12-21 Thread Shuai Zheng
Hi, I am running code which takes a network file (not HDFS) location as input. But sc.textFile(networklocation\\README.md) can't recognize the network location starting with \\ as a valid location, because it only accepts HDFS- and local-style file name formats? Does anyone have an idea how I can use a

Find the file info of when load the data into RDD

2014-12-21 Thread Shuai Zheng
Hi All, When I load a folder into an RDD, is there any way for me to find the input file name of a particular partition, so I can track which file each partition came from? In Hadoop, I can find this information through the code: FileSplit fileSplit = (FileSplit) context.getInputSplit(); String
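A hedged Scala sketch of one way to get at the same information in Spark 1.x (the path is illustrative; mapPartitionsWithInputSplit is a DeveloperApi on HadoopRDD): drop down to the underlying HadoopRDD so each partition can read its own InputSplit.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    val hadoopRdd = sc.hadoopFile("hdfs:///data/input",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    val withFileName = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
      val file = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (file, line.toString) }  // tag each record with its source file
    }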

Re: Find the file info of when load the data into RDD

2014-12-21 Thread Shuai Zheng
PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, When I try to load a folder into the RDDs, any way for me to find the input file name of particular partitions? So I can track partitions from which file. In the hadoop, I can find this information through the code: FileSplit fileSplit

Any potentiail issue if I create a SparkContext in executor

2014-12-19 Thread Shuai Zheng
Hi All, I notice that if we create a SparkContext in the driver, we need to call the stop method to clear it. SparkConf sparkConf = new SparkConf().setAppName("FinancialEngineExecutor"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); . String

RE: Control default partition when load a RDD from HDFS

2014-12-18 Thread Shuai Zheng
] Sent: Wednesday, December 17, 2014 11:04 AM To: Shuai Zheng; 'Sun, Rui'; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Why is it not a good option to create an RDD per 200MB file and then apply the pre-calculations before merging them? I think

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Shuai Zheng
Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split

Control default partition when load a RDD from HDFS

2014-12-16 Thread Shuai Zheng
Hi All, My application loads 1000 files, each from 200M to a few GB, and combines them with other data to do calculations. Some pre-calculation must be done at the file level; after that, the results need to be combined for further calculation. In Hadoop, it is simple because I can

RE: Any Replicated RDD in Spark?

2014-11-06 Thread Shuai Zheng
-Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Wednesday, November 05, 2014 6:27 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Any Replicated RDD in Spark? If you start with an RDD, you do have to collect to the driver and broadcast to do

RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
(in theory, either way works, but in the real world, which one is better?). Regards, Shuai -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Monday, November 03, 2014 4:15 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Any Replicated RDD in Spark? You

RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
! -Original Message- From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, November 05, 2014 3:32 PM To: 'Matei Zaharia' Cc: 'user@spark.apache.org' Subject: RE: Any Replicated RDD in Spark? Nice. Then I have another question, if I have a file (or a set of files: part-0, part-1

Any Replicated RDD in Spark?

2014-11-03 Thread Shuai Zheng
Hi All, I have spent the last two years on Hadoop but am new to Spark. I am planning to move one of my existing systems to Spark to get some enhanced features. My question is: if I try to do a map-side join (something similar to the Replicated keyword in Pig), how can I do it? Is there any way to declare a
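A hedged sketch of the usual Spark equivalent of Pig's replicated join, assuming the small side fits in memory (paths and the CSV layout are illustrative): collect the small data set to the driver, broadcast it, and join map-side inside a transformation.

    val small = sc.textFile("hdfs:///data/small.csv")
      .map(_.split(","))
      .map(f => f(0) -> f(1))
      .collectAsMap()
    val smallBc = sc.broadcast(small)

    val joined = sc.textFile("hdfs:///data/big.csv")
      .map(_.split(","))
      .flatMap { f =>
        smallBc.value.get(f(0)).map(v => (f(0), (f(1), v)))  // keep only matching keys
      }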

Memory limitation on EMR Node?

2014-11-03 Thread Shuai Zheng
Hi, I am planning to run Spark on EMR, and my application might take a lot of memory. On EMR, I know there is a hard limit of 16G physical memory on an individual mapper/reducer (otherwise I will get an exception; this is confirmed by the AWS EMR team, at least it is the spec at this moment).