Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Fei Hu
…, 2017 at 2:07 PM, Pradeep Gollakota wrote: Usually this kind of thing can be done at a lower level in the InputFormat, usually by specifying the max split size. Have you looked into that possibility with your InputFormat? On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu wrote: …
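
A minimal sketch of what Pradeep's suggestion might look like when the RDD is built from a Hadoop InputFormat: capping the maximum split size in the Hadoop Configuration makes the InputFormat produce more, smaller splits, and therefore more RDD partitions, before the RDD even exists. The input path and the 64 MB figure are placeholders, not values from the thread.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object MaxSplitSizeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("max-split-size").setIfMissing("spark.master", "local[2]"))

    // Cap each input split at 64 MB (size is illustrative). Smaller splits mean
    // more partitions, decided by the InputFormat rather than by a Spark shuffle.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",            // hypothetical input path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      hadoopConf)

    println(s"partitions: ${rdd.partitions.length}")
    sc.stop()
  }
}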

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
… don’t think the RDD number of partitions will be increased. Thanks, Jasbir. *From:* Fei Hu [mailto:hufe...@gmail.com] *Sent:* Sunday, January 15, 2017 10:10 PM *To:* zouz...@cs.toronto.edu *Cc:* user @spark ; d...@spark.apache.org …

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
… the CoalescedRDD code to implement your requirement. Good luck! Cheers, Anastasios. On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu wrote: Hi Anastasios, thanks for your reply. If I just increase the numPartitions to be twice larger …

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
… writing to HDFS, but it might still be a narrow dependency (satisfying your requirements) if you increase the # of partitions. Best, Anastasios. On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu wrote: Dear all, I want to equally divide an RDD partition …
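
One hedged way to verify Anastasios's point, i.e. whether a given transformation kept a narrow dependency or introduced a shuffle, is to inspect the lineage on a small sample; everything below is illustrative and not from the original thread.

import org.apache.spark.{SparkConf, SparkContext}

object DependencyCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dependency-check").setMaster("local[2]"))

    val base   = sc.parallelize(1 to 1000, 4).map(x => (x % 10, x))
    val narrow = base.mapValues(_ * 2)        // narrow: OneToOneDependency, no data movement
    val wide   = base.reduceByKey(_ + _)      // wide: ShuffleDependency, stage boundary

    // toDebugString prints the lineage; an extra indentation level marks a shuffle.
    println(wide.toDebugString)

    // The dependency classes can also be inspected directly.
    println(narrow.dependencies.map(_.getClass.getSimpleName).mkString(", "))
    println(wide.dependencies.map(_.getClass.getSimpleName).mkString(", "))

    sc.stop()
  }
}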

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
… locality. Thanks, Fei. On Sun, Jan 15, 2017 at 2:33 AM, Rishi Yadav wrote: Can you provide some more details: 1. How many partitions does the RDD have? 2. How big is the cluster? On Sat, Jan 14, 2017 at 3:59 PM, Fei Hu wrote: Dear all, I want to equally …

Equally split an RDD partition into two partitions at the same node

2017-01-14 Thread Fei Hu
Dear all, I want to equally divide an RDD partition into two partitions. That means the first half of the elements in the partition will create a new partition, and the second half of the elements will generate another new partition. But the two new partitions are required to be at the same node …
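
A rough sketch of the kind of custom RDD being asked about: each parent partition is split in two through a narrow dependency, and both halves reuse the parent partition's preferred locations so they stay on the same node. All class and field names here are illustrative and are not taken from the thread.

import scala.reflect.ClassTag

import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// One child partition covers one half of a single parent partition.
private class HalfPartition(override val index: Int, val parentIndex: Int, val firstHalf: Boolean)
  extends Partition

class SplitInTwoRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](parent.context, Nil) {

  // Narrow dependency: child partition i depends only on parent partition i / 2,
  // so no shuffle is introduced and locality can be preserved.
  override protected def getDependencies: Seq[Dependency[_]] =
    Seq(new NarrowDependency[T](parent) {
      override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
    })

  override protected def getPartitions: Array[Partition] =
    parent.partitions.indices.flatMap { p =>
      Seq(new HalfPartition(2 * p, p, firstHalf = true),
          new HalfPartition(2 * p + 1, p, firstHalf = false))
    }.toArray

  // Both halves report the parent partition's preferred hosts, so the scheduler
  // tries to run them on the same node as the original partition.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    parent.preferredLocations(parent.partitions(split.asInstanceOf[HalfPartition].parentIndex))

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    // Each half re-reads the whole parent partition and keeps only its half;
    // persisting the parent first avoids doing that work twice.
    val elems = parent.iterator(parent.partitions(hp.parentIndex), context).toArray
    val mid = elems.length / 2
    (if (hp.firstHalf) elems.take(mid) else elems.drop(mid)).iterator
  }
}

Wrapping an existing RDD as new SplitInTwoRDD(rdd) doubles the partition count without a shuffle; the double read of each parent partition is the cost, which is why the InputFormat-level split suggested elsewhere in the thread may be preferable.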

Re: RDD Location

2016-12-30 Thread Fei Hu
… Job inside getPreferredLocations(). You can take a look at the source code of HadoopRDD to help you implement getPreferredLocations() appropriately. On Dec 31, 2016, at 09:48, Fei Hu wrote: That is a good idea. I tried to add the following code to get …

context.runJob() was suspended in getPreferredLocations() function

2016-12-30 Thread Fei Hu
Dear all, I tried to customize my own RDD. In the getPreferredLocations() function, I used the following code to query another RDD, which was used as an input to initialize this customized RDD: val results: Array[Array[DataChunkPartition]] = context.runJob(partitionsRDD, (con…
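
getPreferredLocations() is invoked by the scheduler while a job is being submitted, so submitting another job from inside it is what makes runJob hang. One hedged workaround, assuming the goal is only to learn where the other RDD's partitions live, is to run that job eagerly on the driver before the custom RDD is constructed and make getPreferredLocations() a pure lookup. The names below are illustrative, and the recorded hostnames only reflect where the probe job happened to run.

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Locations are computed once, up front, and passed in; getPreferredLocations
// becomes a lookup instead of launching a nested job.
class LocatedRDD[T: ClassTag](parent: RDD[T], locations: Map[Int, Seq[String]])
  extends RDD[T](parent) {

  override protected def getPartitions: Array[Partition] = parent.partitions

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    locations.getOrElse(split.index, Nil)

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)
}

object LocatedRDD {
  // Build the location map on the driver, outside of any getPreferredLocations call.
  def apply[T: ClassTag](sc: SparkContext, parent: RDD[T]): LocatedRDD[T] = {
    val hosts: Array[String] = sc.runJob(
      parent,
      (ctx: TaskContext, _: Iterator[T]) => java.net.InetAddress.getLocalHost.getHostName)
    new LocatedRDD(parent, hosts.zipWithIndex.map { case (h, i) => i -> Seq(h) }.toMap)
  }
}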

Re: RDD Location

2016-12-30 Thread Fei Hu
… override the getPreferredLocations() to implement the logic of dynamically changing the locations. On Dec 30, 2016, at 12:06, Fei Hu wrote: Dear all, is there any way to change the host location for a certain partition of an RDD? …

RDD Location

2016-12-29 Thread Fei Hu
Dear all, is there any way to change the host location for a certain partition of an RDD? "protected def getPreferredLocations(split: Partition)" can be used to initialize the location, but how can it be changed after the initialization? Thanks, Fei
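
getPreferredLocations() is consulted each time a job using the RDD is scheduled, so one way to "change" locations after initialization, sketched here under that assumption and with illustrative names only, is to have the override read driver-side state that can be updated between jobs.

import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class RelocatableRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](parent) {

  // Driver-side, mutable preference table: partition index -> preferred hosts.
  private val preferred = mutable.Map.empty[Int, Seq[String]]

  // Called on the driver between jobs to move a partition's preference.
  def setLocation(partitionIndex: Int, hosts: Seq[String]): Unit =
    preferred(partitionIndex) = hosts

  override protected def getPartitions: Array[Partition] = parent.partitions

  // Evaluated by the scheduler when a job is submitted, so updates made via
  // setLocation() take effect for the next job, not for one already running.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    preferred.getOrElse(split.index, parent.preferredLocations(split))

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)
}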

Kryo on Zeppelin

2016-10-10 Thread Fei Hu
Hi all, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer, and in the dependencies I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3. When I wro…
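
For reference, the standard way to enable Kryo in Spark itself (outside Zeppelin's interpreter settings) looks roughly like this; the registered class is a placeholder. Note also that Spark 1.5 bundles an older Kryo 2.x via Twitter chill, so pulling Kryo 3.0.3 onto the interpreter classpath may cause version conflicts.

import org.apache.spark.{SparkConf, SparkContext}

object KryoExample {
  // Hypothetical application class that will travel through the serializer.
  case class DataChunk(id: Long, payload: Array[Byte])

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-example")
      .setIfMissing("spark.master", "local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional but recommended: register classes so Kryo does not have to
      // write full class names; required if spark.kryo.registrationRequired=true.
      .registerKryoClasses(Array(classOf[DataChunk]))

    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 10).map(i => DataChunk(i, Array.fill(8)(i.toByte)))
    println(rdd.count())
    sc.stop()
  }
}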

Spark application Runtime Measurement

2016-07-09 Thread Fei Hu
Dear all, I have a question about how to measure the runtime of a Spark application. Here is an example: - On the Spark UI, the total duration is 2.0 minutes = 120 seconds, as shown in the attached screenshot [image: Screen Shot 2016-07-09 at 11.45.44 PM.png]. - However, when I check the jobs launched by …
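
One way to get numbers that do not depend on how the UI aggregates things is to time the driver code around the actions and, separately, sum per-job durations with a SparkListener; the gap between the two is driver-side work and scheduling time that per-job figures miss. A hedged sketch, with an arbitrary workload:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

object RuntimeMeasurement {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("runtime-measurement").setIfMissing("spark.master", "local[2]"))

    // Sum the wall-clock time of every job, as seen by the scheduler.
    val jobStarts = scala.collection.mutable.Map.empty[Int, Long]
    var totalJobMillis = 0L
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobStarts(jobStart.jobId) = jobStart.time
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        jobStarts.remove(jobEnd.jobId).foreach(start => totalJobMillis += jobEnd.time - start)
    })

    // End-to-end time of the application logic, as seen from the driver.
    val t0 = System.nanoTime()
    val result = sc.parallelize(1 to 1000000).map(_ * 2).reduce(_ + _)
    val elapsedMillis = (System.nanoTime() - t0) / 1000000

    // Listener events are delivered asynchronously; stop the context first so the
    // remaining events are processed before the total is read.
    sc.stop()
    println(s"result=$result")
    println(s"driver wall clock: $elapsedMillis ms, sum of job durations: $totalJobMillis ms")
  }
}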

Re: run spark job

2016-03-29 Thread Fei Hu
Thanks for your link. It will be helpful for me. Best, Fei. On Mar 29, 2016, at 11:14 AM, Steve Loughran wrote: On 29 Mar 2016, at 14:30, Fei Hu wrote: Hi Jeff, thanks for your info! I …

Re: run spark job

2016-03-29 Thread Fei Hu
… <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala> What's your purpose for using "java -cp"? For local development, an IDE should be sufficient. On Tue, M…

run spark job

2016-03-28 Thread Fei Hu
Hi, I am wondering how to run the Spark job with the java command, such as: java -cp spark.jar mainclass. When running/debugging the Spark program in IntelliJ IDEA, it uses the java command to run the Spark main class, so I think it should be possible to run the Spark job with the java command besides spark-submit …
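
A hedged sketch of what a plain java -cp launch might look like for a local run: the Spark jars must be on the classpath, spark.master must be supplied by the program (or via -Dspark.master), and none of spark-submit's cluster-deployment handling is available. The class name and paths are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object MainClass {
  def main(args: Array[String]): Unit = {
    // When launched with plain `java`, nothing sets spark.master for us,
    // so default it here (spark-submit would normally pass it in).
    val conf = new SparkConf()
      .setAppName("java-cp-example")
      .setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}

// Hypothetical launch command (paths are placeholders):
//   java -cp "myapp.jar:/path/to/spark/jars/*" MainClass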

Spark Serializer VS Hadoop Serializer

2016-03-11 Thread Fei Hu
Hi, I am trying to migrate a program from Hadoop to Spark, but I ran into a problem with serialization. In the Hadoop program, the key and value classes implement org.apache.hadoop.io.WritableComparable, which handles the serialization. Now, in the Spark program, I use newAPIHadoopRDD to read …
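
A sketch of the usual pattern when porting this kind of job: read the Writables as the old MapReduce code produced them, then immediately map them to plain Scala/Java types, because Hadoop Writables are mutable, reused by the record reader, and not serializable in the way Spark's shuffle and caching expect. The input path and the key/value types below are placeholders.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object WritableMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("writable-migration")
        .setIfMissing("spark.master", "local[2]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

    // Keys and values come back as Hadoop Writables, exactly as the old
    // MapReduce job wrote them.
    val raw = sc.newAPIHadoopFile(
      "hdfs:///data/seqfile",                                   // hypothetical path
      classOf[SequenceFileInputFormat[LongWritable, Text]],
      classOf[LongWritable],
      classOf[Text])

    // Copy each record out of the (reused, non-serializable) Writable objects
    // before anything is cached, shuffled, or collected.
    val records = raw.map { case (k, v) => (k.get(), v.toString) }

    println(records.take(5).mkString("\n"))
    sc.stop()
  }
}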