Re: Equally split an RDD partition into two partitions at the same node

2017-01-16 Thread Fei Hu
, 2017 at 2:07 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote: > Usually this kind of thing can be done at a lower level in the InputFormat > usually by specifying the max split size. Have you looked into that > possibility with your InputFormat? > > On Sun, Jan 15, 2017 at
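For reference, capping the split size is usually done through the Hadoop configuration that backs the InputFormat, so each original block yields more (smaller) splits. A minimal sketch, assuming `sc` is an existing SparkContext; the 64 MB value is illustrative only:

```scala
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Cap the maximum split size so each block/partition is cut into roughly two splits.
// FileInputFormat.SPLIT_MAXSIZE is the "mapreduce.input.fileinputformat.split.maxsize" key.
sc.hadoopConfiguration.set(FileInputFormat.SPLIT_MAXSIZE, (64 * 1024 * 1024).toString)
```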

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
greater than the current partition, I don’t think > RDD number of partitions will be increased. > > > > Thanks, > > Jasbir > > > > *From:* Fei Hu [mailto:hufe...@gmail.com] > *Sent:* Sunday, January 15, 2017 10:10 PM > *To:* zouz...@cs.toronto.edu > *Cc:* user @sp

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
you could try to use the CoalescedRDD code to implement your > requirement. > > Good luck! > Cheers, > Anastasios > > > On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu <hufe...@gmail.com> wrote: > >> Hi Anastasios, >> >> Thanks for your reply. If I ju

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
of partitions before > writing to HDFS, but it might still be a narrow dependency (satisfying your > requirements) if you increase the # of partitions. > > Best, > Anastasios > > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufe...@gmail.com> wrote: > >> Dear all,

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
locality. Thanks, Fei On Sun, Jan 15, 2017 at 2:33 AM, Rishi Yadav <ri...@infoobjects.com> wrote: > Can you provide some more details: > 1. How many partitions does RDD have > 2. How big is the cluster > On Sat, Jan 14, 2017 at 3:59 PM Fei Hu <hufe...@gmail.com> wrote: >

Equally split an RDD partition into two partitions at the same node

2017-01-14 Thread Fei Hu
Dear all, I want to equally divide an RDD partition into two partitions. That means the first half of the elements in the partition will form a new partition, and the second half of the elements will form another new partition. But the two new partitions are required to be at the
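The message is truncated in the archive; the requirement is that both halves stay on the same node as the original partition. A minimal sketch of one way to express this as a custom RDD — the names SplitInHalfRDD and HalfPartition are illustrative, not from the thread:

```scala
import scala.reflect.ClassTag

import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Child partition that remembers which parent partition it was cut from.
class HalfPartition(override val index: Int, val parent: Partition) extends Partition

// Parent partition i yields child partitions 2*i and 2*i+1 (first and second half).
class SplitInHalfRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
    // Each child depends on exactly one parent partition, so no shuffle is needed.
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override def getPartitions: Array[Partition] =
    firstParent[T].partitions.flatMap { p =>
      Seq[Partition](new HalfPartition(2 * p.index, p),
                     new HalfPartition(2 * p.index + 1, p))
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    // Materialize the parent partition, then hand out its first or second half.
    val elems = firstParent[T].iterator(hp.parent, context).toArray
    val half = (elems.length + 1) / 2
    if (hp.index % 2 == 0) elems.take(half).iterator else elems.drop(half).iterator
  }

  // Keep both halves on the same node(s) as the parent partition.
  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[T].preferredLocations(split.asInstanceOf[HalfPartition].parent)
}
```

Because each child reads exactly one parent partition, the dependency stays narrow, and delegating getPreferredLocations to the parent preserves data locality.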

Re: RDD Location

2016-12-30 Thread Fei Hu
> You can’t call runJob inside getPreferredLocations(). > You can take a look at the source code of HadoopRDD to help you implement > getPreferredLocations() > appropriately. > > On Dec 31, 2016, at 09:48, Fei Hu <hufe...@gmail.com> wrote: > > That is a good idea. >

context.runJob() was suspended in getPreferredLocations() function

2016-12-30 Thread Fei Hu
Dear all, I tried to customize my own RDD. In the getPreferredLocations() function, I used the following code to query another RDD, which was used as an input to initialize this customized RDD: val results: Array[Array[DataChunkPartition]] = context.runJob(partitionsRDD,
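A commonly suggested workaround is to run the dependent job eagerly on the driver before constructing the custom RDD, rather than calling runJob() from inside getPreferredLocations() (which is invoked while the scheduler is already placing tasks). A hedged sketch: partitionsRDD and DataChunkPartition come from this message, while sc and MyCustomRDD are illustrative names:

```scala
// Run the dependent job on the driver first, outside any getPreferredLocations() call.
val chunkInfo: Array[Array[DataChunkPartition]] =
  sc.runJob(partitionsRDD, (iter: Iterator[DataChunkPartition]) => iter.toArray)

// Then hand the precomputed result to the custom RDD, whose getPreferredLocations()
// only reads this array instead of submitting a nested job.
val customRDD = new MyCustomRDD(sc, chunkInfo)
```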

Re: RDD Location

2016-12-30 Thread Fei Hu
class of RDD and override the > getPreferredLocations() to implement the logic of dynamic changing of the > locations. > > On Dec 30, 2016, at 12:06, Fei Hu <hufe...@gmail.com> wrote: > > > > Dear all, > > > > Is there any way to change the host location fo

RDD Location

2016-12-29 Thread Fei Hu
Dear all, Is there any way to change the host location for a certain partition of RDD? "protected def getPreferredLocations(split: Partition)" can be used to initialize the location, but how to change it after the initialization? Thanks, Fei
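A minimal sketch of the approach suggested in the replies (subclass RDD and make getPreferredLocations() dynamic); the class name and the mutable host map are illustrative, not part of the Spark API:

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Locations come from a mutable map held on the driver, so they can be changed
// between actions; getPreferredLocations() is consulted each time a stage is scheduled.
class RelocatableRDD[T: ClassTag](prev: RDD[T],
                                  val hosts: scala.collection.mutable.Map[Int, Seq[String]])
  extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)

  override def getPreferredLocations(split: Partition): Seq[String] =
    hosts.getOrElse(split.index, Nil)
}

// Driver-side usage: update the map before triggering the next action.
// relocatable.hosts(0) = Seq("worker-node-03")   // hostname is a placeholder
```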

Kryo on Zeppelin

2016-10-10 Thread Fei Hu
Hi All, I am running some Spark Scala code on Zeppelin on CDH 5.5.1 (Spark version 1.5.0). I customized the Spark interpreter to use org.apache.spark.serializer.KryoSerializer as spark.serializer. And in the dependencies I added Kryo 3.0.3 as follows: com.esotericsoftware:kryo:3.0.3 When I
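For plain Spark code (outside Zeppelin) the equivalent configuration looks roughly like the sketch below; in Zeppelin these become interpreter properties. Note that Spark already bundles Kryo through Twitter Chill, so an extra kryo artifact in the dependencies is usually unnecessary and can introduce version conflicts. The app name and registered classes are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: register application classes so Kryo writes compact class identifiers.
  .registerKryoClasses(Array(classOf[Array[Byte]]))  // replace with your own classes

val sc = new SparkContext(conf)
```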

Spark application Runtime Measurement

2016-07-09 Thread Fei Hu
Dear all, I have a question about how to measure the runtime of a Spark application. Here is an example: - On the Spark UI: the total duration time is 2.0 minutes = 120 seconds, as shown below [image: Screen Shot 2016-07-09 at 11.45.44 PM.png] - However, when I check the jobs launched
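One simple cross-check is to measure wall-clock time on the driver around the actions; a sketch, assuming `sc` is an existing SparkContext and the input path is a placeholder. Time spent on the driver before the first job and between jobs is not included in any single job's duration, which is one common reason the application's total uptime exceeds the sum of the job durations:

```scala
// Driver-side wall-clock measurement around an action.
val start = System.nanoTime()
val count = sc.textFile("hdfs:///path/to/input").count()
val elapsedSec = (System.nanoTime() - start) / 1e9
println(s"count = $count, wall-clock time = $elapsedSec s")
```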

Re: run spark job

2016-03-29 Thread Fei Hu
deploy/SparkSubmit.scala > > <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala> > > What's your purpose for using "java -cp", for local development, IDE should > be sufficient. > > > >

run spark job

2016-03-28 Thread Fei Hu
Hi, I am wondering how to run a Spark job with a plain java command, such as: java -cp spark.jar mainclass. When running/debugging the Spark program in IntelliJ IDEA, it uses a java command to run the Spark main class, so I think it should be possible to run the Spark job with a java command besides the
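A hedged sketch of such an invocation for local mode; the jar paths, Spark/Hadoop versions, and class name are placeholders. SparkConf reads spark.* Java system properties, so the master URL can be supplied with -D:

```
java -cp /path/to/spark-assembly-1.5.0-hadoop2.6.0.jar:/path/to/myapp.jar \
     -Dspark.master='local[*]' \
     com.example.MyMainClass
```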

Spark Serializer VS Hadoop Serializer

2016-03-11 Thread Fei Hu
Hi, I am trying to migrate a program from Hadoop to Spark, but I ran into a problem with serialization. In the Hadoop program, the key and value classes implement org.apache.hadoop.io.WritableComparable, which handles their serialization. Now in the Spark program, I used newAPIHadoopRDD to
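The message is truncated, but a common issue when moving Writable-based code to Spark is that the key/value objects produced by the Hadoop record reader are reused for every record and are not Java-serializable. A minimal sketch of the usual fix, converting (or copying) them immediately after loading; LongWritable/Text and the path are illustrative stand-ins for the program's own WritableComparable classes, and newAPIHadoopFile is the convenience wrapper around the same machinery as newAPIHadoopRDD:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Load with the new Hadoop API; keys and values are Writables at this point.
val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/input")

// Convert to plain Scala types (or deep copies) right away, before caching or shuffling,
// because the same Writable instances are reused for every record.
val usable = raw.map { case (k, v) => (k.get(), v.toString) }
```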