Re: Array in broadcast can't be serialized

2015-02-16 Thread Tao Xiao
/WrappedArraySerializer.scala Cheers On Sat, Feb 14, 2015 at 6:36 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I'm using Spark 1.1.0 and find that *ImmutableBytesWritable* can be serialized by Kryo but *Array[ImmutableBytesWritable]* can't be serialized even when I registered both of them in Kryo. The code

Array in broadcast can't be serialized

2015-02-15 Thread Tao Xiao
I'm using Spark 1.1.0 and find that *ImmutableBytesWritable* can be serialized by Kryo but *Array[ImmutableBytesWritable]* can't be serialized even when I registered both of them in Kryo. The code is as follows: val conf = new SparkConf() .setAppName("Hello Spark")
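The reply above points at a WrappedArraySerializer; independent of that, here is a minimal sketch of explicit Kryo registration under Spark 1.1's KryoRegistrator API. The registrator class name and its wiring are illustrative, not from the thread:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // To Kryo, Array[ImmutableBytesWritable] is a class of its own, so
    // registering the element class alone does not cover the array.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[ImmutableBytesWritable])
        kryo.register(classOf[Array[ImmutableBytesWritable]])
      }
    }

    object KryoSetup {
      val conf = new SparkConf()
        .setAppName("Hello Spark")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "MyRegistrator")
    }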

Can not write out data as snappy-compressed files

2014-12-08 Thread Tao Xiao
I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data as snappy-compressed files but encountered a problem. My code is as follows: val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt" val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/" val
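For reference, a minimal sketch of writing Snappy-compressed text output by passing the codec class straight to saveAsTextFile; it assumes the native Snappy libraries that CDH ships are present on the cluster, and reuses the paths from the post:

    import org.apache.hadoop.io.compress.SnappyCodec
    import org.apache.spark.{SparkConf, SparkContext}

    object SnappyWrite {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Snappy Write"))
        val input = sc.textFile("hdfs://ec2.hadoop.com:8020/xt/text/new.txt")
        // saveAsTextFile accepts an optional compression codec class.
        input.saveAsTextFile("hdfs://ec2.hadoop.com:8020/xt/compressedText/",
          classOf[SnappyCodec])
      }
    }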

A partitionBy problem

2014-11-18 Thread Tao Xiao
Hi all, I tested the *partitionBy* feature in a wordcount application, and I'm puzzled by a phenomenon. In this application, I created an RDD from some text files in HDFS (about 100GB in size), each of which has lines composed of words separated by the character '#'. I wanted to count the occurrence for
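A minimal word-count sketch with partitionBy, under the assumptions in the post (lines of '#'-separated words); the paths and partition count are placeholders. Because reduceByKey reuses the partitioner of its pre-partitioned input, the aggregation here adds no second shuffle:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionByWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val words = sc.textFile("hdfs:///path/to/input").flatMap(_.split("#"))
        val counts = words.map((_, 1))
          .partitionBy(new HashPartitioner(100)) // shuffle into 100 partitions
          .reduceByKey(_ + _)                    // runs within those partitions
        counts.saveAsTextFile("hdfs:///path/to/output")
      }
    }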

Re: How to kill a Spark job running in cluster mode?

2014-11-12 Thread Tao Xiao
a kill link. You can try using that. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Nov 11, 2014 at 7:28 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I'm using Spark 1.0.0 and I'd like to kill a job running in cluster

How to kill a Spark job running in cluster mode?

2014-11-11 Thread Tao Xiao
I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode, which means the driver is not running on the local node. So how can I kill such a job? Is there a command like "hadoop job -kill job-id", which kills a running MapReduce job? Thanks

Re: All executors run on just a few nodes

2014-10-20 Thread Tao Xiao
could just sleep a few seconds before running the job, or there are some related patches providing other ways to sync executor status before running applications, but I haven't tracked their status for a while. Raymond On Oct 20, 2014, at 11:22 AM, Tao Xiao xiaotao.cs@gmail.com wrote: Hi

All executors run on just a few nodes

2014-10-19 Thread Tao Xiao
Hi all, I have a Spark 0.9 cluster, which has 16 nodes. I wrote a Spark application to read data from an HBase table, which has 86 regions spread across 20 RegionServers. I submitted the Spark app in Spark standalone mode and found that there were 86 executors running on just 3 nodes and it

Re: ClassNotFoundException was thrown while trying to save RDD

2014-10-13 Thread Tao Xiao
...@sigmoidanalytics.com: Adding your application jar to the SparkContext will resolve this issue. E.g.: sparkContext.addJar("./target/scala-2.10/myTestApp_2.10-1.0.jar") Thanks Best Regards On Mon, Oct 13, 2014 at 8:42 AM, Tao Xiao xiaotao.cs@gmail.com wrote: In the beginning I tried to read HBase
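A minimal sketch of the suggested fix, with the jar path from the reply; addJar ships the application jar so executor JVMs can load the job's classes:

    import org.apache.spark.{SparkConf, SparkContext}

    object AddJarFix {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HBaseApp"))
        // Make the job's classes available on every executor.
        sc.addJar("./target/scala-2.10/myTestApp_2.10-1.0.jar")
        // ... the rest of the job as before
      }
    }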

ClassNotFoundException was thrown while trying to save RDD

2014-10-12 Thread Tao Xiao
Hi all, I'm using CDH 5.0.1 (Spark 0.9) and submitting a job in Spark Standalone Cluster mode. The job is quite simple as follows: object HBaseApp { def main(args: Array[String]) { testHBase("student", "/test/xt/saveRDD") } def testHBase(tableName: String, outFile: String) {
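A hedged reconstruction of the truncated program: only the object name, method signature and arguments appear in the post, so the body of testHBase below (reading the table and saving the row keys) is an assumption:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object HBaseApp {
      def main(args: Array[String]): Unit = {
        testHBase("student", "/test/xt/saveRDD")
      }

      def testHBase(tableName: String, outFile: String): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HBaseApp"))
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, tableName)
        val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        // Save the row keys as text; this save step is where the
        // ClassNotFoundException surfaced.
        rdd.map { case (k, _) => new String(k.get) }.saveAsTextFile(outFile)
      }
    }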

Re: ClassNotFoundException was thrown while trying to save RDD

2014-10-12 Thread Tao Xiao
did not change the object name and file name. 2014-10-13 0:00 GMT+08:00 Ted Yu yuzhih...@gmail.com: Your app is named scala.HBaseApp. Does it read / write to HBase? Just curious. On Sun, Oct 12, 2014 at 8:00 AM, Tao Xiao xiaotao.cs@gmail.com wrote: Hi all, I'm using CDH 5.0.1

Re: Reading from HBase is too slow

2014-10-08 Thread Tao Xiao
to submit my job can be seen in my second post. Please refer to that. 2014-10-08 13:44 GMT+08:00 Sean Owen so...@cloudera.com: How did you run your program? I don't see from your earlier post that you ever asked for more executors. On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao xiaotao.cs

Re: Reading from HBase is too slow

2014-10-08 Thread Tao Xiao
just read splits from HBase, so it would not make sense to determine it by something that the first line of the program does. On Oct 8, 2014 8:00 AM, Tao Xiao xiaotao.cs@gmail.com wrote: Hi Sean, Do I need to specify the number of executors when submitting the job? I suppose the number

Re: Reading from HBase is too slow

2014-10-07 Thread Tao Xiao
, Oct 1, 2014 at 8:17 AM, Tao Xiao xiaotao.cs@gmail.com wrote: I can submit a MapReduce job reading that table, although its processing rate is also a little slower than I expected, but not as slow as Spark.

Re: Reading from HBase is too slow

2014-10-01 Thread Tao Xiao
? This would show whether the slowdown is in HBase code or somewhere else. Cheers On Mon, Sep 29, 2014 at 11:40 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I checked HBase UI. Well, this table is not completely evenly spread across the nodes, but I think to some extent it can be seen

Re: Reading from HBase is too slow

2014-09-30 Thread Tao Xiao
cannot achieve a high level of parallelism unless you have 5-10 regions per RS at least. What does it mean? You probably have too few regions. You can verify that in the HBase Web UI. -Vladimir Rodionov On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I submitted a job

Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
I submitted a job in Yarn-Client mode, which simply reads from an HBase table containing tens of millions of records and then does a *count* action. The job runs for a much longer time than I expected, so I wonder whether it was because the data to read was too much. Actually, there are 20 nodes in

Re: Reading from HBase is too slow

2014-09-29 Thread Tao Xiao
() .setAppName("Reading HBase") val sc = new SparkContext(sparkConf) val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result]) println(rdd.count) } } 2014-09-30 10:21 GMT+08:00 Tao Xiao xiaotao.cs
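Pulling the advice in this thread into one hedged sketch: the read's parallelism equals the table's region count, so besides adding regions, the app-side knobs are HBase scanner caching and repartitioning downstream work. The table name and caching value below are illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadingHBase {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Reading HBase"))
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "some_table")
        // Fetch more rows per RPC round trip instead of the small default.
        hbaseConf.set("hbase.client.scanner.caching", "1000")
        val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        // rdd has one partition per region; repartition before any heavier
        // downstream stages if more parallelism is needed.
        println(rdd.count)
      }
    }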

How to sort rdd filled with existing data structures?

2014-09-24 Thread Tao Xiao
Hi, I have the following RDD: val conf = new SparkConf() .setAppName("Testing Sorting") val sc = new SparkContext(conf) val L = List( (new Student("XiaoTao", 80, 29), "I'm Xiaotao"), (new Student("CCC", 100, 24), "I'm CCC"), (new Student("Jack", 90, 25), "I'm Jack"),
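A self-contained sketch of one answer: sortByKey resolves an implicit Ordering for the key type, so the fix is to put one in scope for Student. The field names below are guesses from the constructor calls in the post:

    import org.apache.spark.{SparkConf, SparkContext}

    case class Student(name: String, score: Int, age: Int)

    object SortStudents {
      // sortByKey will pick this up from lexical scope.
      implicit val byScore: Ordering[Student] = Ordering.by(_.score)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Testing Sorting"))
        val L = List(
          (Student("XiaoTao", 80, 29), "I'm Xiaotao"),
          (Student("CCC", 100, 24), "I'm CCC"),
          (Student("Jack", 90, 25), "I'm Jack"))
        sc.parallelize(L).sortByKey().collect().foreach(println)
      }
    }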

Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
@ Tokyo On Mon, Sep 15, 2014 at 3:06 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I followed an example presented in the tutorial Learning Spark http://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html to compute the per-key average as follows: val Array(appName
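For reference, a minimal self-contained version of that per-key average, with the combiner types written out explicitly (the sample data is made up). One common source of a ClassCastException here is mismatched combiner types, which spelling the types out makes easy to spot:

    import org.apache.spark.{SparkConf, SparkContext}

    object PerKeyAverage {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PerKeyAverage"))
        val pairs = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
        val sumCount = pairs.combineByKey(
          (v: Double) => (v, 1),                             // createCombiner
          (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
          (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))
        sumCount.mapValues { case (sum, n) => sum / n }
          .collect().foreach(println)
      }
    }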

Re: What are the appropriate privileges needed for writing files into the checkpoint directory?

2014-09-03 Thread Tao Xiao
I found the answer: the checkpoint's file system should be a fault-tolerant one like HDFS, so we should set it to an HDFS path, not a local file system path. 2014-09-03 10:28 GMT+08:00 Tao Xiao xiaotao.cs@gmail.com: I tried to run KafkaWordCount in a Spark

What are the appropriate privileges needed for writing files into the checkpoint directory?

2014-09-02 Thread Tao Xiao
I tried to run KafkaWordCount in a Spark standalone cluster. In this application, the checkpoint directory was set as follows: val sparkConf = new SparkConf().setAppName("KafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) ssc.checkpoint("checkpoint") After
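A minimal sketch of the fix from the follow-up: point the checkpoint at a fault-tolerant file system via a full HDFS URI rather than a bare local path. The namenode address and the socket source are placeholders (the original used Kafka):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointToHdfs {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("KafkaWordCount")
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        // An HDFS path survives node failures; a bare "checkpoint" resolves
        // to a relative path on the default file system.
        ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoint")
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }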

What does appMasterRpcPort: -1 indicate?

2014-08-31 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run a Simple Apache Spark App in CDH 5, I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully submitted my job in local mode and Standalone mode; however, I noticed the following

Re: What does appMasterRpcPort: -1 indicate?

2014-08-31 Thread Tao Xiao
(appMasterRpcPort: 0). 2014-08-31 23:10 GMT+08:00 Yi Tian tianyi.asiai...@gmail.com: I think -1 means your application master has not been started yet. On Aug 31, 2014, at 23:02, Tao Xiao xiaotao.cs@gmail.com wrote: I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run

Fwd: What does appMasterRpcPort: -1 indicate?

2014-08-30 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following How-to: Run a Simple Apache Spark App in CDH 5 http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/, I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully

How can a deserialized Java object be stored on disk?

2014-08-30 Thread Tao Xiao
Reading about RDD Persistence https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence, I learned that the storage level MEMORY_AND_DISK means: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk,
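A small sketch of what that means in practice: with MEMORY_AND_DISK, the partitions that fit stay in memory as live JVM objects, while the ones that do not fit are serialized when written to disk and deserialized back into objects on read, so "deserialized" describes only the in-memory representation:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}

    object PersistDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Persist Demo"))
        val rdd = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
        rdd.persist(StorageLevel.MEMORY_AND_DISK)
        println(rdd.count) // first action computes and caches the partitions
        println(rdd.count) // second action reads from memory and/or disk
      }
    }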

How to provide a custom Comparator to sortByKey?

2014-02-28 Thread Tao Xiao
I am using Spark 0.9. I have an array of tuples, and I want to sort these tuples using the *sortByKey* API as follows in the Spark shell: val A: Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four")) val P = sc.parallelize(A) // MyComparator is an example, maybe I have
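A hedged sketch of one way to do this: from Spark 1.0 on, sortByKey resolves an implicit Ordering[K] at the call site, so a custom comparator is just an implicit Ordering (Spark 0.9, which the post uses, bounded the key type by Ordered instead). Here the string keys are compared numerically:

    import org.apache.spark.{SparkConf, SparkContext}

    object CustomSort {
      // Shadows the default lexicographic Ordering[String] in this scope.
      implicit val numericStringOrdering: Ordering[String] = Ordering.by(_.toInt)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Custom Sort"))
        val A: Array[(String, String)] = Array(("1", "One"), ("9", "Nine"),
          ("3", "three"), ("5", "five"), ("4", "four"))
        val P = sc.parallelize(A)
        P.sortByKey().collect().foreach(println) // keys in order 1, 3, 4, 5, 9
      }
    }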

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
the driver, write it to disk in that case. Scala Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao xiaotao.cs@gmail.com wrote: I am a newbie to Spark and I
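Since the question asked for partitioner examples, a minimal custom-partitioner sketch: extend Partitioner and implement numPartitions and getPartition. The "hot key" routing here is purely illustrative:

    import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

    // Routes one hot key to its own partition and hashes everything else.
    class HotKeyPartitioner(val partitions: Int, val hotKey: String)
        extends Partitioner {
      private val hash = new HashPartitioner(partitions - 1)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        if (key == hotKey) partitions - 1 else hash.getPartition(key)
      // equals/hashCode let Spark detect identically partitioned RDDs
      // and skip needless shuffles.
      override def equals(other: Any): Boolean = other match {
        case h: HotKeyPartitioner =>
          h.partitions == partitions && h.hotKey == hotKey
        case _ => false
      }
      override def hashCode: Int = 31 * partitions + hotKey.hashCode
    }

    object PartitionerDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PartitionerDemo"))
        val pairs = sc.parallelize(Seq(("the", 1), ("a", 2), ("spark", 3)))
        val parted = pairs.partitionBy(new HotKeyPartitioner(4, "the"))
        parted.mapPartitionsWithIndex((i, it) => it.map(kv => s"$i -> $kv"))
          .collect().foreach(println)
      }
    }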