Read a TextFile (1 record contains 4 lines) into an RDD

2014-10-25 Thread Parthus
Hi, it might be a naive question, but I still hope somebody can help me with it. I have a text file in which every 4 lines represent one record. Since the SparkContext.textFile() API treats each line as a record, it does not fit my case. I know that SparkContext.hadoopFile or
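
One way to group consecutive lines without writing a custom InputFormat is to index every line and bucket the indices in blocks of four. This is a sketch of that approach, not the only option; the input path is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object FourLineRecords {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("four-line-records"))

        // Tag each line with its global index, then group lines whose
        // indices fall in the same block of four. Sorting inside each
        // group restores the original line order.
        val records = sc.textFile("hdfs:///data/input.txt") // hypothetical path
          .zipWithIndex()
          .groupBy { case (_, idx) => idx / 4 }
          .map { case (_, lines) => lines.toSeq.sortBy(_._2).map(_._1) }

        records.take(2).foreach(println)
        sc.stop()
      }
    }

Note that the groupBy incurs a full shuffle; for very large inputs, a custom Hadoop InputFormat that emits four lines per record avoids that cost.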

How to transform large local files into Parquet format and write them to HDFS?

2014-08-14 Thread Parthus
Hi there, I have several large files (500GB per file) to transform into Parquet format and write to HDFS. The problems I encountered can be described as follows: 1) At first, I tried to load all the records in a file and then used sc.parallelize(data) to generate an RDD, and finally used
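
Rather than collecting a whole file on the driver and calling sc.parallelize, the usual pattern is to read the file as a distributed dataset and write Parquet directly from the executors. A minimal sketch using the current DataFrame API (the 2014-era equivalent went through SchemaRDD and saveAsParquetFile); the paths and the tab-separated Record schema are assumptions:

    import org.apache.spark.sql.SparkSession

    object ToParquet {
      // Hypothetical record layout: tab-separated id and value fields.
      case class Record(id: Long, value: String)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("to-parquet").getOrCreate()
        import spark.implicits._

        // Read the input as a distributed Dataset so a 500GB file never
        // has to fit into the driver's memory.
        val records = spark.read.textFile("file:///data/huge.txt") // hypothetical path
          .map { line =>
            val Array(id, value) = line.split("\t", 2)
            Record(id.toLong, value)
          }

        records.write.parquet("hdfs:///warehouse/huge.parquet")   // hypothetical path
        spark.stop()
      }
    }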

Create a new object by given classtag

2014-08-04 Thread Parthus
Hi there, I was wondering if somebody could tell me how to create an object with a given ClassTag, so as to make the function below work. The only thing to do is to write one line that creates an object of class T. I tried new T, but it does not work. Would it be possible to give me one Scala line to
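
With a ClassTag in scope you can reach the runtime Class and call its no-argument constructor; new T alone fails because the type parameter is erased at runtime. A small sketch, assuming T has a public no-arg constructor:

    import scala.reflect.ClassTag

    object ClassTagFactory {
      // The ClassTag carries the runtime Class that erasure discards,
      // so we can instantiate T reflectively.
      def newInstance[T: ClassTag](): T =
        implicitly[ClassTag[T]].runtimeClass
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[T]

      def main(args: Array[String]): Unit = {
        val sb = newInstance[StringBuilder]() // StringBuilder has a no-arg constructor
        sb.append("hello")
        println(sb)
      }
    }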

What if there are large, read-only variables shared by all map functions?

2014-07-22 Thread Parthus
Hi there, I was wondering if anybody could help me find an efficient way to write a MapReduce program like this: 1) Each map function needs access to some huge files, which are around 6GB. 2) These files are READ-ONLY. Actually, they are like a huge look-up table, which will not change
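
For large read-only data shared by every task, Spark's broadcast variables are the standard answer: the value is shipped to each executor once instead of being re-serialized with every task closure. A toy sketch, where the tiny Map stands in for the real look-up table that would be loaded once on the driver:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastLookup {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup"))

        // Stand-in for the real ~6GB look-up table.
        val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)

        // One read-only copy per executor, not one per task.
        val bcLookup = sc.broadcast(lookup)

        val result = sc.parallelize(Seq("a", "b", "c"))
          .map(key => key -> bcLookup.value.getOrElse(key, -1))

        result.collect().foreach(println)
        sc.stop()
      }
    }

A 6GB broadcast must still fit in each executor's memory; if it does not, placing the files on every node and memory-mapping them from within the tasks is a common alternative.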