Hello,
I have been trying to play with the Google ngram dataset provided by
Amazon in the form of LZO-compressed files.
I am having trouble understanding what is going on ;). I have added the
LZO compression jar and native library to the underlying Hadoop/HDFS
installation and restarted the namenode and the datanodes. Spark can
obviously see the file, but I get gibberish on a read. Any ideas?
See output below:
14/07/13 14:39:19 INFO SparkContext: Added JAR
file:/home/ec2-user/hadoop/lib/hadoop-gpl-compression-0.1.0.jar at
http://10.10.0.100:40100/jars/hadoop-gpl-compression-0.1.0.jar with
timestamp 1405262359777
14/07/13 14:39:20 INFO SparkILoop: Created spark context..
Spark context available as sc.
scala> val f = sc.textFile("hdfs://10.10.0.98:54310/data/1gram.lzo")
14/07/13 14:39:34 INFO MemoryStore: ensureFreeSpace(163793) called with
curMem=0, maxMem=311387750
14/07/13 14:39:34 INFO MemoryStore: Block broadcast_0 stored as values
to memory (estimated size 160.0 KB, free 296.8 MB)
f: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:12
scala> f.take(10)
14/07/13 14:39:43 INFO SparkContext: Job finished: take at <console>:15,
took 0.419708348 s
res0: Array[String] =
Array(SEQ?!org.apache.hadoop.io.LongWritable?org.apache.hadoop.io.Text??#com.hadoop.compression.lzo.LzoCodec????���\<N�#^�??d^�k�������\<N�#^�??d^�k��3��??�3???�??????
?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????�?????�?�m??��??hx??????????�??�???�??�??�??�??�??�?
�?, �? �? �?, �??�??�??�??�??�??�??�??�??�??�??�??�??�??�? �? �? �? �?
�?!�?"�?#�?$�?%�?&�?'�?(�?)�?*�?+�?,�?-�?.�?/�?0�?1�?2�?3�?4�?5�?6�?7�?8�?9�?:�?;�?<�?=�?>�??�?@�?A�?B�?C�?D�?E�?F�?G�?H�?I�?J�?K�?L�?M�?N�?O�?P�?Q�?R�?S�?T�?U�?V�?W�?X�?Y�?Z�?[�?\�?]�?^�?_�?`�?a�?b�?c�?d�?e�?f�?g�?h�?i�?j�?k�?l�?m�?n�?o�?p�?q�?r�?s�?t�?u�?v�?w�?x�?y�?z�?{�?|�?}�?~�?�?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?��?...
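One thing I notice: the output starts with a SEQ header naming
org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text and
com.hadoop.compression.lzo.LzoCodec, which suggests the file is actually
a Hadoop SequenceFile (LZO-compressed) rather than a plain LZO text
file, so textFile would just be dumping the raw container bytes. If
that's right, I'm guessing it would need to be read as a SequenceFile
instead, something like this (untested sketch, same path as above):

scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val seq = sc.sequenceFile("hdfs://10.10.0.98:54310/data/1gram.lzo",
     |   classOf[LongWritable], classOf[Text])
scala> seq.map(_._2.toString).take(10)

Does that sound plausible, or is there more to configuring the codec for
SequenceFiles?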
Thanks!
Ognen